Subgraph Matching with Set Similarity in a Large Graph Database

Size: px

Start display at page:

Download "Subgraph Matching with Set Similarity in a Large Graph Database"

Beryl Strickland
6 years ago
Views:

1 1 Sbgraph Matching with Set Similarity in a Large Graph Database Liang Hong, Lei Zo, Xiang Lian, Philip S. Y Abstract In real-world graphs sch as social networks, Semantic Web and biological networks, each vertex sally contains rich information, which can be modeled by a set of tokens or elements. In this paper, we stdy a sbgraph matching with set similarity (SMS 2 ) qery over a large graph database, which retrieves sbgraphs that are strctrally isomorphic to the qery graph, and meanwhile satisfy the condition of vertex pair matching with the (dynamic) weighted set similarity. To efficiently process the SMS 2 qery, this paper designs a novel lattice-based index for data graph, and lightweight signatres for both qery vertices and data vertices. Based on the index and signatres, we propose an efficient two-phase prning strategy inclding set similarity prning and strctre-based prning, which exploits the niqe featres of both (dynamic) weighted set similarity and graph topology. We also propose an efficient dominating-set-based sbgraph matching algorithm gided by a dominating set selection algorithm to achieve better qery performance. Extensive experiments on both real and synthetic datasets demonstrate that or method otperforms state-of-the-art methods by an order of magnitde. Index Terms sbgraph matching, set similarity, graph database, index 1 INTRODUCTION With the emergence of many real applications sch as social networks, Semantic Web, biological networks, and so on [1, 2, 3, 4, 5, 6], graph databases have been widely sed as important tools to model and qery complex graph data. Mch prior work has extensively stdied varios types of qeries over graphs, in which sbgraph matching [7, 8, 9, 10] is a fndamental graph qery type. Given a qery graph Q and a large graph G, a typical sbgraph matching qery retrieves those sbgraphs in G that exactly match with Q in terms of both graph strctre and vertex labels [7]. However, in some real graph applications, each vertex often contains a rich set of tokens or elements representing featres of the vertex, and the exact matching of vertex labels is sometimes not feasible or practical. Motivated by the observation above, in this paper, we focs on a variant of the sbgraph matching qery, called sbgraph matching with set similarity (SMS 2 ) qery, in which each vertex is associated with a set of elements with dynamic weights instead of a single label. The weights of elements are specified by sers in different qeries according to different application reqirements or evolving data. Specifically, given a qery graph Q with n vertices i (i = 1,..., n), the SMS 2 qery retrieves all the sbgraphs X with n vertices v j (j = 1,..., n) in a large graph G, sch that (1) the weighted set similarity between S( i ) and S(v j ) is larger than a ser specified similarity threshold, {Sbgraph Matching} Liang Hong is with the State Key Laboratory of Software Engineering and School of Information Management, Whan University, Whan, China, hong@wh.ed.cn. Lei Zo is with the ICST, Peking University, Beijing, China, {Protein Interaction {Protein Interaction zolei@pk.ed.cn. The corresponding athor of this paper is Lei Zo. Network, Search} Network, Search} Xiang Lian is with the Dept. of Compter Science, University of Texas (a) Qery Graph Q Pan American, TX, USA, lianx@tpa.ed. Philip S. Y is with the Dept. of Compter Science, University of Illinois, Chicago, Chicago, USA, and Institte for Data Science, Tsingha Fig. 1. University, Beijing, China, psy@ic.ed. where S( i ) and S(v j ) are sets associated with i and v j, respectively; (2) X is strctrally isomorphic to Q with i mapping to v j. We discss the following two examples to demonstrate the seflness of SMS 2 qeries. Example 1 (Finding papers in DBLP) The DBLP 1 compter science bibliography provides a citation graph G (Fig. 1(b)) in which vertices represent papers and edges represent citation relations between papers. Each paper contains a set of keywords, in which each keyword is assigned a weight to measre the importance of a keyword with regard to the paper. In reality, a researcher searches for similar papers from DBLP based on both citation relationships and the content similarity of papers [11]. For example, a researcher wants to find papers on sbgraph matching that are cited by both social network papers and papers on protein interaction network search. Frthermore, she/he reqires papers on protein interaction network search being cited by social network papers. Sch qery can be modeled as an SMS 2 qery, which obtains sbgraph matches of the qery graph Q (Fig. 1(a)) in G. Each paper (vertex) in Q and its matching paper in G shold have similar set of keywords, and each citation relation (edge) exactly follows the researcher s reqirements. {Social Network} 2 1 {Sbgraph Matching} 3 {Social Network} {Social Network, {Social Network, Sbgraph Matching} Sbgraph v 4 Matching} {Social Network, {Social Network, Search} v 2 Search} {Social v Network} 2 v 1 {Sbgraph Matching} {Sbgraph Matching} v 3 {Protein Interaction {Protein {Sbgraph Interaction Network, Search} Network, Matching} Search} v 1 v 5 v 6 v 3 {Social Network} (b) Citation Graph G of DBLP An Example of Finding Grops of Cited Papers in DBLP that Match with the Qery Citation Graph 1. ley/db/ v 4 v 5 v 6 {Sbgraph Matching}

2 2 Example 2 (Qerying DBpedia) DBPedia extracts entities and facts from Wikipedia and stores them in an RDF graph [12]. As shown in Fig. 2(b), in a DBPedia RDF graph G, each entity (i.e. vertex) has an attribte dbpedia-owl:abstract that provides a hmanreadable description (i.e. a set of words) of the entity, and each edge is a fact that indicates the relationship between entities. Typically, sers isse SPARQL qeries to find sbgraph matches of the qery graph (i.e., a graph of connected entities with known attribte vales) by specifying exact qery criteria. However, in reality, a ser may not know (or remember) the exact attribte vales or the RDF schema (sch as property names). For example, a ser wants to find two physicists who both won Nobel prizes and are related to Denmark from DBpedia, while he/she does not know the schema of DBpedia data. In this case, the ser can isse an SMS 2 qery Q, as shown in Fig. 2(a), in which each vertex is described by a short text. The answer to the SMS 2 qery Q is Niels Bohr and Aage Bohr, becase Danish Physicist who won Nobel Prize, and research the sbgraph on atomic match strctre is strctrally and qantm isomorphic mechanics. to Q and the text similarity (measred by the word set similarity) of the The Kingdom Nobel Prize in Physics, awarded by the matching of Denmark. vertex pairsroyal is high. Swedish Interestingly, Academy of Sciences. we find that Niels Bohr is the father of Aage Bohr. Danish nclear physicist and Nobel lareate. Danish Physicist who won Nobel Prize, and research on atomic strctre and qantm mechanics. The Kingdom of Denmark. Nobel Prize in Physics, awarded by the Royal Swedish Academy of Sciences. Danish nclear physicist and Nobel lareate. (a) Qery Graph Q withot the Schema of DBpedia Data Niels Bohr was a Danish physicist who made fondational contribtions to atomic strctre and qantm mechanics, for which he received the Nobel Prize in Physics in dbpedia owl:birthplace dcterms:sbject Niels Bohr Denmark was a Danish physicist Nobel_Prize_in_Physics who made fondational The contribtions Kingdom of Denmark to atomic is strctre The and Nobel qantm Prize mechanics, in Physics for is a constittional which he received monarchy the Nobel Prize awarded in Physics once in a year by the Royal and sovereign dbpedia owl:birthplace state. Swedish Academy dcterms:sbject of Sciences. dbpedia owl:birthplace dcterms:sbject Denmark Nobel_Prize_in_Physics The Kingdom of Denmark is The Nobel Prize in Physics is Aage a Niels constittional Bohr was monarchy a Danish nclear awarded physicist once a and year Nobel by the lareate, Royal and and the sovereign son of the state. famos physicist Swedish and Nobel Academy lareate of Sciences. Niels Bohr. dbpedia owl:birthplace (b) RDF Graph G of DBpediadcterms:sbject Fig. 2. Aage AnNiels Example Bohr was of a Danish Qerying nclear Sbgraph physicist and Matches Nobel lareate, from the RDF Graph and the son of DBpedia of the famos physicist and Nobel lareate Niels Bohr. The motivation examples show that SMS 2 qeries are very sefl in many real-world applications. To the best of or knowledge, no prior work stdied the sbgraph matching problem nder the semantic of strctral isomorphism and set similarity with dynamic element weights (called dynamic weighted set similarity). Traditional weighted set similarity [13] that focses on fixed element weight is actally a special case of dynamic weighted set similarity. De to different matching semantic on vertices (i.e., dynamic weighted set similarity rather than exact label matching), previos techniqes on exact or approximate sbgraph matching [7, 8, 9, 14, 15, 16] cannot be directly applied to answering SMS 2 qeries. It is challenging to tilize both dynamic weighted set similarity and strctral constraints to efficiently answer SMS 2 qeries. There are two straightforward methods that answer SMS 2 qeries by modifying existing algorithms. The first method condcts the sbgraph isomorphism sing existing sbgraph isomorphism algorithms (e.g. [14, 17]). Then, reslting candidate sbgraphs are refined by checking the weighted set similarity between each pair of matching vertices. The second method is in a reverse order, that is, first finding candidate vertices in the data graph that have similar sets to vertices in the qery graph by calclating weighted set similarity on-the-fly (as weights change dynamically), which is comptationally expensive, and then obtaining matching sbgraphs from the candidate vertices. However, these two methods sally incr very high qery cost, especially for a large graph database. This is becase the first method ignores the weighted set similarity constraints, whereas the second one ignores the strctral information when filtering candidate reslts. De to the inefficiency of existing methods, we propose an efficient SMS 2 qery processing approach in this paper. Or approach adopts a filter-and-refine framework, which exploits niqe featres of both graph topology and dynamic weighted set similarity. In the filtering phase, we bild a lattice-based index over freqent patterns of element sets of vertices in data graph G. Then, data vertices are encoded into signatres, and organized into signatre bckets. We design an efficient two-phase prning strategy based on the lattice-based index and signatre bckets to redce the SMS 2 search space. In the refinement phase, we propose a dominating set 2 (DS)-based sbgraph matching algorithm to find sbgraph matches with set similarity (defined in Definition 1). A dominating set selection method is proposed to select a cost-efficient dominating set of the qery graph. In smmary, we make the following contribtions: 1) We design a novel strategy to efficiently process SMS 2 qeries. An inverted pattern lattice based indexing and a strctral signatre-based locality sensitive hashing are first constrcted offline. Dring the online phase, a set of prning techniqes facilitated by the offline data strctres are introdced and integrated together to greatly redce the search space of SMS 2 qeries. 2) We propose set similarity prning techniqes (Section 4) that tilize a novel inverted pattern lattice over the element sets of data vertices to evalate dynamic weighted set similarity. It introdces an pper bond on the dynamic weighted similarity measre to apply the anti-monotone principle to achieve high prning power. 3) We propose strctre-based prning techniqes (Section 5) that explore a novel strctral-signatre-based data strctre, where the signatre is designed to captre the set and neighborhood information. An aggregate dominance principle is devised to gide the prning. 2. In graph theory, a dominating set for a graph Q = (V, E) is a sbset DS(Q) of V (Q) sch that every vertex of Q is either in DS(Q), or adjacent to some vertex in DS(Q).

3 3 4) Instead of directly qerying and verifying the candidates of all the vertices in the qery graph, we design an efficient algorithm (Section 6) to perform sbgraph matching based on the dominating set of qery graph. When filling p the remaining (non-dominating) vertices of the graph, a distance preservation principle is devised to prne candidate vertices that do not preserve the distance to dominating vertices. 5) Last bt not least, we demonstrate throgh extensive experiments that or approach can effectively and efficiently answer the SMS 2 qeries in a large graph database. 2 RELATED WORK Exact sbgraph matching qery reqires that all the vertices and edges are matched exactly. The Ullmann s sbgraph isomorphism method [17] and VF2 [14] algorithm do not tilize any index strctre, ths they are sally costly for large graphs. GADDI [18] is a strctre distance based sbgraph matching algorithm in a large graph. Zhao et al. [8] investigated the SPath algorithm, which tilizes shortest paths arond the vertex as basic index nits. Cheng et al. [2] proposed a new two-step R-join algorithm to efficiently find matching graph patterns from a large graph. Zo et al. [9] proposed a distance-based mlti-way join algorithm for answering pattern match qeries over a large graph. Shang et al. [19] proposed QickSI algorithm for sbgraph isomorphism optimized by choosing an search order based on some featres of graphs. SING [20] is a novel indexing system for sbgraph isomorphism in a large scale graph. GraphQL [21] is a qery langage for graph databases which spports graphs as the basic nit of information. Sn et al. [10] tilized graph exploration and parallel compting to process sbgraph matching qery on a billion node graph. Recently, an efficient and robst sbgraph isomorphism algorithm Trbo ISO [15] was proposed. RINQ [22] and GRAAL [23] are graph alignment algorithms for biological networks, which can be sed to solve isomorphism problems. To solve sbgraph isomorphism problems, graph alignment algorithms introdce additional cost as they shold first find candidate sbgraphs of similar size from the large data graph. In addition, existing exact sbgraph matching and graph alignment algorithms do not consider weighted set similarity on vertices, which will case high post-processing cost of set similarity comptation. Approximate sbgraph matching qery sally concerns the strctre information and allows some of the vertices or edges not being matched exactly. Closre-tree [16] is the first graph index that spports both sbgraph qeries and graph similarity qeries. SAGA [24] is an approximate sbgraph matching techniqe that finds sbgraphs in the database that are similar to the qery, allowing for node mismatches, node gaps, and graph strctral differences. Torqe [4] is a topology-free qerying tool of protein interaction network. Torqe does not reqire precise information on interaction pattern of the qery. TALE [7] is an indexbased method for approximate sbgraph matching which allows node mismatches, and node/edge insertions and deletions. Or work differs from the approximate sbgraph matching problems in that no node/edge mismatches are allowed, and the matching vertices shold have similar sets. Recently, several novel sbgraph similarity search problems have been investigated. Ma et. al [25] stdied the problem of graph simlation by enforcing dality and locality conditions on sbgraph matches. NeMa [26] focses on the sbgraph matching qeries that satisfy the following two conditions (1) many-to-one sbgraph matching with a cost fnction, and (2) label similarity of matching vertices. S 4 system [27] finds the sbgraphs with identical same strctre and semantically similar entities of qery sbgraph. SMS 2 qery differs from the above problems in that it considers both one-to-one strctral isomorphism and dynamic set similarity of matching vertices. Zo et al. [28] proposed a top-k sbgraph matching problem that considers the similarity between objects associated with two matching vertices. This work assmes that all vertex similarities are given, and does not exploit set similarity prning techniqes to optimize sbgraph matching performance. As for the weighted set similarity qery, Hadjieleftherio et al. [29] proposed varios index strctres and algorithms. Recently, a Heaviest First strategy [13] has been proposed for efficiently answering all-match weighted string similarity qery. However, dynamic element weights (i.e. qery dependent weights) in SMS 2 qeries make most of existing index strctres and qery processing techniqes for weighted set similarity inefficient, or even infeasible. The reason is that these methods rely on element canonicalization according to fixed weights, while elements with dynamic weights cannot be canonicalized in advance. 3 PRELIMINARIES 3.1 Problem Definition In this sbsection, we formally define or problem of sbgraph matching with set similarity (SMS 2 ) qery. Specifically, we consider a large graph G represented by V (G), E(G), where V (G) is a set of vertices, E(G) is a set of edges. Each vertex v V (G) is associated with a set, S(v), of elements. Qery graph Q is represented by V (Q), E(Q). The set of element domain is denoted by U, in which each element a has a weight W (a) to indicate the importance of a. Note that, weights can change dynamically in different qeries de to varying reqirements or evolving data in real applications (e.g. Example 1 and 2). Definition 1 (Sbgraph Match with Set Similarity): For a qery graph Q with n vertices ( 1,..., n ), a data graph G, and a ser-specified similarity threshold τ, a sbgraph match of Q is a sbgraph X of G containing n vertices (v 1,..., v n ) of V (G), iff the following conditions hold: 1) there is a mapping fnction f, for each i in V (Q) and v j V (G) (1 i n, 1 j n), it holds that f( i ) = v j ; 2) sim(s( i ), S(v j )) τ, where S( i ) and S(v j ) are the sets associated with i and v j, respectively, and sim(s( i ), S(v j )) otpts a set similarity score between S( i ) and S(v j ); 3) For any edge ( i, k ) E(Q), there is (f( i ), f( k )) E(G) (1 k n).

4 4 Since the sbgraph match with set similarity is independent of directed or ndirected edges, the proposed techniqes can be applied to both directed and ndirected graph. We define the SMS 2 qery as follows: Definition 2 (SMS 2 Qery): Given a qery graph Q and a data graph G, a sbgraph matching with set similarity (SMS 2 ) qery retrieves all sbgraph matches of Q in graph G nder the semantic of the set similarity. Note that, the choice of the similarity fnction sim(.,.) highly depends on the application domain. In this paper, we choose weighted Jaccard similarity, which is one of the most widely sed similarity measres. Definition 3 (Weighted Jaccard Similarity): Given element sets S() and S(v) of vertices and v, the weighted Jaccard similarity between S() and S(v) is: a S() S(v) sim(s(), S(v)) = W (a) a S() S(v) W (a) (1) where W (a) is the weight of element a, W (a) 0. In Definition 3, when W (a) = 1 for each element a, the weighted Jaccard similarity is exactly the classical Jaccard similarity [30]. As mentioned in [30], some other poplar similarity measres sch as cosine similarity, Hamming distance and overlap similarity can be converted to Jaccard similarity. Therefore, given another similarity measre (e.g. cosine similarity, Hamming distance and overlap similarity) and a threshold α, we can transform it into Jaccard similarity with a corresponding threshold τ. Then, we process the SMS 2 qery sing a constant lower bond to τ. Finally, we verify each candidate by the original similarity measre (detailed in Appendix A). In real applications, the weight of each element a can be specified by the qery isser or weighting methods sch as TF/IDF [13]. For instance, in Example 1, the weight of each keyword represents the correlation between the keyword and the paper, which is specified by researchers. In Example 2, each token in DBpedia can be assigned with a TF/IDF weight, which changes dynamically de to evolving data. 3.2 Framework In this sbsection, we present a filter-and-refine framework for or proposed approaches, which incldes offline processing and online processing, as shown in Fig. 3. Offline processing: We constrct a novel inverted pattern lattice to facilitate efficient prning based on the set similarity. Since the dynamic weight of each element makes existing indices inefficient for answering SMS 2 qeries, we need to design a novel index for SMS 2 qery. Motivated by the anti-monotone property of the lattice strctre (see details in Section 4), we mine freqent patterns (defined in Definition 5) from element sets of vertices in the data graph G, and organize them into a lattice. We store data vertices in the inverted list for each freqent pattern P, if P is contained in the element sets of these vertices. The lattice together with the inverted lists is called inverted pattern lattice, which can greatly redce the cost of dynamic weighted set similarity search. To spport strctre-based prning, we encode each qery vertex and data vertex into Offline Processing Online Processing Freqent Pattern Sets Inverted Pattern Lattice Qery Graph Q Cost-efficient Dominating Set Data Graph G Set Similarity Prning Vertical Prning Horizontal Prning Anti-monotone Prning Data Signatres Signatre Bckets Strctre-based Prning Qery Signatres Strctral Prning Refinement DS-Match Fig. 3. Framework for SMS 2 Qery Processing a qery signatre and a data signatre respectively by considering both the set and topology information, and hash all the data signatres into signatre bckets. Online processing: We propose finding a cost-efficient dominating set (defined in Section 6) of the qery graph Q, and only search candidates for vertices in the dominating set. Note that, different dominating sets will lead to different qery performances. Ths, we propose a dominating set selection algorithm to select a cost-efficient dominating set DS(Q) of qery graph Q. The dominating-set-based sbgraph matching is motivated by two observations: (1) finding candidates in SMS 2 qeries is mch more expensive than that in typical sbgraph search, becase set similarity calclation is more costly than vertex label matching. As a reslt, the filtering cost can be redced by searching dominating vertices of V (Q) rather than all qery vertices. (2) Some qery vertices may have a large amont of candidate vertices, which leads to many nnecessary intermediate reslts dring sbgraph matching. Therefore, the sbgraph matching cost can also be redced by decreasing the size of intermediate reslts. For each vertex DS(Q), we propose a twophase prning strategy, inclding set similarity prning and strctre-based prning. The set similarity prning incldes anti-monotone prning, vertical prning, and horizontal prning, which are based on or proposed inverted pattern lattice (Section 4). Based on the signatre bckets, we also propose the strctre-based prning techniqe by tilizing novel vertex signatres (Section 5). After the prning, we propose an efficient DS-Match sbgraph matching algorithm to obtain sbgraph matches of Q based on candidates of dominating vertices in DS(Q) (Section 6). DS-match tilizes topological relations between dominating vertices and non-dominating vertices to redce the scale of intermediate reslts dring sbgraph matching, and therefore redces the matching cost. 4 SET SIMILARITY PRUNING For a vertex in a dominating set of qery graph Q, we need to find its candidate vertices in graph G. Let s recall the definition of SMS 2 qery in Definition 1. If a vertex v in graph G match with, sim(s(), S(v)) > τ holds. This section concentrates on finding candidate vertices v

5 5 of sch that sim(s(), S(v)) > τ. How to select a costefficient dominating set will be introdced in Section 6.2. As discssed in Section 2, existing indices relying on element canonicalization are not sitable for SMS 2 qeries de to dynamic weights of elements. Nevertheless, we note that the inclsion relation between two sets does not change even if element weights vary dynamically. For two element sets S(v) and S(v ) of vertex v and v respectively, if S(v) S(v ), the relationship of S(v) being a sbset of S(v ) is called inclsion relation. Based on inclsion relation, we derive the following pper bond. Definition 4: (AS Upper Bond) Given a qery vertex s set S() and a data vertex v s set S(v), an Antimonotone Similarity (AS) pper bond is: UB(S(), S(v)) = a S() W (a) a S() S(v) (2) where W (a) denotes the weight assigned to element a, and sim(, ) is given by Eqation 1. Since a S() W (a) does not change once the qery is given, AS pper bond is anti-monotone with regard to S(v). That is, for any set S(v) S(v ), if UB(S(), S(v)) < τ, then UB(S(), S(v )) < τ. Apparently, the anti-monotone property of AS pper bond enables s to prne vertices based on inclsion relations. However, inclsion relations between element sets of data vertices are few. In contrast, since most of element sets contain freqent patterns, inclsion relations between element sets and freqent patterns are nmeros. Motivated by this observation, we mine freqent patterns from element sets of all the data vertices, and design a novel index strctre named inverted pattern lattice to organize freqent patterns. The lattice-based index enables efficient anti-monotone prning, and therefore is sitable for set similarity search with dynamic weights. Here, we recall the definition of the freqent pattern [31]. Definition 5: (Freqent Pattern) Let U be the set of distinct elements in V (G). A pattern P is a set of elements in U, i.e., P U. If an element set S(v) contains all the elements of a pattern P, then we say S(v) spports P and S(v) is a spporting element set of P. The spport of P, denoted by spp(p ), is the nmber of element sets that spport P. If spp(p ) is larger than a ser-specified threshold minsp, then P is called a freqent pattern. The comptation of spport of pattern P (i.e., spp(p )) can be referred to [31]. 4.1 Inverted Pattern Lattice Constrction To bild an inverted pattern lattice, we first mine freqent patterns from element sets of all vertices in data graph G. Then, we organize these freqent patterns into a lattice. In the lattice, each intermediate node P is a freqent pattern, which is a sbset of all its descendant nodes. We denote the freqent pattern having k elements as a k-freqent pattern. To ensre the completeness of the indexing, the 1-freqent patterns (i.e. all elements in the niverse U) are indexed in the first (top) level of the lattice. For each k-freqent P 1 {a 1} P 5 {a 1, a 2} P 6 {a 2, a 3} P 10 {a 1, a 2, a 4} P 2 {a 2} P 7 {a 2, a 4} P 11 {a 1, a 2, a 3} P 3 {a 3} P 13 {a 1, a 2, a 3, a 4} P 8 {a 1, a 3} P 4 {a 4} P 9 {a 3, a 4} P 12 {a 2, a 3, a 4} v 1 v 5... v 6 Inverted Lists Fig. 4. Example of Inverted Pattern Lattice inclding a Lattice of Freqent Patterns and Inverted Lists pattern (k > 1) P in the lattice, we insert each vertex v s element set S(v) into the inverted list of P (denoted as L(P )), iff S(v) spports P. Note that, an element set S(v) sim(s(), S(v)) may spport mltiple freqent patterns in the lattice. We W (a) insert S(v) to the inverted lists of these freqent patterns, respectively. Example 3: Fig. 4 is an example of inverted pattern lattice bilt by data vertices in Fig. 1(b). The elements a 1, a 2, a 3 and a 4 correspond to keywords Sbgraph Matching, Protein Interaction Network, Social Network, and Stree P 1 Stree P 2 Search, respectively. Then, v 1 = {a 1 }, v 2 = {a 3, a 4 }, 0111 v 3 = {a 2, a 4 }, v 4 = {a 1, a 3 }, v 5 = {a 3 }, v 6 = {a 1 } Prning Techniqes 0101 v 3 v 2 v Anti-monotone Prning {v 1, v 2} {v 3, v 7} {v 10} Considering the anti-monotone property of AS pper bond v 6 v 5 v 12 and the characteristics of inverted lattice0100 pattern 0101, we have 0011 {v 5} {v 6} {v the following theorem. 9} TheoremFreqent 1: Given Pattern a qery Set Lattice vertex, for Signatre each Tree accessed of P 1 freqent pattern P in the inverted pattern lattice, if UB(S(), P ) < τ, all vertices in the inverted list L(P ) and L(P ) can be safely prned, where P is a descendant node of P in the lattice. Proof: For each element set S(v) in the inverted list of P, since P S(v), UB(S(), S(v)) < τ according to the anti-monotone property of AS pper bond. Similarly, for any descendant node P of P, since P P, UB(S(), P ) will be also less than τ. The theorem can be proved. Based on the theorem above, we can efficiently prne freqent patterns in the inverted pattern lattices regardless of dynamic weights of elements. Example 4: Considering the qery vertex 3 in Fig. 1(a), 3 s element set S( 3 ) = {a 2, a 4 }. Assme W (a 1 ) = 0.5, W (a 2 ) = 0.4, W (a 3 ) = 0.5, W (a 4 ) = 0.2, and the similarity threshold τ = 0.6. As shown in Fig. 5(a), since UB(S( 3 ), P 6 ) = 0.55 < 0.6, P 6 and all its descendant nodes in the lattice, i.e. the gray nodes, can be safely prned. In the same way, we can prne P 1, P 3 and all their descendent nodes. All vertices in the inverted lists of these patterns can be prned safely Vertical Prning Vertical prning is based on the prefix filtering principle [32]: if two canonicalized sets are similar, the prefixes of these two sets shold overlap with each other, as otherwise these two sets will not have enogh common elements. v 2 v 3... v 4... St

6 P10 {a1, a2, a4} P11 {a1, a2, a3} P12 {a2, a3, a4} P10 {a1, a2, a4} Prned Freqent Patterns P13 {a1, a2, a3, a4} 6 P1 P2 P1 P1 {a2} P3 P2 P2 {a2} {a3} {a2} P4 P3 P3 {a3} {a4} {a3} P4 P4 {a4} {a4} P1 P1 P1 P2 {a2} P2 P2 P3 {a2} {a3} {a2} P3 P3 P4 {a3} {a4} {a3} P4 P4 {a4} {a4} P1 P1 P1 P2 {a2} P2 P2 P3 {a2} {a3} P3 P4 {a3} {a4} P4 {a4} P1 P2 {a2} P3 {a3} P4 {a4} P5 {a1, a2} P6P5 P5 P7P6 P6 P8P7 P7 P9P8 P8 P9 P9 P5 {a2, {a1, a3} a2} a2} {a2, {a2, a4} a3} a3} {a1, {a2, a3} a4} {a3, {a1, a4} a3} {a3, {a1, a4} a2} P6 P5 P5 P7 P6 P6 P8 P7 P7 {a2, {a1, a3} a2} {a2, {a2, a4} a3} {a1, {a2, a3} a4} P9 P8 P8 P9 P9 {a3, {a1, a3} a4} {a3, a4} P5 {a1, a2} P5 P6 P5 P6 P6 P7 P7 P8 P8 P9 P9 P5 {a1, {a2, a2} a3} {a2, {a2, a3} a4} {a2, {a1, a4} a3} {a1, {a3, a3} a4} {a3, a4} {a1, a2} P6 {a2, a3} P7 {a2, a4} P8 {a1, a3} P9 {a3, a4} P10 {a1, a2, a4} P10 P11 {a1, {a1, a2, a2, a4} a2, a4} a3} P11 P12 {a1, {a2, a2, a3} a3, a4} P12 {a2, a3, a3, a4} P10 {a1, a2, a4} P10 P11 {a1, a2, a2, a4} { a1, a2, a3} { a1, a2, a3} P11 P12 P12 { a1, {a2, a3} a3, a4} {a2, a3, a4} P10 P10 P11 P11 P12 P12 {a1, a2, a4} {a1, a2, {a1, a4} a2, a3} {a1, a2, {a2, a3} a3, a4} {a2, a3, a4} P10 {a1, a2, a4} P11 {a1, a2, a3} P12 {a2, a3, a4} Prned Freqent P13 P13 Patterns {a1, a2, a3, a4} {a1, a2, a2, a3, a3, a4} Prned Freqent Prned Freqent Patterns P13 P13 {a1, a2, a3, a4} {a1, a2, a3, a4} P13 P13 {a1, a2, a3, {a1, a4} a2, a3, a4} P13 {a1, a2, a3, a4} (a) Anti-monotone Prning (b) Vertical Prning Fig. 5. An Example of Set Similarity Prning on Inverted Pattern Lattice We canonicalize all the elements in qery set S() in a P1 P1 P2 P2 P3 P3 P4 P4 descending P1 P2 order P3 {a2} of their P4 {a2} {a3} {a4} {a3} weights {a4} dring online processing. The first p elements in the canonicalized set S() is denoted P5 P5 P6 P6 P7 P7 P8 P8 P9 P9 as P5 thep6 {a1, p-prefix {a1, a2} a2} P7 {a2, {a2, a3} a3} ofp8 {a2, {a2, S(). a4} a4} P9 {a1, {a1, a3} We a3} {a3, {a3, find a4} a4} the maximm prefix length {a1, a2} {a2, a3} {a2, a4} {a1, a3} {a3, a4} p sch that if S() and S(v) have no overlap in p-prefix, P10 P10 P11 P11 P12 P12 P10 P11 P12 {a1, {a1, a4} a4} {a1, {a1, a2, a2, a3} a3} {a2, {a2, a3, a3, S(v) a4} a4} {a1, a2, a4} can {a1, bea2, safely a3} {a2, prned, a3, a4} becase they do not have enogh P13 overlap to meet P13 P13 the similarity threshold [32]. {a1, {a1, a2, a2, a3, a3, a4} a4} {a1, a2, a3, a4} To find p-prefix of S(), each time we remove the element with the largest weight from S(), we check whether the remaining set S () meets the similarity threshold with S(). We denote L 1 -norm of S() as S() 1 = a S() W (a). If S () 1 < τ S() 1, the removal stops. The vale of p is eqal to S () 1, where S () is the nmber of elements in S (). For any set S(v) that does not contain the elements in S() s p-prefix, we have sim(s(), S(v)) S () 1 S() 1 < τ, so S() and S(v) will not meet the set similarity threshold. Theorem 2: Given a qery set S() and a freqent pattern P in the lattice, if P is not a 1-freqent pattern (or its descendant) in S() s p-prefix, all vertices in the inverted list L(P ) can be safely prned. Proof: For each vertex v in L(P ), P S(v). According to prefix filtering principle and Theorem 1, all vertices in L(P ) can be prned. After we find the p-prefix of S(), we only need to access all the descendent nodes of the corresponding 1- freqent patterns in p-prefix in a vertical manner. Example 5: For the qery vertex 3 with S( 3 ) = {a 2, a 4 }, we can determine the p-prefix of S( 3 ) is {a 2 }. As shown in Fig. 5(b), we only need to access P 2 and its all descendent nodes in the lattice. All vertices in the inverted lists of other freqent patterns can be prned Horizontal Prning Intitively, a qery element set S() cannot be similar to a freqent pattern of a large size set or a freqent pattern of a very small size. The size of a freqent pattern P (denoted by P ) is the nmber of elements in P. In the inverted pattern lattice, each freqent pattern P is a sbset of data vertices (i.e., element sets) in P s inverted list. Sppose we can find the length pper bond for S() (denoted by LU()). If the size of P is larger than LU(), (i.e. the sizes of all data vertices in P s inverted list are larger than LU()) then P and its inverted list can be prned. De to dynamic element weights, we need to find S() s length interval on the fly. We find LU() by adding elements in (U S()) to S() in an increasing order of their (c) Horizontal Prning (d) Final Reslts weights. Each time an element is added, a new set S () is formed. We calclate the similarity vale between S() and S (). If sim(s(), S ()) τ holds, we contine to add elements to S (). Otherwise, the pper bond LU() eqals to S () 1. Note that, freqent patterns at the same level of the inverted pattern lattice have the same size, and the size of freqent patterns at Level k eqals to k. Ths, after obtaining the length pper bond LU() of S(), we can determine that the horizontal pper bond eqals to LU(). All freqent patterns nder Level LU() will be prned. Example 6: Considering the qery vertex 3 with S( 3 ) = {a 2, a 4 }, and U = { } a 1, a 2, a 3, a 4. To find S( 3 ) s length pper bond LU(), we add a 1 (or a 3 ) to S( 3 ) to form S ( 3 ). It is clear that sim(s( 3 ), S ( 3 )) < 0.6. Therefore, LU( 3 ) is 2. As shown in Fig. 5(c), all freqent patterns below Level 2 can be prned Ptting All Prning Techniqes Together In this sbsection, we apply all the set similarity prning techniqes and obtain candidates for a qery vertex. For each qery vertex, we first se vertical prning and horizontal prning to filter ot false positive patterns and jointly determine the nodes (i.e., freqent patterns) that shold be accessed in the inverted pattern lattice. Then, we traverse them in a breadth-first manner. For each accessed node P in the lattice, we check whether UB(S(), P ) is less than τ. If yes, P and its descendant nodes are prned safely. As shown in Fig. 5(d), the grey nodes can be prned after all prning techniqes above have been applied. Assme that P 1, P 2,..., and P k are the remaining nodes (i.e., white nodes) that cannot be prned, k the candidate set of is a sbset of L(P i ). Note that, i=1 all the child nodes of P i (1 i k), denoted by P N, have been prned based on AS pper bond. The candidate set C() of can be obtained by Eqation 3. k C() = L(P i ) L(P ) (3) i=1 P P N To increase the qery performance, the lattice of freqent patterns resides in the main memory, and the inverted lists of freqent patterns are on the disk. 4.3 Optimization for Inverted Pattern Lattice In this section, we propose two techniqes to optimize the qery time and space cost of the inverted pattern lattice.

7 Vertex Insertion Criterion For the element set, S(v), of a data vertex v, sppose S(v) spports freqent patterns P 1,..., and P n, as mentioned above, we insert v into inverted lists L(P 1 ),..., and L(P n ). However, it is not necessary to insert v to all the inverted lists based on or proposed anti-monotone prning. Sppose freqent patterns P and P are both spported by S(v), and P P. Then, for any qery set S(), we have UB(S(), P ) > UB(S(), P ). Therefore, if S(v) in P s inverted list cannot be prned by UB(S(), P ) (i.e., UB(S(), P ) > τ), it will definitely not be prned by UB(S(), P ), as UB(S(), P ) will also be larger than τ. In this case, we only need to insert vertex v into P s inverted list, instead of P s inverted list. In smmary, let {P 1,..., P n } be the set of freqent patterns spported by S(v), and sppose there are k (k n) paths formed by {P 1,..., P n }. A path is a seqence of connected nodes (i.e., freqent patterns) from higher level to lower level withot cycle in the inverted pattern lattice. We only insert v into the inverted lists of the freqent patterns that are spported by S(v) at the lowest level of each path. In this way, the space cost of inverted lists can be redced Freqent Pattern Selection Since the nmber of freqent patterns is large, if we index all the freqent patterns in the inverted pattern lattice, the lattice cannot be fitted to memory. As mentioned in Section 4.2.1, freqent patterns with more elements otpt tighter AS pper bond. Therefore, we shold select freqent patterns with nmeros elements, so that the qerying time can be redces by prning more patterns. In addition, we shold also garantee the representativeness of each freqent pattern. Motivated by the observations, we mine closed freqent patterns [31] from the sets of vertices. Definition 6: (Closed Freqent Pattern) A freqent pattern P is closed if there is no pattern P sch that P P and spp(p ) = spp(p ). Note that, the size of lattice is controlled by the spport threshold minsp. Given a memory size M, we need to determine the vale of minsp to ensre that the lattice fits into the main memory. Sppose the average memory M occpation of each pattern is Z, there are at most Z patterns are allowed to be resident in the main memory. In most of real applications, M Z is no less than the size of the set of 1-freqent patterns (i.e. patterns containing only one element) F 1 (denoted by F 1 ). We sort all the closed k-freqent patterns P (k 2) in a descending order of spp(p ), and select the top-( M Z F 1 ) closed k-freqent patterns. We bild the lattice by all the 1-freqent patterns and the selected closed k-freqent patterns (k 2). 5 STRUCTURE-BASED PRUNING A matching sbgraph shold not only have its vertices (element sets) similar to corresponding qery verices, bt also preserve the same strctre as Q. Ths, in this section, we design lightweight signatres for both qery vertices and data vertices to frther filter the candidates after set similarity prning by strctral constraints. 5.1 Strctral Signatres We define two distinct types of strctral signatre, namely qery signatre Sig() and data signatre Sig(v) for each qery vertex and data vertex v, respectively. To encode strctral information, Sig()/Sig(v) shold contain the element information of both /v and its srronding vertices. Since the qery graph is sally small, we generate accrate qery signatres by encoding each neighbor vertex separately. On the contrary, the data graph is mch larger than the qery graph, so the aggregation of neighbor vertices can save a lot of space. The prning cost can be also redced de to limited nmber of data signatres. Specifically, we first sort elements in element sets S() and S(v) according to a predefined order (e.g., alphabetic order). Based on the sorted sets, we encode the element set S() by a bit vector, denoted by BV (), for the former part of Sig(). In particlar, each position BV ()[i] in the vector corresponds to one element a i, where 1 i U and U is the total nmber of elements in the niverse U. If an element a j belongs to set S(), then in bit vector BV (), we have BV ()[j] = 1; otherwise (if a j / S()), BV ()[j] = 0 holds. Similarly, S(v) is also encoded sing the above techniqe. For the latter part of Sig() and Sig(v) (i.e., encoding srronding vertices), we propose two different encoding techniqes for Sig() and Sig(v), respectively. The difference is that, we encode every neighbor vertex separately in Sig(), bt aggregate all neighbor vertices in Sig(v). We formally define the qery signatre and the data signatre as follows: Definition 7 (Qery Signatre): Given a vertex with m adjacent neighbor vertices i ( i = 1,..., m) in a qery graph Q, the qery signatre Sig() of vertex is given by a set of bit vectors, that is, Sig() = {BV (), BV ( 1 ),..., BV ( m )}, where BV () and BV ( i ) are the bit vectors that encode elements in set S() and S( i ), respectively. Definition 8 (Data Signatre): Given a vertex v with n adjacent neighbor vertices v i (i = 1,..., n) in a data graph G, the data signatre, Sig(v), of vertex v is given by: Sig(v) = [BV (v), n i=1 BV (v i)], where is a bitwise OR operator, BV (v) is the bit vector associated with v, and n i=1 BV (v i) (denoted as BV (v)) is called a nion bit vector, which eqals to bitwise-or over all bit vectors of v s one-hop neighbors. 5.2 Signatre-based LSH To enable efficient prning based on strctral information, we se a set of Locality Sensitive Hashing (LSH) [33] hash fnctions to hash each data signatre Sig(v) into a signatre bcket, which is defined as follows. Definition 9: A signatre bcket is a slot of hash table that stores a set of data signatres with the same hash vale. Since the probability of collision (i.e the same hash vale) is mch higher for signatres that are similar to each other than for those that are dissimilar, the maximm hamming distance of data signatres in each bcket can be minimized. The choice of LSH hash fnctions can be referred to [33]. In addition, we store the bcket signatre

8 8 Sig(B), which is formed by ORing all data signatres in each bcket B. That is Sig(B) = [ BV (v t ), BV (v t )], where t = 1,..., n, and n is the nmber of signatres in B. 5.3 Strctral Prning Based on the SMS 2 qery definition (Definition 1), a data vertex v can be prned if the similarity between BV () and BV (v) is smaller than τ, or there does not exist BV (v j ) that satisfies the similarity constraint with BV ( i ), where v j (j = 1,..., n) and i (i = 1,..., m) are one-hop neighbors of v and, respectively. We define the similarity between BV () and BV (v) as follows, which is analogos to the weighted Jaccard similarity (Definition 3). Definition 10: Given bit vectors BV () and BV (v), the similarity between BV () and BV (v) is: a BV () BV (v) sim(bv (), BV (v)) = W (a) a BV () BV (v) W (a) (4) where is a bitwise AND operator and is a bitwise OR operator, a (BV () BV (v)) means the bit corresponding to element a is 1, W (a) is the assigned weight of a. For each BV ( i ), we need to determine whether there exists a BV (v j ) so that sim(bv ( i ), BV (v j )) τ holds. To this end, we estimate the nion similarity pper bond between BV ( i ) and BV (v), which is defined as follows. Definition 11: Union similarity pper bond between a bit vector BV ( i ) and a nion bit vector BV (v) is: UB a BV ( (BV ( i ), BV (v)) = W (a) i) BV (v) a BV ( W (a) (5) i) Based on Definitions 10 and 11, we have the following aggregation dominance principle. Theorem 3 (Aggregate Dominance Principle): Given a qery signatre Sig() and a data signatre Sig(v), if UB (BV ( i ), BV (v)) < τ, then for each one-hop neighbor v j of v, sim(bv ( i ), BV (v j )) < τ. Proof: sim(bv ( i ), BV (v j )) a BV ( = i) BV (v W (a) j) a BV ( i) BV (v W (a) j) a BV ( i) BV (v j) W (a) a BV ( i) W (a) a BV ( i) BV (v) W (a) a BV ( i) W (a) = UB (BV ( i ), BV (v)) < τ Example 7: Considering a one-hop neighbor 3 of the qery vertex 1 in Fig. 1(a), where BV ( 3 ) = 0101, and one-hop neighbors v 2 and v 4 of the data vertex v 5, where BV (v 2 ) = 0011 and BV (v 4 ) = Since BV (v) = BV (v 2 ) BV (v 4 ) = 1011, UB (BV ( 3 ), BV (v)) < 0.6 = τ. Based on Theorem 3, sim(bv ( 3 ), BV (v j )) < τ, where j = 2, 4. Therefore, v 5 is not a candidate vertex of 1 even thogh S( 1 ) = S(v 5 ). Based on aggregate dominance principle, we have the following Lemmas of Theorem 3. Lemma 1: Given a qery signatre Sig() and a bcket signatre Sig(B), assme bcket B contains n data signatres Sig(v t ) (t = 1,..., n), if UB (BV (), BV (v t )) < τ or there exists at least one neighboring vertex i (i = 1,..., m) of sch that UB (BV ( i ), BV (v t )) < τ, then all data signatres in bcket B can be prned. Lemma 2: Given a qery signatre Sig() and a data signatre Sig(v), if sim(bv (), BV (v)) < τ or there is at least one neighboring vertex i (i = 1,..., m) of sch that UB (BV ( i ), BV (v)) < τ, Sig(v) can be prned. Proofs of Lemmas 1 and 2 are provided in Appendix B. In smmary, strctral prning works as follows. We first prne the signatre bckets that do not contains candidates of qery vertices Then, we frther prne bckets as a whole based on Lemma 1. For the each candidate v in a bcket B that cannot be prned, we seqentially check the similarity constraints between Sig() and Sig(v), and prne Sig(v) based on Lemma 2. The aggregate dominance principle garantees that strctral prning will not prne legitimate candidates. Therefore, the reslts are exact. 6 DOMINATING-SET-BASED SUBGRAPH MATCHING In this section, we propose an efficient dominating-set(ds)- based sbgraph matching algorithm (denoted by DS-Match) facilitated by a dominating set selection method. 6.1 DS-Match Algorithm DS-Match algorithm first finds matches of a dominating qery graph Q D (defined in Definition 14) formed by the vertices in dominating set DS(Q), then verifies whether each match of Q D can be extended as a match of Q. DS- Match is motivated by two observations: First, compared with typical sbgraph matching over vertex-labeled graph, the overhead of finding candidates in SMS 2 qeries is relatively higher, as the comptation cost of set similarity is mch higher than that of label matching. We can save filtering cost by only finding candidate vertices for dominating vertices rather than all vertices in Q. Second, we can speed p sbgraph matching by only finding matches of dominating qery vertices. The candidates of remaining (non-dominating) qery vertices can be filled p by the strctral constraints between dominating vertices and nondominating vertices. In this way, the size of intermediate reslts dring sbgraph matching can be greatly redced. In the following, we formally define the dominating set. Definition 12: (Dominating Set) Let Q = (V, E) be a ndirected, simple graph withot loops, where V is the set of vertices and E is the set of edges. A set DS(Q) V is called a dominating set for Q if every vertex of Q is either in DS(Q), or adjacent to some vertex in DS(Q). Based on Definition 12, we have the following Theorem. Theorem 4: Assme that is a dominating vertex in Q s dominating set DS(Q). If DS(Q) 2. Then, there exists at least one vertex DS(Q) sch that Hop(, ) 3, where Hop(, ) is the minimal nmber of hops between two vertices. The dominating vertex is called a neighboring dominating vertex of. Proof: Assme there does not exist any vertex DS(Q) sch that Hop(, ) 3, then there can be at least three non-dominating vertex on the path between

9 0100 v v 5 v v v v 4 v 4 v i i i j j j (a) i j i i i i i i i i i i i i i i j j j j j j jj j j j j j j j (b) (c) (d) Fig. 6. Possible Topologies of Neighboring Dominating Vertices and any other vertex DS(Q). In this case, at least one non-dominating ab ab vertex ab ab ab ab cd is cd ab not cd ab adjacent cd to or, which contradicts withab ab cd cd cd Definition cdab cd ab cd cd ab 12. cd ab cd Hence, the theorem holds. cd ab cd Definition 13: Given a vertex in a graph Q, s onehop neighbor set (N 1 ()) and two-hop neighbor set (N 2 ()) are defined as follows: N 1 () = { there exists a length-1 path connecting and } ; and N 2 () = { there exists a length-2 path connecting and }. As shown in Fig. 6, there are for possible topologies of the shortest path between two neighboring dominating vertices i and j. Based on Theorem 4, we define the dominating qery graph as follows: Definition 14: (Dominating Qery Graph) Given a dominating set DS(Q), the dominating qery graph Q D is defined as V (Q D ), E(Q D ), and there is an edge ( i, j ) in E(Q D ) iff at least one of the following conditions holds: 1) i is adjacent to j in qery graph Q (Fig. 6(a)); 2) N 1 ( i ) N 1 ( j ) > 0 (Fig. 6(b)); 3) N 1 ( i ) N 2 ( j ) > 0 (Fig. 6(c)); 4) N 2 ( i ) N 1 ( j ) > 0 (Fig. 6(d)). where the conditions correspond to the for possible topologies in Fig. 6, respectively. In Condition 1), the edge weight (i.e. the path length between i and j ) is 1. In Condition 2), the edge weight is 2. In Conditions 3) and 4), the edge weights are both 3. To transform a qery graph Q to a dominating qery graph Q D, we first find a dominating set DS(Q) of Q. Then for each pair of vertices i, j in DS(Q), we determine whether there is an edge ( i, j ) between them and the weight of ( i, j ) according to Definition 14. Example 8: Fig. 7 shows an example of a dominating qery graph Q D of the qery graph Q in Fig. 1(a). The dominating set of Q contains 1 and 3. Note that, edge ( 1, 3 ) in Q D has two weights 1,2. The reason is that 1 is adjacent to 3 and N 1 ( 1 ) N 1 ( 3 ) > 0 in Q. To find matches of dominating qery graph, we propose the distance preservation principle. Theorem 5 (Distance Preservation Principle): Given a sbgraph match X D of Q D in data graph G, Q D and X D have n vertices 1,..., n and v 1,..., v n, respectively, where v i C( i ). Considering an edge ( i, j ) in Q D, then all the following distance preservation principles hold: 1) if the edge weight is 1, then v i is adjacent to v j ; 2) if the edge weight is 2, then N 1 (v i ) N 1 (v j ) > 0; 3) if the edge weight is 3, then N 1 (v i ) N 2 (v j ) > 0 or N 2 (v i ) N 1 (v j ) > Fig. 7. Example of a Dominating Qery Graph Proof: Since X D shold preserve the same strctral information of Q D, based on Definition 13, the above distance preservation principles can be proved. Now, we propose a dominating qery graph match (DQG-Match) algorithm based on distance preservation principle to find matches of the dominating qery graph. A sbgraph match is a mapping fnction M from V (Q D ) to V (X), where X is a matching sbgraph of Q D in data graph G. The process of finding the mapping fnction can be described by means of State Space Representation (SSR) [14]. Each state s is associated with a partial mapping soltion M(s), which contains only a sbset of M. As shown in Algorithm 1, the crrent state s is initially set to φ. We bild a candidate pair set P A(s) containing all the possible candidate pairs (, v) and add it to the crrent state s (Line 5). When a candidate pair (, v) is added to the crrent mapping M(s), we verify if the partial match M(s) satisfies the distance preservation principle (Lines 6-7). If yes, we contine to explore the state ntil a sbgraph match of Q D is fond (Lines 8-9). Otherwise, the corresponding search branch is terminated. For example, given the dominating qery graph Q D in Fig. 7, {v 2 } and {v 3 } are matching vertices of 1 and 3 in Q D, respectively. Ths, the mapping fnction M is {( 1, v 2 ), ( 3, v 3 )}. Algorithm 1: DQG-Match Inpt: a dominating qery graph Q D, an intermediate state s; the initial state s 0 has M(s 0 ) = φ Otpt: the mapping M(Q D ) between Q D and G s sbgraph 1 if M(s) covers all the vertices of Q D then 2 M(Q D )=M(s); 3 Otpt M(Q D ); 4 else 5 Compte the set P A(s) of all the pairs (, v), where V (Q D ) and v V (G); 6 for each pair (, v) in P A(s) do 7 if (, v) satisfies the distance preservation principle then 8 Add (, v) to M(s) and compte crrent state s ; 9 Call DQG-Match(s ); In the following, we propose DS-match algorithm ( Algorithm 2). Firstly, Algorithm 1 is called to find the mapping fnction M(Q D ) (Line 2). The mapping fnction M(s) of crrent state s is initialized to M(Q D ) (Line 3). Secondly, DS-match extends each match of the dominating qery graph Q D to a sbgraph match of the qery graph Q. If M(s) covers all the vertices of Q, then we otpt the mapping fnction M(Q) (i.e. a sbgraph match of Q) (Lines 4-6). Otherwise, for each non-dominating vertex V (Q) V (Q D ), we considering one-hop and twohop neighboring dominating vertices of (i.e. N 1 ( ) and N 2 ( )) (Lines 7-9). Note that, for each dominating vertex i N 1 ( ) and j N 2 ( ), candidate vertex sets C( i ) and C( j ) have been fond by Algorithm 1. Based on distance preservation principle, each candidate vertex v

10 10 of mst be one-hop neighbor and two-hop neighbor of the vertices in C( i ) and C( j ), respectively (Line 10). Then, we check whether the candidate pair (, v ) satisfies conditions of sbgraph match with set similarity (Definition 1) (Line 11). If yes, (, v ) is added to the crrent state (Line 12). We contine to explore the state ntil all nondominating vertices are considered (Line 13). Algorithm 2: DS-Match Inpt: a qery graph Q, a dominating qery graph Q D, and an intermediate state s; the initial state s 0 has M(s 0 ) = φ Otpt: the mapping M(Q) between qery graph Q and G s sbgraph 1 if M(s 0 ) = φ then 2 Call Algorithm 1 to find M(Q D ); 3 Initialize M(s) with M(Q D ); 4 if M(s) covers all the vertices of Q then 5 M(Q) = M(s); 6 Otpt M(Q); 7 else 8 for each V (Q) V (Q D ) do 9 for each dominating vertex i N 1 ( ) and dominating vertex j N 2 ( ) do 10 for v N 1 (v) N 2 (v) do v C( i) v C( j) 11 if pair (, v ) satisfies conditions of sbgraph match with set similarity then 12 Add (, v ) to M(s) and compte crrent state s ; 13 Call DS-Match(s ); Example 9: As discssed in Example 8, the nondominating vertex of the qery graph Q in Fig. 1(a) is 2. Ths, the neighboring dominating vertices of 2 are 1 and 3. Since the matching vertices of 1 and 3 are v 2 and v 3 respectively, the candidate vertex of 2 is N 1 (v 2 ) N 1 (v 3 ) = v Dominating Set Selection A qery graph may have mltiple dominating sets, leading to different performance of SMS 2 qery processing. Motivated by sch observation, in this sbsection, we propose a dominating set selection algorithm to select a cost-efficient dominating set of qery graph Q, so that the cost of answering SMS 2 qery can be redced. Intitively, a cost-efficient dominating set shold contain minimm nmber of dominating vertices, so that the filtering cost can be minimized. In addition, we shold also garantee the nmber of candidates for each dominating vertex is minimized to redce the size of intermediate reslts dring sbgraph isomorphism. To problem of finding a cost-efficient dominating set is actally a Minimm Dominating Set (MDS) problem. MDS problem is eqivalent to Minimm Edge Cover problem [34], which is NP-hard. As a reslt, we se a best effort Branch and Redce algorithm [34]. The algorithm recrsively select a vertex with minimm nmber of candidate vertices from qery vertices that have not been selected, and add to the edge cover an arbitrary edge incident to. The overhead of finding the dominating set is low, becase the qery graph is sally small. Since we do not know a qery vertex s nmber of candidate vertices before the SMS 2 qery is processed, We se a Hash Sampling method [35] to make qick estimates of the nmber. The hash sampling method can constrct the sample nion P = P 1... P n, where P 1,..., and P n are precompted samples of inverted lists L(P 1 ),..., and L(P n ). The estimate of the nmber of s candidate vertices from the sample can be calclated as: A() = A() P L(P ) d P d (6) where A() P is the nmber of similarity search answers of S() on the niform random sample, L(P ) d is the distinct nmber of vertex ids contained in the mlti-set nion L(P ), and P d is the distinct nmber of vertex ids in the sample nion. Example 10: In Fig. 1, we find 3 has the minimm nmber of candidates, i.e. v 3. Ths, { 1, 3 } is selected as the cost-efficient dominating set according to the branch and redce algorithm. 7 EXPERIMENTAL EVALUATION 7.1 Datasets and Setp All experiments are implemented in C++, and condcted on a 2.5GHZ Core class machine with 4GB RAM. The operating system is Windows 7 Ultimate edition. We se two real datasets Freebase and the DBpedia, and synthetic graphs that follow scale-free graph model. The datasets sed in or experiment are described as follows. 1) Freebase ( is a large collaborative knowledge base of strctred data. We se the entity relation graph of Freebase (denoted by FB), in which each vertex represents an entity, sch as an actor, a movie, etc., and each edge represents the relationship between two entities. Each vertex is associated with a set of elements, which describes featres of the corresponding entity. The weight of each featre specifies its importance, which is normalized to the range of [0, 1]. This graph contains 1,047,829 nodes and 18,245,370 edges. The nmber of distinct elements is 243, and the average nmber of elements of each vertex is 6. 2) DBpedia ( extracts strctred information from Wikipedia ( In or DBpedia dataset (denoted by DBP), each vertex corresponds to an entity (i.e., article) extracted from Wikipedia, which contains a set of words (tokens). The weight of each word is specified by TF/IDF. We se the classical featre selection method [36] to select 2000 words with the highest TF/IDF vale in the DBPedia as the elements. This graph contains 1,010,205 nodes and 1,588,181 edges. The average nmber of elements of each vertex is ) We generate synthetic scale-free data graphs (denoted by SF) sing the graph generator of [37], in which node degree follows power law distribtion. Three SF datasets

11 11 are sed in or experiment: SF1M, SF5M and SF10M which contain 1 million, 5 million and 10 million vertices, 1,260,704, 6,369,405 and 24,750,113 edges, respectively. The nmber of distinct elements is 100, and the average nmber of elements of each vertex is 5.5. Each element is randomly assigned a weight in the range of [0, 1]. The weight distribtion among all the elements follows Uniform distribtion by defalt. SF1M is the defalt SF dataset. According to a recent experimental track paper [38], the graph with 10 million vertices and 24 million edges sed in or experiment is the largest data graph reported in a single machine in the literatres on sbgraph search. Althogh there is a recent work STW on sbgraph search over a billion node graph, it works on a clod of 8 machines with 32 GB RAM and two 4-core CPUs [10]. We extract 100 qery graphs Q from data graph G by starting from a random vertex in G and then traversing the graph throgh connecting random edges, where the maximm nmber, n max, of vertices in qery graph Q is set to {3, 5, 8, 10, 12}. The similarity threshold τ is set to {0.5, 0.6, 0.7, 0.8, 0.9}, and the spport threshold minsp is set to The nmbers in bold font are defalt vales. In sbseqent experiments, each time we vary one parameter, other parameters are set to defalt vales. The performance of SMS 2 qeries is measred in terms of the qery response time per qery graph, and the average nmber of candidates of each qery vertex. The sizes of FB, DBP, SF1M, SF5M and SF10M are 118MB, 51MB, 38MB, 206MB and 415MB, respectively. 7.2 Competing Methods We compare or method (denoted by SMS 2 ) with the baseline method (denoted by BL). BL method bilds an inverted index for all the element sets of data vertices, based on which it searches each qery vertex s candidates. After candidate search, BL finds matching sbgraphs of the qery graph sing QickSI algorithm [19]. We also compare SMS 2 with existing exact sbgraph matching methods inclding Trbo ISO [15], GADDI [18], SPath [8], GraphQL [21], STW [10], D-Join[9] and R-join [2]. Since STW is a distribted graph matching algorithm, for fair comparison, we implement STW sing the same method as [15] in one machine. These methods first find candidate sbgraphs that are isomorphic to the qery graph Q, and then find the sbgraph matches by checking set similarity of each pair of matching vertices. To evalate the performance of set similarity prning, strctral prning and DS-match algorithms, we compare or method with SMS 2 -S, SMS 2 -Q and SMS 2 -R, respectively. SMS 2 -S method only ses set similarity prning to find candidate vertices of each dominating qery vertex. Then, it employs DS-match algorithm to find matching sbgraphs. SMS 2 -Q method finds candidate vertices of all qery vertices instead of dominating qery vertices sing the proposed prning techniqes inclding set similarity prning and strctre-based prning. Then, it employs the QickSI algorithm [19] to find matching sbgraphs based on candidates. The only difference between SMS 2 -R and SMS 2 is that SMS 2 -R randomly chooses a dominating set of the qery graph, while SMS 2 ses dominating set selection algorithm to select a cost-efficient dominating set. 7.3 Offline Performance Index Constrction Time: Fig. 8(a) reports the time cost of index constrction on real/synthetic datasets. For the inverted pattern lattice (denoted by Lattice), the constrction time ranges from seconds to seconds. For signatre bckets (denoted by Bckets), the constrction time ranges from 115 seconds to seconds. Space Cost: The space cost of inverted pattern lattice incldes the space of the lattice in the main memory and the inverted lists in the disk. The space cost of signatre bckets is the size of signatre bckets of all the data vertices. As analyzed in Appendix D, the space complexity of the inverted pattern lattice and signatre bckets are O(2 U ) and O( V (G) ) respectively, where U is the nmber of distinct elements in element niverse U. As shown in Fig. 8(b), the space cost of inverted pattern lattice ranges from 45.9 MB to 523 MB, and the space cost of signatre bckets ranges from 192 MB to MB. We can find that the space cost is proportional to the size of each dataset. Index Constrction Time(sec) Lattice Bckets FB DBP SF1M SF5M SF10M Data Sets (a) Index Constrction Time (sec) Spae Cost(MB) Lattice Bckets FB DBP SF1M SF5M SF10M Data Sets (b) Space Cost Fig. 8. Offline Performance vs. Datasets, minsp = Online Performance Performance vs. Datasets In this sbsection, we first compare SMS 2 with the competing methods on an nlabeled and single-labeled data graph, respectively. Then, we compare SMS 2 with the BL method. Since set similarity becomes invalid for an nlabeled or single-labeled graph, we trn off the set similarity prning process in SMS 2. In addition, we se hashing mechanism rather than nion similarity pper bond (in strctral prning) to directly locate candidate signatre bckets. To enable the comparison on an nlabeled graph, we ignore all sets associated with each vertex and apply competing methods to find sbgraph matches. Unfortnately, all competitors except Trbo ISO fail to finish their qery processing in a reasonable time. The reason is that nlabeled vertices generate a large nmber of intermediate reslts. Althogh Trbo ISO is the only competitor srvived in DBP and SF1M, the expensive verification process leads to long qery response time. As shown in Table 1, or method otperforms Trob ISO by p to times. For a single-labeled graph, we assign a single label to each vertex and se mltiple distinct labels on all vertices. The distinct nmber of labels is set to 10% of the total

12 12 CPU Time(sec) BL SMS 2 -S SMS 2 -Q SMS 2 -R SMS FB DBP SF1M SF5M SF10M FB DBP SF1M SF5M SF10M Datasets Datasets (a) Qery Response Time (sec) (b) Nmber of Candidates Fig. 9. Performance vs. Datasets, n max = 5, τ = 0.9 Nmber of Candidates BL SMS 2 -S SMS 2 -Q SMS 2 -R SMS 2 Qery Response Time(sec) BL-SF SMS 2 -SF BL-FB SMS 2 -FB BL-DBP SMS 2 -DBP Qery Graph Size (a) Qery Response Time (sec) (b) Nmber of Candidates Fig. 10. Performance vs. Qery Graph Size, τ = 0.9 Nmber of Candidates BL-SF SMS 2 -SF BL-FB SMS 2 -FB BL-DBP SMS 2 -DBP Qery Graph Size Qery Response Time(sec) BL-SF SMS 2 -SF BL-FB SMS 2 -FB BL-DBP SMS 2 -DBP Nmber of Candidates BL-SF SMS 2 -SF BL-FB SMS 2 -FB BL-DBP SMS 2 -DBP (a) Qery Response Time (sec) (b) Nmber of Candidates Fig. 11. Performance vs. Similarity Threshold, n max = 5 nmber of vertices. The similarity threshold τ is set to 1. As a reslt, the SMS 2 qery is degraded to exact sbgraph search. We compare the qery response time of SMS 2 with the state-of-the-art sbgraph isomorphism algorithm Trbo ISO [15]. Trbo ISO also fails to finish its qery processing in SF5M and SF10M datasets, respectively. Ths, we se SF1.5M and SF2M datasets instead. As shown in Table 2, SMS 2 reslts in shorter qery response time than Trbo ISO, which proves that the proposed strctral prning and DS-match algorithms can be efficiently applied to exact sbgraph search. Note that, the inverted pattern lattice for a single-labeled graph is actally an inverted index consists of a list of all the distinct labels, and for each label, a list of vertices in which it belongs to. TABLE 1 TABLE 2 Qery Response Time (sec) Qery Response Time (sec) on Unlabeled Graph on Exact Sbgraph Search Dataset SMS 2 Trbo ISO FB 35.7 Fail DBP SF1M SF5M Fail SF10M Fail Dataset SMS 2 Trbo ISO FB DBP SF1M SF1.5M SF2M As shown in Fig. 9(a), SMS 2 reslts in mch shorter qery response time than BL, SMS 2 -S, SMS 2 -Q, and SMS 2 -R on both real and synthetic datasets. The qery response time of SMS 2 changes from seconds to seconds. SMS 2 otperforms SMS 2 -S becase SMS 2 ses both set similarity prning and strctre-based prning while SMS 2 -S only ses set similarity prning. SMS 2 otperforms SMS 2 -Q by at least 58% qery response time. This is becase SMS 2 saves prning cost by only finding candidates of dominating qery vertices. In addition, DS-match algorithm has better performance than QickSI becase it saves the sbgraph matching cost by redcing 0 Uniform Gass Zipf Weight Distribtion 0 Uniform Gass Zipf Weight Distribtion (a) Qery Response Time (sec) (b) Nmber of Candidates Fig. 12. Performance vs. Wt. Distribtion, n max = 5, τ = 0.9 the nmber of intermediate reslts. Since SMS 2 employs dominating set selection algorithm to select a cost-efficient dominating qery graph, SMS 2 has better performance than SMS 2 -R which randomly selects the dominating set. SMS 2 otperforms BL becase set similarity prning and strctre-based prning techniqes reslt in less candidates and prning cost compared with the inverted-index-based similarity search, and DS-match algorithm has better performance than QickSI algorithm. As analyzed in Appendix C, the time complexities of set similarity prning and strctre-based prning techniqes are O( I ) and O( DS(Q) C() ) respectively. We can also observe from Fig. 9(a) that the qery response time of SMS 2 grows linearly with the size of data graph from 1 million to 10 million, which indicates the scalability of or method. As shown in Fig. 9(b), SMS 2 -Q, SMS 2 -R and SMS 2 generate similar nmber of candidates, becase these method se the same prning techniqes. SMS 2 -S and BL also generate similar nmber of candidates, becase they both consider weighted set similarity between each qery vertex and data vertex. SMS 2 generates smaller nmber of candidates than BL by at least 8.5%, and at most 60% on different datasets. These reslts indicate that strctrebased prning techniqe can prne at least 8.5% candidates, and set similarity prning techniqe can prne at least 40% candidates. It is worth noting that the nmber of candidates shows a similar growth trend to the size of data graph. For example, there are 43.4, 170 and candidates for datasets SF1M, SF5M and SF10M Performance vs. Qery Graph Size In this sbsection, we compare SMS 2 with BL by varying qery graph size (i.e., nmber of qery vertices in qery graph) from 3 to 12. To represent different datasets, BL and SMS 2 are frther divided into BL-SF, SMS 2 -SF, BL- FB, SMS 2 -FB, BL-DBP, SMS 2 -DBP.

13 13 As shown in Figres 10(a) and 10(b), the qery response time of SMS 2 increases mch slower than that of BL as the qery graph size changes from 3 to 12. This is becase BL incrs mch more overhead than SMS 2 in both prning phase and sbgraph matching phase. The above reslts also confirm that SMS 2 are more scalable than BL against different qery graph sizes. From Fig. 10(b), the nmber of candidates of SMS 2 and BL decreases as qery graph size increases, and SMS 2 reslts in smaller nmber of candidates than BL. This is becase a small qery graph probably has more sbgraph matches than a large qery graph, and the prning techniqes of SMS 2 have greater prning power than that of BL. Note that, althogh SMS 2 - FB generates more candidates than BL-SF and BL-DBP, it reslts in shorter qery response time. The reason is that both set similarity prning and strctre-based prning save mch qery processing cost compare to existing methods Performance vs. Properties of Element Sets The performance of the set similarity prning techniqes highly depends on the element sets of qery vertices. In this sbsection, we evalate how the qery response time and the average nmber of candidates of each vertex are affected by the following types of sets. These sets contain: high term freqency elements (the term freqency of all elements is larger than 0.98, denoted by HighTF), and low term freqency elements (the term freqency of all elements is lower than 0.01, denoted by LowTF), large nmber of elements (the nmber of elements is larger than 80, denoted by LargeN), small nmber of elements (the nmber of elements is smaller than 5, denoted by SmallN), respectively. We generate qery graphs that only contain the vertices that have one of the for element set types. TABLE 3 Impact of Properties of Element Sets, n max = 5, τ = 0.9 BL-FB SMS 2 -FB BL-DBP SMS 2 -DBP BL-SF SMS 2 -SF HighTF LowTF LargeN SmallN Time (sec) Candidates Time (sec) Candidates Time (sec) Candidates Time (sec) Candidates Time (sec) Candidates Time (sec) Candidates From Table 3, we observe that the qery performance degrades as the term freqency of elements become higher. This is becase there are more candidates of a qery vertex with HighTF set than that of a qery vertex with LowTF set. Moreover, the sizes of inverted lists of high term freqency elements are larger than that of low term freqency elements, which also contribtes to higher qery response time of qery vertices with HighTF sets. Table 3 demonstrates that qeries containing LargeN sets sally have better performance than qeries with SmallN sets. The reason is that LargeN leads to smaller nmber of candidates than SmallN. For LargeN, the prning power of vertical prning decreases as longer prefix (larger p) will be fond, while the prning power of horizontal prning increases as more small size freqent patterns will be prned. For SmallN, the prning power of both vertical prning and horizontal prning increases as shorter prefix will be left and larger size freqent patterns will be prned Performance vs. Similarity Threshold We compare BL with SMS 2 by varying the similarity threshold τ from 0.5 to 1 in different datasets. As shown in Figres 11(a) and 11(b), SMS 2 has mch shorter qery response time and smaller nmber of candidates than BL, especially when the threshold is small. For example, when τ = 0.5, the gap between the average nmber of candidates of SMS 2 -FB and BL-FB is less than 37, while SMS 2 -FB redces more than 1130 seconds qery response time compared to BL-FB. This is becase larger similarity threshold will reslt in fewer candidate vertices, ths redcing the qery response time. Note that, when τ is set to 1, the problem is degraded to an exact sbgraph search problem. In sch case, the set similarity prning is still valid Performance vs. Weight Distribtion Finally, we compare SMS 2 with BL with different weight distribtions of the elements: Uniform distribtion, Gassian distribtion, and Zipf distribtion. We observe from Fig. 12 that the performance does not change mch as the weight distribtion varies. This is expected, becase the similarity fnction and pper bonds cannot be strongly affected by the weight distribtion. The reslts in Fig. 12 confirm that or approach greatly otperforms BL. 8 CONCLUSIONS In this paper, we stdy the problem of sbgraph matching with set similarity, which exists in a wide range of applications. To tackle this problem, we propose efficient prning techniqes by considering both vertex set similarity and graph topology. A novel inverted pattern lattice and strctral signatre bckets are designed to facilitate the online prning. Finally, we propose an efficient dominating-setbased sbgraph match algorithm to find sbgraph matches. Extensive experiments have been condcted to demonstrate the efficiency and effectiveness of or approaches. ACKNOWLEDGMENTS Liang Hong and Lei Zo are spported by National Science Fondation of China (NSFC) nder Grant , and , and Tencent. Philip S. Y is spported by NSF throgh grants CNS , and OISE , US Department of Army throgh grant W911NF and the Pinnacle Lab at Singapore Management University. REFERENCES [1] B. Ci, H. Mei, and B. C. Ooi, Big data: the driver for innovation in databases, National Science Review, vol. 1, no. 1, pp , [2] J. Cheng, J. X. Y, B. Ding, P. S. Y, and H. Wang, Fast graph pattern matching, in ICDE, 2008, pp

14 [3] X. Zh, S. Song, X. Lian, J. Wang, and L. Zo, Matching heterogeneos event data, in SIGMOD, 2014, pp. 1211 1222. [4] S. Brckner1, F. Hffner, R. M. Karp, R. Shamir, and R.

X, Parallel sbgraph listing in a large-scale graph, in SIGMOD, 2014. [6] Y. Shao, L. Chen, and B. Ci, Efficient cohesive sbgraphs detection in parallel, in SIGMOD, 2014. [7] Y. Tian and J. M.

T. Ozs, Distance-join: Pattern match qery in a large graph database, PVLDB, vol. 2, no. 1, 2009. [10] Z. Sn, H. Wang, H. Wang, B. Shao, and J.

Zhang, Node similarity in the citation graph, Knowledge and Information Systems, vol. 11, no. 1, pp. 105 129, 2006. [12] S. Aer, C. Bizer, G. Kobilarov, J. Lehmann, and Z.

14 14 [3] X. Zh, S. Song, X. Lian, J. Wang, and L. Zo, Matching heterogeneos event data, in SIGMOD, 2014, pp [4] S. Brckner1, F. Hffner, R. M. Karp, R. Shamir, and R. Sharan, Torqe: topology-free qerying of protein interaction networks, Ncleic Acids Research, vol. 37, no. sppl 2, pp. W106 W108, [5] Y. Shao, B. Ci, L. Chen, L. Ma, J. Yao, and N. X, Parallel sbgraph listing in a large-scale graph, in SIGMOD, [6] Y. Shao, L. Chen, and B. Ci, Efficient cohesive sbgraphs detection in parallel, in SIGMOD, [7] Y. Tian and J. M. Patel, Tale: A tool for approximate large graph matching, in ICDE, [8] P. Zhao and J. Han, On graph qery optimization in large networks, PVLDB, vol. 3, no. 1-2, [9] L. Zo, L. Chen, and M. T. Ozs, Distance-join: Pattern match qery in a large graph database, PVLDB, vol. 2, no. 1, [10] Z. Sn, H. Wang, H. Wang, B. Shao, and J. Li, Efficient sbgraph matching on billion node graphs, PVLDB, vol. 5, no. 9, [11] W. L, J. Janssen, E. Milios, N. Japkowicz, and Y. Zhang, Node similarity in the citation graph, Knowledge and Information Systems, vol. 11, no. 1, pp , [12] S. Aer, C. Bizer, G. Kobilarov, J. Lehmann, and Z. Ives, Dbpedia: A ncles for a web of open data, in ISWC, [13] M. Hadjieleftherio and D. Srivastava, Weighted set-based string similarity, in IEEE Data Engineering Blletin, [14] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, A (sb)graph isomorphism algorithm for matching large graphs, PAMI, vol. 26, no. 10, [15] W.-S. Han, J. Lee, and J.-H. Lee, Trboiso: Towards ltrafast and robst sbgraph isomorphism search in large graph databases, in SIGMOD, [16] H. He and A. K. Singh, Closre-tree: An index strctre for graph qeries, in ICDE, [17] J. R. Ullmann, An algorithm for sbgraph isomorphism, Jornal of the ACM, vol. 23, no. 1, [18] S. Zhang, S. Li, and J. Yang, Gaddi: Distance index based sbgraph matching in biological networks, in EDBT, [19] H. Shang, Y. Zhang, X. Lin, and J. X. Y, Taming verification hardness: An efficient algorithm for testing sbgraph isomorphism, PVLDB, vol. 1, no. 1, [20] R. Di Natale, A. Ferro, R. Gigno, M. Mongiovì, A. Plvirenti, and D. Shasha, Sing: Sbgraph search in nonhomogeneos graphs, BMC bioinformatics, vol. 11, no. 1, p. 96, [21] H. He and A. K. Singh, Graphs-at-a-time: qery langage and access methods for graph databases, in SIGMOD, 2008, pp [22] G. Gülsoy and T. Kahveci, Rinq: Reference-based indexing for network qeries, Bioinformatics, vol. 27, no. 13, pp , [23] V. Memišević and N. Pržlj, C-graal: Common-neighborsbased global graph alignment of biological networks, Integrative Biology, vol. 4, no. 7, pp , [24] Y. Tian, R. C. McEachin, C. Santos, D. J. States, and J. M. Patel, Saga: a sbgraph matching tool for biological graphs, Bioinformatics, vol. 23, no. 2, pp , [25] S. Ma, Y. Cao, W. Fan, J. Hai, and T. Wo, Captring topology in graph pattern matching, PVLDB, vol. 5, no. 4, [26] A. Khan, Y. W, C. C. Aggarwal, and X. Yan, Nema: Fast graph search with label similarity, Proceedings of the VLDB Endowment, vol. 6, no. 3, pp , [27] X. Y, Y. Sn, P. Zhao, and J. Han, Qery-driven discovery of semantically similar sbstrctres in heterogeneos networks, in KDD, [28] L. Zo, L. Chen, and Y. L, Top-k sbgraph matching qery in a large graph, in PIKM, [29] M. Hadjieleftherio, A. Chandel, N. Kodas, and D. Srivastava, Fast indexes and algorithms for set similarity selection qeries, in ICDE, [30] R. J. Bayardo, Y. Ma, and R. Srikant, Scaling p all pairs similarity search, in Proceeding of WWW, [31] D. Xin, J. Han, X. Yan, and H. Cheng, Mining compressed freqent-pattern sets, in VLDB, [32] S. Chadhri, V. Ganti, and R. Kashik, A primitive operator for similarity joins in data cleaning, in ICDE, [33] A. Gionis, P. Indyky, and R. Motwaniz, Similarity search in high dimensions via hashing, in VLDB, [34] F. V. Fomin, F. Grandoni, and D. Kratsch, A measre and conqer approach for the analysis of exact algorithms, Jornal of ACM, vol. 56, no. 5, [35] M. Hadjieleftherio, X. Y, N. Kodas, and D. Srivastava, Hashed samples: Selectivity estimators for set similarity selection qeries, in Proceeding of VLDB, [36] Y. Yang and P. J. O., A comparative stdy on featre selection in text categorization, in ICML, [37] F. Viger and M. Latapy, Efficient and simple generation of random simple connected graphs with prescribed degree seqence, in COCOON, [38] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee, An indepth comparison of sbgraph isomorphism algorithms in graph databases, PVLDB, vol. 6, no. 2, pp , Liang Hong received his BS degree and Ph.D. degree in Compter Science at Hazhong University of Science and Technology (HUST) in 2003 and 2009, respectively. Now, he is an assistant professor in School of Information Management of Whan University. His research interests inclde graph database, spatio-temporal data management and social networks. Lei Zo received his BS degree and Ph.D. degree in Compter Science at Hazhong U- niversity of Science and Technology (HUST) in 2003 and 2009, respectively. Now, he is an associate professor in Institte of Compter Science and Technology of Peking University. His research interests inclde graph database and semantic data management. Xiang Lian received the BS degree from the Department of Compter Science and Technology, Nanjing University, in 2003, and the PhD degree in compter science from the Hong Kong University of Science and Technology, Hong Kong. He is now an assistant professor in the Department of Compter Science at the University of Texas - Pan American. His research interests inclde probabilistic/ncertain data management, probabilistic RDF graphs. Philip S. Y is a Distingished Professor in Compter Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information Technology. Dr. Y was manager of the Software Tools and Techniqes grop at the Watson Research Center of IBM. His research interest is on big data, inclding data mining, data stream, database and privacy. He has pblished more than 780 papers in refereed jornals and conferences. He holds or has applied for more than 250 US patents. Dr. Y is a Fellow of the ACM and the IEEE.

Cohesive Subgraph Mining on Attributed Graph

Cohesive Subgraph Mining on Attributed Graph Cohesive Sbgraph Mining on Attribted Graph Fan Zhang, Ying Zhang, L Qin, Wenjie Zhang, Xemin Lin QCIS, University of Technology, Sydney, University of New Soth Wales fanzhang.cs@gmail.com, {Ying.Zhang,