SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs


Shijie Zhang, Jiong Yang, Wei Jin
EECS Dept., Case Western Reserve University

ABSTRACT
With the emergence of new applications, e.g., computational biology, new software engineering techniques, social networks, etc., more data is in the form of graphs. Locating occurrences of a query graph in a large database graph is an important research topic. Due to the existence of noise (e.g., missing edges) in the large database graph, we investigate the problem of approximate subgraph indexing, i.e., finding the occurrences of a query graph in a large database graph with (possible) missing edges. The SAPPER method is proposed to solve this problem. Utilizing the hybrid neighborhood unit structures in the index, SAPPER takes advantage of pre-generated random spanning trees and a carefully designed graph enumeration order. Real and synthetic data sets are employed to demonstrate the efficiency and scalability of our approximate subgraph indexing method.

1. INTRODUCTION
Graph data has appeared in many recent applications, ranging from bioinformatics and software engineering to social networks. Managing, processing, and analyzing these graph data has become an urgent practical problem. Subgraph query is one of the most fundamental procedures in managing graphs. In many applications, e.g., biological networks, graphs are large, with thousands or tens of thousands of vertices and millions of edges. A subgraph query is to identify the occurrences of the query subgraph in the database graph. Although subgraph query has been studied previously [24], the basic assumption is that the networks of interest are perfectly clean. In order to qualify as an occurrence of a query subgraph q, all edges of the query graph have to occur in the database graph G. In other words, the occurrence has to be exact. However, noise commonly exists in many applications, or the approximate matches themselves are more interesting. For example:
1. A challenging problem in computational biology is to annotate, index and search subgraphs in large networks generated with high throughput experiments. Specifically, the problem is to search for well characterized pathways/patterns in a less studied model organism [7]. Subgraph indexing is useful in querying for pathways/patterns from well studied model organisms in other unfamiliar organisms with known protein-protein interaction networks, where vertices and edges represent proteins and interactions, respectively. However, due to possible errors in data collection and different thresholds used in experiments, the data are highly noisy. Missing interactions are common and it is very difficult to clean the data. By discovering and analyzing the approximate matches, biologists would generate solid hypotheses for future studies in understanding and identifying pathways/patterns in not so well studied model organisms.

Acknowledgement: This presentation was made possible, in part, through financial support from the School of Graduate Studies at Case Western Reserve University. This project was partially supported by grants of NSF and NSF060. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore. Proceedings of the VLDB Endowment, Vol. 3, No. 1. Copyright 2010 VLDB Endowment.

2. In object-oriented programming, developers and testers handle multiple objects of the same or different classes.
The object dependency graph of a program run, where each vertex is an object and each edge is an interaction between two objects through a method call or field access, helps developers and testers understand the flow of the program and identify bugs. The patterns to be queried that are confirmed by the developers as typical object usages can be used to automatically detect the locations in programs that deviate from them (that is, locations similar to the pattern but not exactly the same) [8]. Hence, by retrieving the approximate occurrences of a typical pattern, developers and testers can quickly locate where the possible bugs are.

In this paper, we investigate the problem of discovering the occurrences of a query graph q in G. The query graph may contain dozens of vertices. Subgraph indexing has been studied before [24, 9, 9]. In previous work, to qualify as an occurrence of q in G, all edges of q have to occur. In contrast, we study the subgraph indexing problem in the context of noise, e.g., missing edges. Therefore, in this paper, an approximate match model is developed. In this model, the edge edit distance (i.e., the number of edge modifications needed to transform one graph into another) is used to qualify an occurrence of q. If the edge edit distance between the query graph q and a subgraph q′ of G is no more than some threshold θ, then q′ is considered an approximate occurrence of q. This approximate matching model takes into account missing edges in the database graph G. Note that we do not consider approximate matches with additional edges relative to the query graph because such matches are always contained by the matches of the query graph. We do not consider label mismatches because the number of possible candidate graphs with label mismatches to a given query graph can be huge. For example, let us assume the size of the query graph is n and the number of vertex labels in the database graph is m; then the total number of candidate graphs with only two label

mismatches is $n(n-1)(m-1)^2/2$ even without considering any missing edges.

There is a very straightforward solution for approximate query matching. We can first find all graphs whose edge edit distance to q is no more than θ. Next, for each of these graphs q′, the exact occurrences of q′ in G can be discovered. In this way, approximate subgraph matching can be reduced to the problem of exact subgraph matching, and existing methods, e.g., GADDI [24], can be applied. However, this approach has two shortcomings. First, exact subgraph matching itself is a very difficult problem since subgraph isomorphism is known to be an NP-hard problem. Second, there could be a potentially large number of graphs whose edge edit distance is no more than θ away from q (denote these graphs as AI(q, θ)). For instance, if q has m edges, then the number of graphs in AI(q, θ) could be $O(m^\theta)$, which could be very large. Thus, it is crucial to devise an efficient way to process the group of queries.

In this paper, we aim to solve the above two problems. To efficiently identify the occurrences of one subgraph, a novel indexing structure, the hybrid neighborhood unit (HNU), is devised. Let $N_i(v, G)$ be the set of vertices u in G such that there exists an i-edge path in G that connects u and v. For each vertex v in the database graph G, the HNU stores the degree of v and the labels of v, v's neighbors ($N_1(v, G)$), and v's neighbors' neighbors ($N_2(v, G)$). In most cases, $N_1(v, G)$ is a relatively small set, but $N_2(v, G)$ could be large. For a graph with average degree d, there could be $d^2$ vertices in $N_2(v, G)$. During query time, when matching one vertex u in q to a vertex v in G, we need to find out whether the labels in $N_2(u, q)$ are a subset of those in $N_2(v, G)$, which could be costly if these sets are large. To efficiently determine the set relationship, the bloom filter [] data structure is used to represent the labels in $N_2(v, G)$. The bloom filter is an L-bit vector which can be used to determine whether one set is a subset of another. It has the following advantages. It is time efficient and space compact.
Moreover, it has no false negatives and only a small rate (1%) of false positives. Therefore, the vertices in the query graph q can be efficiently matched to the vertices in G with high accuracy.

To improve the efficiency of processing a set of subgraph queries (graphs in AI(q, θ)), we make the following observation. Although there could be $m^\theta$ graphs in AI(q, θ), these graphs are highly overlapped. Therefore, it is beneficial to query the overlapping parts first since they have the greatest pruning power, i.e., they can be used in many of the graphs in AI(q, θ). As a result, the spanning trees of q are used for the query first because (i) many graphs in AI(q, θ) contain some spanning tree of q and (ii) the time to identify a tree in G is quite small. Based on the matches of the spanning trees, we can map vertices in q to vertices in G. The graph occurrences have a property similar to the Apriori property [2] because an occurrence of a supergraph has to contain an occurrence of a subgraph. Therefore, finding the matches of graphs in AI(q, θ) is similar to discovering frequent patterns. As a result, a depth-first enumeration order similar to that of FP-tree [0] is constructed for matching graphs in AI(q, θ) so that previously discovered occurrences of a graph q′ can be used for the matching of later enumerated supergraphs of q′.

The remainder of this paper is organized as follows. Section 2 is the related work and Section 3 is the preliminaries. We present how to preprocess the database and construct the index in Section 4. Section 5 describes the query processing. The experiment results are presented in Section 6. Last, the final conclusion is drawn in Section 7.

2. RELATED WORK
Graph database research has attracted great attention in recent years. Related work on subgraph indexing for approximate graph matching includes subgraph isomorphism algorithms, graph indexing and subgraph indexing, approximate subgraph matching and graph similarity search. The first category of related research lies in subgraph isomorphism algorithms. Ullmann [20] proposed a subgraph matching algorithm based on a state space search method with backtracking.
However, this algorithm is prohibitively expensive for querying against a large graph. Cordella [6] proposed a new subgraph isomorphism algorithm for large graphs. These algorithms do not utilize any index structure built by preprocessing the database graphs.

Many index-based graph matching and searching schemes have been proposed to find where the query graph occurs in graph databases [4, 2, 8, 22, 2, 9]; these can be further divided into graph indexing and subgraph indexing. In graph indexing, e.g., gIndex [22], TreePi [2], FG-Index [4], the graph database consists of a set of small graphs. Graph indexing aims to find all database graphs that contain or are contained by a given query graph. On the other hand, in subgraph indexing, e.g., GraphGrep [9], TALE [9], GADDI [24], the goal is to index a very large database graph, so that we can find all or a subset of the matches of a given query graph efficiently in the very large database graph. The proposed method, SAPPER, also deals with subgraph indexing in a very large database graph, and thus falls into this category.

Recently, a number of algorithms have been proposed which support approximate graph matching or similarity search through different means [7,, 2, 9, 2,, 9, 6]. C-Tree [] organizes database graphs in a tree-based structure, where interior nodes are graph closures and leaf nodes are database graphs. The design of its data structure enables it to perform similarity queries efficiently. In TALE [9], important nodes are matched first and then the match is progressively extended. The method is very effective and fast in approximately finding matches in a large graph. In G-Hash [2], wavelet graph matching kernels are applied along with a hashing scheme. In [], a top-k query scheme is proposed to find the most similar k answers. However, most of these algorithms are not designed for finding all approximate matches for the query graph with a given threshold in a very large graph. In [6], the authors aim to find the database graphs that are similar to the query graph. Since the database graphs and the query graph are all small, they transform approximate graph matching into the SET-COVER problem.
Another category of research related to subgraph matching is graph alignment [4, 7]. Instead of matching subgraphs in a large database graph, these methods aim to align a pair of biological graphs. In the problem studied in this paper, the size of the query graph may be much smaller than that of the database graph. Thus, the graph alignment methods may not be directly applicable.

3. PRELIMINARIES
In this section, we introduce the fundamental definitions used in this paper and give the formal problem statement. We investigate approximate graph matching methods for undirected and unweighted labeled graphs. The methods extend readily to directed and weighted labeled graphs.

DEFINITION 1. A labeled graph G is a five element tuple $G = (V, E, \Sigma_V, \Sigma_E, L_G)$ where V is a set of vertices and $E \subseteq V \times V$ is a set of edges. $\Sigma_V$ and $\Sigma_E$ are the sets of vertex and edge labels, respectively. The labeling function $L_G$ defines the mappings $V \to \Sigma_V$ and $E \to \Sigma_E$.

DEFINITION 2. The edge edit distance from a graph $g_1$ to $g_2$ is defined as the minimum number of added edges required to transform $g_1$ into $g_2$. We denote the edge edit distance as $D_{edit}(g_1, g_2)$.

Figure 1: An Example of the Edge Edit Distance

For example, in Figure 1, by adding two edges to $g_1$, we can transform $g_1$ into $g_2$. This leads to $D_{edit}(g_1, g_2) = 2$. Edge edit distance is not symmetric, i.e., $\exists g_1, g_2,\ D_{edit}(g_1, g_2) \neq D_{edit}(g_2, g_1)$. When a graph g cannot be transformed into another graph g′ by adding edges, we have $D_{edit}(g, g') = +\infty$.

DEFINITION 3. Given a database graph G, a connected query graph q, and an integer θ as threshold, a connected subgraph s of G is defined as an approximate match of q in G if and only if $D_{edit}(s, q) \le \theta$; any graph isomorphic to s is defined as approximately isomorphic to q. The set of graphs approximately isomorphic to q is denoted as AI(q, θ).

If the edge edit distance from an approximate match m to q is exactly zero, m is an exact match of q in G. Clearly, the set of approximate matches of any query graph is a superset of the set of exact matches of the same query graph. In this paper, two restrictions on the approximate match are imposed: (i) the approximate match has to be connected and (ii) only edge additions are considered, not edge deletions. A brief discussion of approximate matches without these two restrictions is presented in the appendix.

Problem Statement: We aim to solve the following two problems. (1) Given a large database graph G, we want to construct an index. (2) Given a query graph q and a threshold integer θ, we want to efficiently find all matches of graphs that are approximately isomorphic to q in G with the help of the indexed information. Our goal is not to find some of the matches to the graphs in AI(q, θ), but to find all matches to the graphs in AI(q, θ). The word approximate refers to the matches of graphs that are approximately isomorphic to the query graph.

Figure 2: The Database Graph, Query Graph and Matches

In Figure 2, given the query graph and threshold θ = 1, two distinct approximate matches exist in the database graph. The edge edit distance from the left approximate match to the query graph is one, while it is zero for the right match, which is also an exact match.

Before presenting the approximate subgraph indexing method, we introduce the subgraph matching property, which will be used extensively later in this paper.

PROPERTY 1. Given a query graph q and a database graph G, for any exact match g of q in G, let q′ be a subgraph of q; then g must contain a match of q′ in G.

Figure 2 also illustrates this property: the right exact match in the database graph contains a match of any subgraph q′ of the query graph q. This property is similar to the Apriori property in frequent pattern mining [2]. With this property, we can devise an algorithm that searches for the matches of a subgraph first. By refining these matches, we can build the matches for larger subgraphs.

The processing of graph queries in our paper can be divided into two major steps. In the first step, we construct the index from the database graph. The hybrid neighborhood unit (HNU) is used to store the useful local information for each vertex. In the second step, approximate matches of the query graph q are identified.

4. HYBRID NEIGHBORHOOD UNIT INDEX
In GraphGrep [9], the effectiveness of paths is first revealed, while in TALE [9], the neighboring unit proves to be a compact and powerful index unit. In GADDI [24], the neighboring-distance-based index shows its strength in graph matching in a single large graph. Taking the usefulness of these three models into account, we create a new index unit, called the hybrid neighborhood unit (HNU). For each vertex v in G, let $N_i(v, G)$ be the set of vertices u in G such that there exists a path of i edges between u and v. For example, $N_1(v, G)$ is the set of vertices that are adjacent to v in G. For the database graph G, we construct the HNU for each vertex v in G. The HNU of v includes four parts: the label of v, the degree of v, the labels of vertices in $N_1(v, G)$ and the labels of vertices in $N_2(v, G)$. The first three parts are easy to compute and efficient to store. However, the last part could be too large. For a graph with an average degree of d, $|N_2(v, G)|$ could be $O(d^2)$.
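The four parts of the HNU can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_hnu`, the dictionary layout, and the choice to exclude v itself from $N_2(v, G)$ are assumptions made for illustration, and the label sets are stored as plain sets rather than in a compact filter.

```python
from collections import defaultdict


def build_hnu(labels, edges):
    """Collect the four HNU parts for every vertex.

    labels: dict mapping vertex -> label (hypothetical input format).
    edges:  iterable of undirected (u, v) pairs.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    hnu = {}
    for v in labels:
        n1 = adj[v]
        # N_2: vertices reachable by a 2-edge path; v itself is excluded
        # here by assumption (a walk v-u-v is not counted as a path).
        n2 = {w for u in n1 for w in adj[u] if w != v}
        hnu[v] = {
            "label": labels[v],
            "degree": len(n1),
            "n1_labels": {labels[u] for u in n1},
            "n2_labels": {labels[w] for w in n2},
        }
    return hnu


# Usage on a three-vertex path 1 - 2 - 3.
units = build_hnu({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
print(units[1])
```

For vertex 1 of the path, the unit holds label A, degree 1, neighbor labels {B} and two-hop labels {C}; the last set is the part that can grow to roughly $d^2$ entries.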
The bloom filter [] is used to store the labels in $N_2(v, G)$. A bloom filter B is an L-bit vector and a set of m independent hash functions $\{f_1, f_2, \ldots, f_m\}$. It is used to determine whether an element x is a member of a set X. Each of the m hash functions $f_i$ maps an element into an integer between 1 and L. Initially, all bits in B are set to 0. If $f_i$ maps an element in X into the integer k, then the kth bit in B (B[k]) is set to 1. After mapping every element in X with the m hash functions, some bits in B are 1 while others are 0. To determine whether x is in X, x is mapped to m integers with the m independent hash functions. Assume that $f_i(x) = k_i$. If $x \in X$, then $B[k_i]$ has to be 1 for all $k_i$ ($1 \le i \le m$). If $\exists k_i,\ B[k_i] = 0$, then x cannot be a member of X. There is no false negative in the bloom filter. However, there could be false positives, i.e., if all mapped bits of x are 1 in B, then there is still a chance that x is not a member of X. The error rate depends on L, |X| (the number of elements in X), and m. The optimal number of independent hash functions is approximately 0.7 L/|X|. In addition, if the false positive rate is set to 1%, then L/|X| should be 9.6 []. Since X is the set of labels of vertices in $N_2(v, G)$, |X| can be approximated by $d^2$, where d is the average degree of a vertex in G. Without loss of generality, we choose L and m to be $9.6d^2$ and 7, respectively, to ensure a false positive rate of no more than 0.01. If a lower false positive rate is needed, each time we add about 4.8 bits per element to the length of the bloom filter, the false positive rate is reduced by a factor of ten. In the HNU of vertex v, the labels

of $N_2(v, G)$ are collected and an L-bit bloom filter is built during index construction time. The time complexity to obtain the first three parts of the HNU is O(d) for each vertex, while the bloom filter takes $O(d^2 m + L)$ time to build. Since L is in the order of $d^2$, the time complexity of bloom filter construction can be simplified to $O(md^2)$. Thus, the total index construction time for all vertices in G is $O(md^2 |V_G|)$, where $|V_G|$ is the number of vertices in G.

5. SAPPER QUERY PROCESSING
In this section, we introduce the approximate subgraph matching algorithm, namely SAPPER. During the query of a subgraph q in G, SAPPER consists of four main steps: vertex matching, constructing random spanning trees of q, generating the matching order of graphs in AI(q, θ), and the final graph matching. SAPPER first finds candidate matches of each vertex $v_q \in q$ to vertices in G based on the HNUs. Next, we randomly generate a set of spanning trees of q. The matches of the spanning trees are discovered based on the vertex matches. The spanning tree matches are used for matching the approximate graphs. Since there are multiple graphs that need to be matched, an order for matching these graphs is determined. Finally, matches of all these graphs are discovered.

5.1 Vertex Matching
For each vertex $v_q$ in the query subgraph q, we search for its matches in G based on the HNUs. A vertex $v_G$ in G is a match of $v_q$ if all the following conditions are satisfied:
1) The label of $v_q$ is the same as that of $v_G$.
2) The degree of $v_q$ is less than or equal to that of $v_G$.
3) The labels of vertices in $N_1(v_q, q)$ are a subset of those of $N_1(v_G, G)$.
4) The labels of vertices in $N_2(v_q, q)$ are a subset of those of $N_2(v_G, G)$.
In the last step, the bloom filter B is employed. Each label in $N_2(v_q, q)$ is hashed via the m hash functions, and we check whether the corresponding bits in B of $v_G$ are 1. After this step, each $v_q$ is associated with a set of matched vertices in G, denoted as $M(v_q)$.
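The four conditions above, with the bloom filter used for the fourth, might be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than SAPPER's actual code: the class `BloomFilter`, the function `is_candidate`, and the dictionary keys are hypothetical names, salted SHA-256 digests stand in for the m independent hash functions, and bit positions run from 0 to L-1 instead of 1 to L.

```python
import hashlib


class BloomFilter:
    """An L-bit vector with m hash functions (salted SHA-256 as a stand-in)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # One bit position per hash function, derived from a salted digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


def is_candidate(q_unit, g_unit):
    """Check conditions 1) to 4) for a query vertex against a database vertex."""
    return (
        q_unit["label"] == g_unit["label"]
        and q_unit["degree"] <= g_unit["degree"]
        and q_unit["n1_labels"] <= g_unit["n1_labels"]
        and all(g_unit["n2_bloom"].might_contain(lab)
                for lab in q_unit["n2_labels"])
    )


# Usage: a database vertex whose two-hop labels {B, C} sit in a 96-bit filter.
bloom = BloomFilter(num_bits=96, num_hashes=7)
for lab in ("B", "C"):
    bloom.add(lab)
g_unit = {"label": "A", "degree": 3, "n1_labels": {"B", "C"}, "n2_bloom": bloom}
q_unit = {"label": "A", "degree": 2, "n1_labels": {"B"}, "n2_labels": {"C"}}
print(is_candidate(q_unit, g_unit))
```

Because a bloom filter has no false negatives, a database vertex that truly satisfies condition 4) is never rejected; only extra candidates can slip through, and later steps prune them.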
The total time complexity of this step is $O(d^2 m |V(q)| |V(G)|)$, where d, m, |V(q)|, and |V(G)| are the maximum of the average degrees of G and q, the number of hash functions for the bloom filter, the number of vertices in q, and the number of vertices in G, respectively. There are some false positives in the fourth step due to the bloom filter. The total false positive rate is $1 - (1 - e)^l$, where e and l are the false positive rate of determining whether one element is in the bloom filter and the number of distinct labels in $N_2(v_q, q)$, respectively. This is because if any label out of the l labels is reported as a false positive by the bloom filter of $v_G$, then $v_G$ is a false positive match of $v_q$. If e and l are 0.01 and 10, then the total false positive rate is less than 0.1. Since vertex matching only finds a candidate set of matches for a vertex in q, this false positive rate is well within tolerance.

5.2 Random Spanning Tree Generation and Matching
Although matches for vertices have been discovered, these matches are determined based on local information (within 2-edge distance). It is possible that some of these matches are false positives. Therefore, more information needs to be used to prune the matches. Since our ultimate goal is to find matches for all graphs in AI(q, θ), it is desirable to use the global information existing in a large number of the graphs of AI(q, θ). All graphs in AI(q, θ) are θ or less edge edit distance away from q, and hence they are heavily overlapped. Therefore, spanning trees of q will be used for the global information because graphs in AI(q, θ) would share many spanning trees. In addition, we want each edge in q to have the same probability of being selected into a spanning tree. This could ensure that each graph in AI(q, θ) would contain a similar number of spanning trees, and thus have a similar amount of pruning power. A random spanning tree T of q has the following property: each edge e in q has the same probability of being selected into T []. For a graph q with vertices V(q) and edges E(q), a random spanning tree T of q is constructed via a random walk.
A random walk on q is a discrete-time Markov chain with the following transition probabilities from a vertex v to another vertex w: $P(v, w) = 1/d_v$ ($d_v$ is the degree of vertex v) if there is an edge from v to w. Otherwise, $P(v, w) = 0$. Initially, a vertex $v_0 \in V(q)$ is randomly chosen as the starting point and the spanning tree T only contains vertex $v_0$. The random walk starts at $v_0$. An edge (v, w) is randomly chosen based on the probability P. If w is not in T, edge (v, w) and w are inserted into T. Otherwise, no edge will be added into T. Next, the walk is repeated on w. This process terminates when T includes all vertices of V(q). The formal random tree construction algorithm is described in Algorithm 1 in the Appendix and an example is depicted in Figure 3. In the example, at time step $t_0$, T only includes $v_0$ and no edge. At time step $t_1$, the edge $(v_0, v_1)$ and $v_1$ are added into T, and T contains vertices $v_0$, $v_1$, and one edge. At time step $t_2$, no edge or vertex is added into T since the random walk is back at $v_0$. At $t_3$, the edge $(v_0, v_2)$ and vertex $v_2$ are added into T and a spanning tree is formed.

Figure 3: The Random Spanning Tree Generation (panels: the query graph; the tree T at times $t_0$, $t_1$, $t_2$, $t_3$; the spanning tree)

A tree generated by this random walk algorithm is a uniform random spanning tree, i.e., the probability of a spanning tree t of graph q being generated by Algorithm 1 is 1/TN(q), where TN(q) is the number of distinct spanning trees of q. This can be proved by showing that the set of trees constructed by the random walk has a stationary distribution proportional to the degree of the vertex from which it starts. The detailed proof was presented in []. In this step, we generate |V(q)| + random spanning trees so that (1) each edge has 8% probability to be included in at least one of the spanning trees and (2) the complexity is still not too large. A vertex v in q is randomly chosen as the prime vertex. For each generated spanning tree T, we find its matches in G based on the vertex matches. The matching starts from the prime vertex v in T and tries to match v's neighbors in T.
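The random-walk construction described above (commonly known as the Aldous-Broder algorithm) can be sketched in a few lines of Python. This is a minimal sketch, not a transcription of the paper's Algorithm 1: the function name `random_spanning_tree`, the adjacency-list input, and the optional `rng` argument are assumptions for illustration, and the query graph is assumed to be connected.

```python
import random


def random_spanning_tree(adj, rng=random):
    """Build a spanning tree of a connected undirected graph by random walk.

    adj: dict mapping each vertex to a list of its neighbours.
    rng: object with a choice() method (the random module by default).
    """
    current = rng.choice(list(adj))      # random starting vertex v_0
    in_tree = {current}
    tree_edges = []
    while len(in_tree) < len(adj):
        nxt = rng.choice(adj[current])   # step with P(v, w) = 1 / d_v
        if nxt not in in_tree:
            # First visit of nxt: the edge used to enter it joins the tree.
            in_tree.add(nxt)
            tree_edges.append((current, nxt))
        current = nxt                    # the walk continues from w
    return tree_edges


# Usage: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
q_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_spanning_tree(q_adj, random.Random(0)))
```

Every returned tree has |V(q)| - 1 edges, because exactly one edge is recorded each time a new vertex is reached. Calling the function repeatedly yields the set of random spanning trees used for pruning.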
For example, in tree matching, assume that the matches in G of the prime vertex v are $M(v) = \{u_1, u_2\}$ and v is connected to $v_1$ in T. Then we try to see whether $v_1$ matches any neighbor of $u_1$ or $u_2$ in G. In other words, we want to see whether any neighbor of $u_1$ or $u_2$ is in $M(v_1)$. If $v_1$ could only be matched to some neighbor of $u_1$, but not to any neighbor of $u_2$, we know that $u_2$ could not be a match of v for the occurrence of T in G and hence, $u_2$ could be removed from the matches for v of T. The process continues until all matches of T are located. The matching process is performed in a depth-first traversal manner. Since the tree is a very

special form of graph, the matching of a tree in G is rather efficient and simple. Due to space limitations, we omit the details of tree matching in this paper. After the matching process for T, the prime vertex v has a set of matched vertices in G for T. $M(v, T_i)$ denotes the set of vertices in G that could be matched to the prime vertex v for the query graph $T_i$. For example, in Figure 5, v has label in the query graph, then $M(v, T_1)$ = {, 0} (circled by the solid ellipses), where and 0 are the ids of the mapped vertices of v in the two matches of the spanning tree $T_1$. Since there are |V(q)| + spanning trees, there are |V(q)| + sets of $M(v, T_i)$. These matched sets of v serve as the starting point for the later graph matching.

Given a query graph q and a threshold θ, there are approximately $\binom{|E(q)|}{\theta}$ subgraphs of q with $|E(q)| - \theta$ edges. After generating |V(q)| + spanning trees, a subgraph of q with $|E(q)| - \theta$ edges has the probability P of containing at least one of these random spanning trees, where P is

$$P = 1 - \left(1 - \left(\frac{|E(q)| - \theta}{|E(q)|}\right)^{|V(q)| - 1}\right)^{|V(q)| +}$$

For instance, if q consists of 10 vertices and 20 edges and θ is 2, P would be larger than . This means that most of these graphs could utilize the match information of the spanning trees.

5.3 Query Graph Enumeration Order
Since there are many graphs in AI(q, θ), we need to devise an order for enumerating these graphs. This problem is similar to that of frequent pattern mining in the data mining field. There are two main approaches to enumerating patterns in frequent pattern mining: breadth-first enumeration and depth-first enumeration. In breadth-first enumeration [2], all patterns (graphs) with i items (edges) are first enumerated. Based on the occurrences (matches) of these patterns (graphs), their super-patterns (supergraphs) with one extra item (edge) are enumerated, and so on. In depth-first pattern (graph) enumeration [0], one pattern (subgraph) is generated first; if it has sufficient occurrences (matches), one item (edge) is added into the pattern (subgraph), the occurrences (matches) of the new pattern are searched, and so on.
It has been shown that depth-first enumeration has an advantage over breadth-first search because in depth-first search, (1) pattern generation is simpler and more efficient, (2) the match of a pattern can be directly built on its predecessor, and (3) many patterns are not enumerated. Based on this knowledge, we devise a depth-first enumeration of the graphs in AI(q, θ). We assign a unique id to each edge in q, and a lexicographical order is assumed on these edge ids. Assume that there are z edges in q, whose ids are $e_1 < e_2 < \cdots < e_z$ according to the lexicographical order. (We will discuss how to assign the lexicographical order shortly.) Thus, each graph in AI(q, θ) can be uniquely represented by a sequence of edges (sorted according to the lexicographical order of the edges). The order of two distinct graphs q′ and q″ in AI(q, θ) can be determined based on their corresponding edge lists. Let the edge lists of q′ and q″ be $e'_1, e'_2, \ldots, e'_i$ and $e''_1, e''_2, \ldots, e''_j$, respectively. If one sequence is a prefix of the other, e.g., q′ is a prefix of q″, then we define q′ < q″. Otherwise, there exists an integer k ($k \le i$ and $k \le j$) such that $e'_k \neq e''_k$, and the order of q′ and q″ is determined as follows. Let k be the smallest integer such that $e'_k \neq e''_k$. Then q′ < q″ if and only if $e'_k < e''_k$.

By defining the lexicographical order of graphs, the graphs in AI(q, θ) can be enumerated in a depth-first manner from the lexicographically smallest to the lexicographically largest. First, the edge sequence (graph) with the smallest lexicographical order, $q_1$, is enumerated, which is $e_1, e_2, \ldots, e_l$ ($l = |E(q)| - \theta$). If $q_1$ has at least one match, then the edge with the smallest lexicographical order after $e_l$ is appended to $q_1$ to form a new graph $q_2$, as described in Algorithm 2 in the Appendix. (This procedure is illustrated as next in Figure 4.) This process continues on $q_2$ until no edge can be appended to $q_2$ or there is no match for $q_2$. In such a case, it is not necessary to enumerate any edge sequences containing $q_2$ as a prefix. The enumeration process will resume from the lexicographically smallest graph that does not contain $q_2$ and is larger than $q_2$.
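The next and jump steps just described can be sketched as a recursive generator in Python. This is a hedged illustration, not the paper's Algorithms 2 and 3: the name `enumerate_graphs`, the representation of a graph as a tuple of edge indices 1..z, and the `has_match` callback are assumptions, and θ is assumed to be smaller than the number of edges. The callback is expected to return True for disconnected edge sets, mirroring the rule that such graphs are not searched but their supergraphs are still enumerated.

```python
from itertools import combinations


def enumerate_graphs(num_edges, theta, has_match):
    """Yield matched edge-index tuples in depth-first lexicographic order.

    num_edges: z, the number of edges of q (ids 1..z in lexicographic order).
    theta:     edge edit distance threshold, assumed to satisfy theta < z.
    has_match: hypothetical callback; tuple of edge ids -> bool.
    """
    min_len = num_edges - theta

    def extend(seq):
        if not has_match(seq):
            return                      # "jump": skip all sequences with this prefix
        yield seq
        for e in range(seq[-1] + 1, num_edges + 1):
            yield from extend(seq + (e,))   # "next": append a later edge

    # Roots are the sequences with exactly |E(q)| - theta edges, smallest first.
    for root in combinations(range(1, num_edges + 1), min_len):
        yield from extend(root)


# Usage: four edges, theta = 2, and (e1, e2, e3) assumed to have no match.
order = list(enumerate_graphs(4, 2, lambda seq: seq != (1, 2, 3)))
print(order)
```

In this run (1, 2, 3, 4) is never tested, because its prefix (1, 2, 3) has no match, and the enumeration resumes at (1, 2, 4).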
This resume procedure is described in Algorithm 3 in the Appendix. (It is illustrated as jump in Figure 4.) Let us take a look at an example. Assume that q consists of four edges $e_1 < e_2 < e_3 < e_4$ and θ = 2. The lexicographically smallest graph in AI(q, θ) is $(e_1, e_2)$. If $(e_1, e_2)$ has at least one match, then $e_3$ is appended and $(e_1, e_2, e_3)$ will be enumerated next. If $(e_1, e_2, e_3)$ has no match, then any sequence whose prefix is $(e_1, e_2, e_3)$ will not be enumerated, namely the sequence $(e_1, e_2, e_3, e_4)$. Next, the lexicographically smallest graph that does not contain $(e_1, e_2, e_3)$ as a prefix and is larger than $(e_1, e_2, e_3)$ is enumerated, which is $(e_1, e_2, e_4)$. Figure 4 shows the enumeration order of graphs in this example. The graphs are enumerated in a top-down and left-to-right fashion. In this method, each graph in AI(q, θ) will be enumerated or reached at most once. Thus, at most |AI(q, θ)| graphs will be enumerated under this method.

Figure 4: The Enumeration Order (a tree over the edge sequences of the example, with transitions labeled next and jump and one sequence marked pruned)

Although any lexicographical order among edges will work, our goal is to prune the graphs in AI(q, θ) as early as possible. As a result, it is beneficial to search the graphs with the smallest number of matches first so that the most graphs in AI(q, θ) can be pruned. Therefore, the lexicographical order of edges is set according to the number of matches of each edge: $e_i < e_j$ if edge $e_i$ occurs fewer times in G than $e_j$. If two edges have the same number of occurrences/matches, then an order is assigned arbitrarily.

5.4 Graph Matching
After determining the enumeration order of query graphs, we continue to match these graphs in the enumeration order. When matching a graph q′, there are two cases: q′ is connected and q′ is not connected. In the case that q′ is not connected, it is not necessary to find matches of q′ since we are only interested in connected query graphs. However, it is possible that some supergraph of q′ is connected.
Thus, we pretend there is a match of q′ (without searching for the matches of q′), and continue to enumerate the supergraphs of q′ by appending an edge to q′. In the second case, where q′ is connected, we need to find matches of q′ in G. The matching process can again be divided into two cases according to q′: (1) we have not yet searched any prefix of q′ and (2) we have found match(es) of some prefix of q′. In the first sub-category, q′ is very likely to contain at least one pregenerated spanning tree. Thus, the matching of q′ can often start from the spanning trees. In the rare scenario that q′ does not contain any randomly generated spanning tree, the match has to start

from the vertex matches without the help of the spanning trees. The vertices are matched in depth-first order. To match a database graph vertex $v_g$ and a query graph vertex $v_q$, we require that (1) $v_g$ is in $M(v_q)$ and (2) for each edge $(v_q, u_q)$ adjacent to $v_q$ in q′, there exists a vertex $u_g$ such that the edge label of $(v_g, u_g)$ is the same as that of $(v_q, u_q)$ and $u_g$ is matched to $u_q$. This process is similar to other existing graph matching algorithms, e.g., GADDI [24], and hence we will not present it here due to space limitations.

When q′ contains at least one spanning tree, the following procedure is employed. First, the spanning trees contained by q′ are identified via the edges in q′ and those contained in the spanning trees. Assume q′ contains r spanning trees $T_1, T_2, \ldots, T_r$. Each match of q′ has to contain at least one occurrence of each of $T_1, T_2, \ldots, T_r$. Therefore, the matched vertices of the prime vertex v for q′ should be in $M(v, T_i)$ for all $1 \le i \le r$. Thus, $M(v, q') = \bigcap_{i=1}^{r} M(v, T_i)$ will serve as the starting point for finding the matches of q′ in G. Based on the match set $M(v, q')$, we search for the matches of the neighbors of v, and so on. After finding the matches of q′, for each match of q′ in G, we keep the mapping from the vertices in the match of q′ to vertices in q′. Figure 5 shows an example of matching q′ based on the matches of the spanning trees. We can see that $M(v, T_1)$ = {, 0} (circled by the solid ellipses) and $M(v, T_2)$ = {8, 0} (circled by the dotted ellipses). The intersection of the two sets is {0}, which is the starting point to match q′.

Figure 5: Matching q′ based on the matches of the spanning trees q′ contains (panels: the database graph G, the query graph q′, and the spanning trees $T_1$ and $T_2$, each with the prime vertex marked)

In the second sub-category, matches of some of the prefixes of q′ have been discovered. Let $q_2$ be the longest prefix of q′ such that the matches of $q_2$ have been identified. Also let $e_1, e_2, \ldots, e_i$ be the edges in q′ but not in $q_2$. For each match of $q_2$, we check whether $e_1, e_2, \ldots, e_i$ exist in G. If so, this will be a match of q′. Otherwise, this match of $q_2$ could not be extended to a match of q′.
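Both starting strategies reduce to simple set operations, which the following Python sketch illustrates. It is an assumption-laden illustration rather than the paper's Algorithm 4: the function names, the representation of a match as a dictionary from query vertices to database vertices, and the vertex ids in the usage lines are all hypothetical. Edge labels are ignored, and the prefix match is assumed to map both endpoints of every extra edge.

```python
def start_candidates(tree_match_sets):
    """Intersect the sets M(v, T_i) of the spanning trees contained in q'."""
    sets = [set(s) for s in tree_match_sets]
    return set.intersection(*sets) if sets else set()


def extend_matches(prefix_matches, extra_edges, g_edges):
    """Keep the matches of the prefix q_2 that also contain the extra edges.

    prefix_matches: list of dicts, query vertex -> database vertex.
    extra_edges:    edges of q' that are not in q_2, as query vertex pairs.
    g_edges:        undirected edges of the database graph G.
    """
    g = {frozenset(e) for e in g_edges}
    kept = []
    for mapping in prefix_matches:
        # The match survives only if every extra edge is present in G.
        if all(frozenset((mapping[a], mapping[b])) in g for a, b in extra_edges):
            kept.append(mapping)
    return kept


# Usage with made-up ids: two trees agree only on database vertex 7.
print(start_candidates([{4, 7}, {7, 9}]))

g_edges = [(7, 8), (8, 9), (7, 9), (4, 5), (5, 6)]
prefix = [{"a": 7, "b": 8, "c": 9}, {"a": 4, "b": 5, "c": 6}]
print(extend_matches(prefix, [("a", "c")], g_edges))
```

The second call keeps only the first mapping, since the database graph in this toy example has the edge (7, 9) but not (4, 6).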
This process continues until all matches of q_2 are examined. The formal algorithm is described in Algorithm 4. Figure 6 depicts an example of matching q_1 based on its subgraph q_2, which corresponds to the longest prefix of q_1. When matching q_1, we then only need to check the matches of q_2.

Figure 6: Matching q_1 based on its subgraph q_2 (panels: the database graph G; the query graph q_1; q_2, with the prime vertices marked).

Although the SAPPER algorithm employs approximation to accelerate the matching process, it can find all matches to the graphs that are approximately isomorphic to the query graph. Due to space limitations, the proof is in the Appendix. It is difficult to determine the exact time complexity of the SAPPER method since it depends on how many graphs in AI(q,θ) are enumerated. Since the subgraph isomorphism test is an NP-hard problem, the worst-case time complexity is exponential. We empirically analyze the time efficiency and scalability of the SAPPER method in the next section.

6. EXPERIMENTAL RESULTS

In this section, we empirically analyze the performance of SAPPER against TALE and GADDI, two of the most recent subgraph matching tools designed for large graphs, and against Basic SAPPER (BSAPPER). TALE is efficient in index construction and heuristically finds the approximate matches of the query graph. GADDI enumerates all possible approximate isomorphic graphs (AI(q,θ)) of the query graph and finds all exact matches for each of these graphs. To show the pruning power of the random spanning trees and the lexicographical order, we also include BSAPPER in the comparison. BSAPPER employs the same indexing structure as SAPPER, but it differs from SAPPER in two aspects: (i) BSAPPER does not use spanning trees; (ii) BSAPPER uses a breadth-first enumeration order similar to the level-wise search algorithm in [2]. In the first level, all the graphs θ edge edit distance away from the query graph q are enumerated and queried.
Next, it enumerates the graphs θ − 1 edge edit distance away in the second level; a graph is enumerated in the second level only if there exists at least one match for each of its subgraphs in the first level. This process continues until either the level containing q is reached or no graph can be enumerated based on the subgraph property. The performance difference between BSAPPER and SAPPER is essentially the effect of the random spanning trees and the lexicographical-order query graph enumeration, while the performance difference between BSAPPER and GADDI is the effect of the bloom filters. All methods are implemented in C++ and run on a Dell PowerEdge 290 with two .0 GHz dual-core CPUs, 6 GB main memory, and a Linux smp system.

6.1 Protein Interaction Network

In this set of experiments, the graph is generated from a subset of the protein interaction network for homo sapiens. Each vertex represents a protein and the label of the vertex is its gene ontology term from [25]. An edge in the graph represents an interaction between the two proteins it connects. There are 640 vertices, 844 edges, and the average degree of a vertex is 6.8. There are a total of 62 distinct labels. SAPPER spends about 2 minutes to construct an index of 60MB, while TALE spends 0 minutes to construct an index of MB, and GADDI spends minutes to construct a 00MB index. As SAPPER processes more information than TALE, it takes more time to construct the index. Since we only need to build an index structure for each database graph once, the query time is much more important than the index building time.

To evaluate these four methods, we use eight known signal transduction pathways from the KEGG database [13] to query the protein interaction network. These known pathways are from species other than homo sapiens, e.g., flies and yeast. Since some protein interactions exist only in yeast or flies and not in human, there are missing edges in the homo sapiens protein interaction network. If θ is set to 2, all eight signal transduction pathways should be recovered in our homo sapiens protein interaction network. Thus, we use these eight pathways as the query graphs and set θ to 2. SAPPER, BSAPPER and GADDI find all eight pathways successfully. Among these three methods, SAPPER is much faster than the remaining two due to its advanced pruning techniques. Since TALE is a heuristic algorithm, it finds only two of the eight pathways. Although TALE runs very fast, its accuracy (e.g., recall) is not high. The execution times of SAPPER, BSAPPER, GADDI, and TALE are shown in Figure 7. The numbers of vertices in the eight known pathways are 9, 0,, 2, and 4. Thus, we report the average execution time with respect to the number of vertices in each query graph.

Figure 7: The Performances of the Queries on Protein Interaction Network

6.2 Synthetic Data Sets

In this portion of the experimental studies, we analyze the performance of SAPPER, BSAPPER and GADDI by independently varying each of six parameters on a set of synthetically generated graphs. We do not include TALE because, although it can finish the queries efficiently, it discovers only around 20% of all the approximate matches, as shown on the real data set. To systematically analyze the performance of these methods, we vary one parameter at a time. The default values of the parameters are listed in Table 1.

Table 1: Default Parameter Values
Parameter                   Default Value
Number of vertices in G     000
Number of vertices in q     20
Number of labels            20
θ
Average degree of G         8
Average degree of q         4

The index construction comparisons are shown in Figure 8. We first vary the number of vertices in G. GADDI needs more time to construct the index than SAPPER because it needs to calculate the NDS distances for neighboring vertices. Due to the compactness of the bloom filter, the index of SAPPER is consistently smaller than that of GADDI. When the number of vertices in G is 0,000, SAPPER takes around 8000 seconds to build an 80 MB index. Next, we vary the average vertex degree of G. This affects SAPPER more on the index construction time since the number of 2-hop neighbor vertices grows exponentially with respect to the average degree.

Figure 8: Comparisons of the Indices (panels: (a) Index Construction Time, (b) Index Size, (c) Index Construction Time, (d) Index Size)

Now the average query times of these methods under different parameters are analyzed. The first parameter is the number of vertices in G. |V(G)| is varied from 200 to 0,000. SAPPER and BSAPPER achieve better matching efficiency than GADDI as they can quickly match vertices by the index and optimize the approximate matching process. SAPPER outperforms BSAPPER due to the effectiveness of the random spanning trees and the lexicographical order pruning techniques. The results are shown in Figure 9(a).

Next we vary the number of vertices in the query graph q. We show the result in Figure 9(b). With more vertices in q, more vertices and edges need to be compared in the query process, so the query times of all three methods increase. The increase is more evident with |V(q)| ≥ 40, as the methods need to find all approximate matches. This holds especially for GADDI, which processes more candidate graphs for a large query graph without pruning techniques.

The third parameter we vary is the number of distinct labels. From Figure 9(c), we can see that more labels in G increase the pruning power of GADDI, but have a mixed effect on SAPPER. This may be due to the fact that SAPPER only indexes a subset of the labels of neighboring vertices. Increasing the number of distinct labels reduces the number of candidate matches between any pair of vertices in G and q, but also decreases the pruning power of SAPPER's index.
The approximation threshold parameter θ is varied and the results are shown in Figure 9(d). As θ increases, the query time of SAPPER remains lower than that of GADDI and BSAPPER, because GADDI needs to generate all possible candidate graphs, whose number increases dramatically with θ. On the other hand, due to the use of the advanced pruning techniques, the query time of SAPPER increases at a slower pace.

The fifth parameter we vary is the average degree of G and the results are shown in Figure 9(e). A high degree in G means more edges have to be examined when matching a pattern, and the query times of these three methods grow at basically a similar rate.

Last, we vary the average degree of a vertex in q. The results are shown in Figure 9(f). It is obvious that the higher the average degree of q is, the more information q possesses for pruning vertices in G. However, a high vertex degree also generates more potential candidate query graphs since the number of candidate query graphs is exponential in the average degree of q. When the average degree of q is 2, there are few edges to be examined and all algorithms are efficient. When the average vertex degree of q is larger than 6, the number of edges that need to be compared grows exponentially, which results in GADDI's long response time.

Figure 9: Query Time on Different Parameters (panels: (a) |V(G)|, (b) |V(q)|, (c) Number of Labels, (d) Different values of θ, (e) Average Degree of G, (f) Average Degree of q)

The main difference between TALE and SAPPER is the accuracy. TALE is a heuristic method which does not find all approximate matches of a pattern, while SAPPER is an exact method that finds the complete set of the approximate matches. Thus, if the goal is to take a quick look at the approximate matches of a query graph in the database, TALE is an efficient and convenient tool. On the other hand, SAPPER is a better choice if the complete set of approximate matches needs to be retrieved. The main difference between GADDI and SAPPER is the efficiency. Although GADDI can find all approximate matches by enumerating all approximate isomorphic graphs of the query graph, this is a very time-consuming process. The performance of BSAPPER is between that of GADDI and SAPPER since it utilizes the bloom filter to match vertices and the subgraph property to prune query graphs, but without the help of the random spanning trees and the lexicographical order. Therefore, when the goal is to discover all approximate matches, SAPPER is preferred.

7. CONCLUSION

Due to the existence of noise (e.g., missing edges) in the large database graph, we investigate the problem of approximate subgraph indexing, i.e., finding the occurrences of a query graph in a large database graph with (possible) missing edges. In this paper, we have proposed a subgraph indexing and matching method (SAPPER) to find all approximate matches of a query graph. SAPPER constructs the HNU index to accelerate query processing. During query time, SAPPER improves matching efficiency by using pre-generated random spanning trees and a lexicographical query graph enumeration order. To the best of our knowledge, this is the first attempt to find the complete set of approximate matches in a single large graph. With a large set of real and synthetic data, we demonstrate that the SAPPER approach can outperform the alternative methods in accuracy while achieving good efficiency.

8. REFERENCES

[1] D. J. Aldous, The random walk construction of uniform spanning trees and uniform labelled trees, SIAM J. Discrete Math, 1990.
[2] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proc. of VLDB, 1994.
[3] B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM 13(7), 1970.
[4] J. Cheng, Y. Ke, W. Ng and A. Lu, FG-Index: towards verification-free query processing on graph databases, Proc. of SIGMOD.
[5] B. Chazelle, J. Kilian, R. Rubinfeld and A. Tal, The bloomier filter: an efficient data structure for static support lookup tables, Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms.
[6] L. Cordella, P. Foggia, C. Sansone and M. Vento, A (sub)graph isomorphism algorithm for matching large graphs, PAMI.
[7] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna and R. Sharan, QNet: a tool for querying protein interaction networks, Proc. of RECOMB.
[8] T. Nguyen, H. Nguyen, N. Pham, J. Al-Kofahi and T. Nguyen, Graph-based mining of multiple object usage patterns, Proc. of the Joint Meeting of ESEC and ACM SIGSOFT.
[9] R. Giugno and D. Shasha, GraphGrep: a fast and universal method for querying graphs, Proc. of ICPR.
[10] J. Han, J. Pei and Y. Yin, Mining frequent patterns without candidate generation, Proc. of SIGMOD.
[11] H. He and A. K. Singh, Closure-Tree: an index structure for graph queries, Proc. of ICDE.
[12] H. Jiang, H. Wang, P. Yu and S. Zhou, GString: a novel approach for efficient search in graph databases, Proc. of ICDE.
[13] M. Kanehisa and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes, Nuc. Ac. Res, 2000, 28:27-30.
[14] M. Koyuturk, A. Grama and W. Szpankowski, Pairwise local alignment of protein interaction networks guided by models of evolution, Proc. of RECOMB, 2005.
[15] F. Mandreoli, R. Martoglia, G. Villani and W. Penzo, Flexible query answering on graph-modeled data, Proc. of EDBT.
[16] M. Mongiovi, R. Natale, R. Giugno, A. Pulvirenti and A. Ferro, A set-cover-based approach for inexact graph matching, Proc. of CSB.
[17] R. Pinter, O. Rokhlenko, E. Yeger-Lotem and M. Ziv-Ukelson, Alignment of metabolic pathways, Bioinformatics, 2005.
[18] H. Shang, Y. Zhang, X. Lin and J. Yu, Taming verification hardness: an efficient algorithm for testing subgraph isomorphism, PVLDB.
[19] Y. Tian and J. Patel, TALE: a tool for approximate large graph matching, Proc. of ICDE.
[20] J. Ullmann, An algorithm for subgraph isomorphism, J. ACM, 1976.
[21] X. Wang, A. Smalter, J. Huan and G. Lushington, G-Hash: towards fast kernel-based similarity search in large graph databases, Proc. of EDBT.
[22] X. Yan, P. Yu and J. Han, Graph indexing, a frequent structure-based approach, Proc. of SIGMOD.
[23] S. Zhang, M. Hu and J. Yang, Treepi: a novel graph indexing method, Proc. of ICDE.
[24] S. Zhang, S. Li and J. Yang, Gaddi: distance index based subgraph matching in biological networks, Proc. of EDBT.
[25] Gene Ontology.
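Algorithm 1 in Appendix A generates a random spanning tree by a random walk, following the construction of Aldous [1]. A minimal Python sketch of that procedure is given here for illustration only. It assumes a connected, undirected graph given as adjacency lists, and it assumes the transition matrix P is that of a simple random walk (each neighbor equally likely), which the listing does not spell out. The name `random_spanning_tree` and the `rng` parameter are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Optional, Tuple

Vertex = Hashable


def random_spanning_tree(
    adj: Dict[Vertex, List[Vertex]],
    rng: Optional[random.Random] = None,
) -> List[Tuple[Vertex, Vertex]]:
    """Return the edge list of a spanning tree built by a random walk.

    Assumes `adj` describes a connected undirected graph; on a disconnected
    graph the loop would never terminate.
    """
    rng = rng or random.Random()
    vertices = list(adj)
    current = rng.choice(vertices)       # X_0: random start vertex
    visited = {current}                  # vertex set S
    tree_edges = []                      # edge list E
    while len(visited) < len(vertices):
        nxt = rng.choice(adj[current])   # one step of the walk along an edge
        if nxt not in visited:
            # First visit to nxt: the edge used to enter it joins the tree.
            visited.add(nxt)
            tree_edges.append((current, nxt))
        current = nxt
    return tree_edges
```

Every vertex other than the start contributes exactly one edge, the one by which the walk first reached it, so the result always has |V(q)| − 1 edges and contains no cycle.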

APPENDIX

A. FORMAL ALGORITHM DESCRIPTION

Algorithm 1 Generating Random Spanning Tree
Input: graph q.
Output: random spanning tree t of q.
  Construct the transition matrix P from q.
  Vertex set S ← ∅, edge list E ← ∅.
  Randomly select a vertex X_0 of q.
  S ← S + X_0.
  v ← X_0.
  while |S| < |V(q)| do
    randomly select a vertex w by P such that e_vw exists.
    if w ∉ S then
      E ← E + e_vw
      S ← S + w
    end if
    v ← w
  end while
  Output the graph composed of edge list E.

Algorithm 2 LEXI Next
Input: sequence s, edge list EL = {e_1, ..., e_l}, threshold θ.
Output: the next sequence of s.
  L ← Length(s)
  if s(L) < e_l then
    e_x ← s(L)
    return Sequence s(1), ..., s(L) e_{x+1}
  end if
  return LEXI JUMP(s, EL, θ)

Algorithm 3 LEXI Jump
Input: sequence s, edge list e_1, ..., e_l, threshold θ.
Output: the next sequence of s which is not a super-sequence of s.
  if ∃ i, s.t. s(i) < e_{l−(L−i)} then
    x ← MAX{i : s(i) < e_{l−(L−i)}}
    e_t ← s(x)
    if x ≥ l − θ then
      return Sequence s(1), ..., s(x−1) e_{t+1}
    end if
    return Sequence s(1), ..., s(x−1) e_{t+1} e_{t+2} ... e_{t+l−θ−x}
  end if
  return end

Algorithm 4 SAPPER
Input: database graph G, query graph q, threshold θ.
Output: approximate matches of q.
  Sort q's edges decreasingly by their number of matches in G, l ← |E(q)|
  edge list EL ← e_1, ..., e_l, (∀ i, e_i ∈ q).
  s ← e_1, ..., e_{l−θ}
  while s ≠ end do
    if the graph corresponding to the longest prefix of s is not matched yet then
      Find and output the exact matches of g(s) with the help of the matches of the spanning trees, if it contains any
    else
      Find and output the exact matches of g(s) according to the matches of the graph corresponding to the longest prefix of s
    end if
    if g(s) has no match then
      s ← LEXI JUMP(s, EL, θ)
    else
      s ← LEXI Next(s, EL, θ)
    end if
  end while

B. PROOF OF CORRECTNESS OF SAPPER

The proof of the correctness of SAPPER is divided into two parts.
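Both parts rest on the enumeration order of Algorithms 2 to 4, in particular on the jump step skipping only super-sequences of a sequence that has no match. As a rough illustration, the following Python sketch enumerates edge subsequences depth-first in lexicographic order with that pruning. It is a simplified recursive variant, not a transcription of LEXI Next and LEXI Jump: edges are represented by their positions in the sorted edge list, the function name `enumerate_candidates` and the `has_match` callback are hypothetical, and connectivity is left to the callback (the paper treats a disconnected prefix as if it had a match).

```python
from typing import Callable, Iterator, List, Sequence


def enumerate_candidates(
    num_edges: int,
    theta: int,
    has_match: Callable[[Sequence[int]], bool],
) -> Iterator[List[int]]:
    """Yield increasing edge-index sequences with at least num_edges - theta
    edges, in depth-first lexicographic order, skipping every extension of a
    sequence for which has_match returns False."""
    min_len = num_edges - theta

    def dfs(prefix: List[int], start: int) -> Iterator[List[int]]:
        for i in range(start, num_edges):
            seq = prefix + [i]
            # Too few edges remain after position i to reach min_len; larger
            # i only makes this worse, so stop scanning.
            if len(seq) + (num_edges - 1 - i) < min_len:
                break
            if not has_match(seq):
                continue  # "jump": no super-sequence of seq is explored
            if len(seq) >= min_len:
                yield seq
            yield from dfs(seq, i + 1)

    yield from dfs([], 0)
```

The pruning is safe because any match of a graph contains a match of each of its subgraphs, so a sequence without a match cannot be extended into one that has a match.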
First, we prove that given a query graph q, a database graph G, and an approximation threshold θ, for every connected graph s where Dist_e(s, q) ≤ θ and there exists at least one match of s in G, SAPPER will enumerate s (described in Section 5). Second, we prove that if s is enumerated in Section 5, all of its matches in G will be discovered.

Lemma 1: SAPPER enumerates every candidate graph s of query graph q such that D_edit(s, q) ≤ θ and s has at least one exact match in G.

Proof: The lexicographical order enumerates every graph s such that D_edit(s, q) ≤ θ in a depth-first style. When we find that such a graph (denoted as s') does not have any exact match in the database graph, we perform a jump procedure. The graphs we skip are all supergraphs of s', which cannot have any exact match in the database graph, and hence are not candidate graphs. Therefore, we enumerate all candidate graphs s of query graph q such that D_edit(s, q) ≤ θ and s has at least one exact match in G.

Lemma 2: SAPPER finds all exact matches of any candidate graph s.

Proof: For a candidate graph s, if we have not yet searched for any prefix of s and s does not contain any pre-generated random spanning trees, then we perform depth-first matching for s, which will not miss any exact match of s. Otherwise, we start the search from either the matches of the prefix candidate graph of s or the intersection of the matches of the pre-generated random spanning trees contained in s. Both the prefix candidate graph of s and a pre-generated random spanning tree contained in s are subgraphs of s. Since any exact match of s must contain at least one exact match of any subgraph of s, by the subgraph property, we will not miss any exact match of s in this scenario either. Therefore, SAPPER can find all exact matches of any candidate graph s.

Theorem 1: SAPPER finds all approximate matches of query graph q.

Proof: By Lemma 1, SAPPER enumerates all candidate graphs of the query graph. By Lemma 2, for any candidate graph s, SAPPER finds all matches of s. By the definition of approximate matches, SAPPER finds all approximate matches of q.

C. EDGE ADDITIONS/DELETIONS AND DISCONNECTED MATCHES

In this paper, we focus on approximate matches with the following two restrictions: (1) the match has to be connected and (2) only edge additions but not edge deletions are considered. The rationale behind these two restrictions is the following. If unconnected matches are considered, there could be too many of these matches. Moreover, these unconnected matches may not be useful in many applications. Thus, in this paper, we focus on finding connected matches. Edge deletions could be as important as edge additions. In most cases, a match with edge deletions is a supergraph of some other approximate matches. For instance, if g_2 can be obtained by deleting some edge from g_1, then g_1 has to contain g_2. For an approximate match g_2, if the edit distance between g_2 and the query graph q is less than θ, then by adding different edges to g_2, a (potentially) large number of matches will be discovered and all these matches
