
Efficient structural similarity computation between XML documents

Ali Aïtelhadj
Computer Science Department, Faculty of Electrical Engineering and Computer Science, Mouloud Mammeri University of Tizi-Ouzou (UMMTO), Tizi-Ouzou, Algeria

Abstract

This work is mainly motivated by the description of a new approach for calculating the structural similarity of XML documents. Practically, the majority of existing work on XML documents clustering considers the tree structures of these documents as mere vectors and, therefore, does not take into account their hierarchical contexts. Furthermore, in order to calculate the structural similarity of XML documents, most methods encountered in these works perform a depth-first traversal to visit the nodes of the tree structures of these documents. More precisely, it is the preorder tree walk which is usually the most used. Recently, other studies present an alternative approach that takes into account the hierarchical contexts of these tree structures, but unfortunately, they have a particularly high time complexity in the calculation of structural similarity. In this paper, we propose a new method based on a breadth-first traversal of these tree structures. The goal consists in clustering more rapidly XML documents sharing similar structures. Besides the fact that the method is fast, it also takes into account the hierarchical contexts of XML documents. Reconciling the speed required for clustering XML documents with taking into account the hierarchical contexts of their tree structures ensures a higher reliability of the proposed method. To validate our proposal, experiments were conducted on both real and synthetic XML data. The results clearly demonstrate the viability of our approach.

Keywords: Clustering, Structural similarity, hierarchical context, Tree level, Ancestor and descendant levels, depth- and breadth-first traversals.

1. Introduction

XML has now become an unchallenged standard for the representation and exchange of data on the web. This has led to the increase in heterogeneous XML sources.
Furthermore, not only are the collections of XML documents reused, but their interchange volume is continuously growing. However, with the currently available means, searching for information in these documents is not a trivial task. XML documents are characterized by content and structure. However, such documents cannot be exploited efficiently by conventional information retrieval methods. Indeed, these methods are based on content-oriented models, while the XML format allows adding structural constraints. This then requires adapting these models to better exploit the available XML data. Similarly, traditional approaches to data processing, such as relational databases, have proven ineffective. These are mainly designed for strongly structured data, whereas XML data are semi-structured [0]. In addition to this, given the heterogeneity and proliferation of XML documents on the web, it becomes difficult for a user to access the desired information. In this context, many authors propose classification methods to organize and analyze large collections of XML documents. Our work falls within this perspective; we are interested in the clustering of XML documents based on their structures. The idea behind the clustering is that if XML documents share similar structures, they are more likely to correspond to the structural part of the same query. This therefore allows reducing the response time and increasing the accuracy of search engines. In other words, it can substantially improve the process of information retrieval. Thus, searching for relevant information in a large collection of documents then amounts to querying small classes of documents.

The XML clustering task consists in grouping XML documents into clusters containing similar documents. This similarity could be thematic or structural. In this paper, we are particularly interested in XML document clustering using the structural similarity of their descriptions, i.e., the XML ordered labeled tree providing the relations between the document elements.
We will therefore address the problem of structural clustering of XML documents as we would address the problem of clustering tree structures [,, 3, 4, 4]. In other words, a structural clustering approach for XML documents can be exploited in various areas that require management of hierarchical structures, such as the discovery of structurally similar web navigational pathways, or tree-like patterns, and the discovery of structurally similar macromolecular tree patterns in bioinformatics [4, 34, 7]. Structural similarity allows grouping documents that share similar structures []. It helps to better organize XML documents on the one hand and, on the other hand, to better answer, in terms of efficiency and effectiveness, queries containing structural conditions. We recall that queries in XML information retrieval could contain

keywords only or keywords and structural conditions. The main question we address in this context is how to cluster XML documents structurally when their DTD is unknown. Our methodology in this paper is two-step. In the first step, each XML document is represented by its tree summary structure, which is used as a representation model to classify the corresponding XML document. In the second step, an efficient structural similarity measure based on a breadth-first traversal of these tree summary structures is proposed. Within this framework, the most important question deals with the way to measure the structural similarity of XML documents. This is the question we attempt to answer in this work.

This paper is organized as follows. Section 2 provides a summary view of related work about classification of XML documents by structure. Section 3 describes our clustering approach. Section 4 is dedicated to the experimentation. Finally, in Section 5 we conclude and describe the future work.

2. State-of-the-Art

The classification approaches are divided into two main variants called supervised classification and unsupervised classification (or clustering). Existing works on the classification of XML documents can be distinguished by the way they represent the documents, but also by the classification and/or clustering methods used. We focus here on approaches that represent documents by structure only. Within this framework, we distinguish two main categories: document-based clustering, where the clustering is based on the document structure itself, and DTD-based clustering, where documents are clustered according to their DTD. We briefly describe below some of the most known approaches, highlighting their main features. We particularly focus on approaches that represent the document structures by labeled trees.

In the first category, namely clustering based on document structure, the structure that is used to classify a document is issued from the document itself. This structure could be either a labeled tree corresponding to the original structure of the XML document (the whole structure of the document) [5,, 9, 30, 3, 40] or a rooted ordered labeled tree summary [,, 3, 4]. The latter has the advantage of reducing the computational complexity in the clustering, but has nevertheless the drawback of breaking the relationships between XML elements. In the study by [4], a tree summary is obtained by two transformations, as shown in Fig. 1: (i) the first one reduces the depth of the tree so that the children of any node having the same label as one of its ancestors become direct descendants (children) of this ancestor, (ii) the second one eliminates duplication of sibling nodes, while in the study by [,, 3], the tree summary is obtained only by eliminating duplication of sibling nodes, i.e., hierarchical relationships between XML elements are not completely changed, and so there is no loss of information.

Fig. 1 Tree summary extraction

In the study by [, 5, 39], another way of representing the structure of XML documents is proposed: these approaches are based on the discovery of the sub-trees that most frequently occur in collections of labeled trees representing XML documents. More precisely, in the study by [39], the frequent sub-trees can be unordered, whereas in the other two approaches [, 5], they absolutely must be ordered. All sub-trees below the arrow in Fig. 2 are frequent according to the approach [39]. The last two of them are exact and ordered, so they are also frequent according to the approach [, 5]. This indicates that the approach [39] is more general and easily applied to heterogeneous XML documents, as opposed to the approaches [, 5] that only apply to XML documents sharing the same DTD or XML Schema. Another proposal [0] consists of linearizing the structure of each XML document, by representing it as a numerical sequence and, then, comparing such sequences through the analysis of their frequencies. The clustering then consists in comparing the extracted structure with a cluster or its representative. This representative, usually called centroid, is the most representative tree summary of all XML documents in the cluster [,, 3].
The centroid may change, i.e., it can be replaced by a more appropriate tree, depending on the assignment of new documents to the cluster. In the study by [39], a cluster is characterized by the maximal frequent sub-tree, i.e., the frequent sub-tree that has the greatest number of nodes among all sub-trees in this cluster.

Note that with the frequent sub-trees technique, an XML document may belong to several clusters. In the study by [4,, 9, 30, 3, 40], each cluster is represented by a subset of similarly structured XML documents, whereas with the approaches of [,, 4, 39], a cluster has only one representative (the centroid or the maximal frequent sub-tree); this means that, in the process of detecting the appropriate cluster, the representative of a new document is compared only with the representative of the cluster. Note that in some approaches like [,, 4], a new comparison is undertaken for a possible change of the centroid.

Fig. 2 Frequent sub-trees detection

Concerning the clustering process itself, several approaches have been proposed. Authors in [30, 40] propose an incremental clustering based on common path similarity, taking into account different criteria such as the number of common nodes between the XML tree (that of the XML document to be classified) and the trees of the cluster considered, the number of common node paths, and the order of the nodes of the XML tree. In the study by [,, 3], the authors also perform an incremental clustering, but it is based on structural similarity between the centroid and the structure of the XML document to be classified. For detecting structural similarity between XML documents, the authors in [0] exploit the theory of the discrete Fourier transform to effectively and efficiently compare the encoded documents (i.e., signals) in the frequency domain. This approach significantly differs from standard methods based on graph-matching algorithms and allows a significant reduction in the required computation costs. Indeed, if N is the maximum number of tags in two documents, their matching complexity is $O(N \log N)$, whereas it is $O(N^2)$ with those based on edit distance, such as Chawathe's algorithm [6] and the one proposed in [4]. In the study by [, 4,, 3], the similarity between two trees is based on their edit distance. Edit distance measures the number of elementary operations needed to transform one tree into another.
Most algorithms for calculating the tree edit distance are based on dynamic programming techniques [6 9, 35, 37, 4, 4]. Note, however, that there may be several sequences of edit operations to transform one tree into another. Therefore, the cost of the operations in each sequence is considered, and the lowest-cost sequence among these defines the edit distance between trees [3]. Edit distance allows performing clustering of these trees using a bottom-up hierarchical classification method [, 4,, 3]. In the clustering approaches based on frequent sub-trees, the authors of [39] have developed an algorithm to detect the maximal sub-tree. Similarly, the authors in [5] have also developed their algorithm, very close to the algorithms FREQT and TREEMINER proposed respectively by [4] and [45]. An XML tree can appear in multiple clusters. In other words, an XML document can be shared by several clusters, i.e., it consists of several sub-trees appearing respectively in several different clusters. Thus, under these characteristics, we can say that these approaches belong to the family of non-exclusive (or overlapping) clustering.

Concerning the second category of approaches, namely DTD classification, we list below some of the most known. Recall that the DTD is considered as a context-free grammar that generates a potentially infinite number of XML documents. From this fact, instead of classifying directly the XML documents, the approach proposes to classify their DTDs in clusters. Thus, each cluster becomes the representative of a set of structurally similar XML documents. The advantage is that it is possible to more quickly integrate a considerable number of XML documents together. The drawback is that the nodes of DTD trees often denote regular expressions whose handling is not always a trivial task and furthermore causes (in some cases) loss of information [8]. In [8], the authors propose a DTD clustering model, named XClust. Each DTD is represented by its tree structure.
The similarity of two nodes is calculated by exploiting different levels of the tree, namely the ontological similarity of the nodes (using a dictionary), the similarity of their immediate descendants (children), the similarity of their ancestors and finally, that of the leaves of the sub-trees of which they are respectively the roots. In [36], the authors propose a mechanism which identifies syntactically the similarity of DTDs by adopting an ascending clustering strategy. Compared with XClust, the authors in [4] exploit only the immediate descendant context. In [7], the authors develop an algorithm which is based on a generic scheme of DTD matching. The matching goal is to reach a median scheme corresponding to the DTDs that are similarly structured. Like [8], in order to

calculate the DTD similarities, [7] relies on the dictionary, but it only exploits the leaves context. Finally, the approach proposed in [8] is based on learning and inference combined with an instance of DTD. In fact, it works with supervised classification, i.e., the classes are known before starting this classification.

There exist other methods, such as the approaches of [6, 9, 3, 4, 5, 4, 43, 44], originally dedicated to classifying or clustering XML documents using both structure and content. These methods are in fact more general since they offer flexible models that can easily be adapted for dealing with structure only. For instance, in the study by [6, 43], the authors propose a network-based stochastic model that is able to describe different kinds of relationships of XML elements. It was proved that this model is easily adapted to the structure alone. The model is based on Bayesian networks [3], to infer the different types of structural relationships in an XML tree.

3. Clustering approach based on structural similarity

Our clustering approach belongs to the first category, namely, clustering based on document structure. The structure used to structurally classify a document is issued from the document itself. Specifically, for classifying (clustering in our case) XML documents by structure, we suggest the usage of their corresponding labeled tree structural summaries. The labels correspond to tags or attributes. In our approach, attributes are treated as mere tags. A labeled tree summary representing an XML document is automatically extracted from the document by a parser. This extracted tree summary is then used as a model of representation by a classifier to classify the corresponding XML document. We show and explain in Subsection 3.1 how this parser works. Finally, we give a description of the proposed similarity measure in Subsection 3.3. Note that the latter is based on the breadth-first traversal of the tree structures of XML documents.

3.1 Tree structural summary extraction

We propose to represent XML documents by their tree structural summaries, which need minimal processing and especially avoid the loss of information. The basic idea is that repetitions of tags and/or the possibility of having optional tags (and consequently sub-trees) are one of the reasons why XML documents can be structurally different even though they share the same DTD. In this context, our tree summary is regarded as a generic structure in the sense that, when sibling tags are duplicated, it is not necessary to have this duplication in the structure that we wish to extract. Note however, that to avoid losing information, duplications of nested and/or cousin tags are not removed as duplications of sibling tags are. One way to avoid this is to consider them as immediate descendants (sub-tags) of the tags in which they appear in XML documents. In Fig. 3 we show an overview of this representation approach focusing on all its features. Indeed, the transformation of the original tree (i) into the tree summary (ii) shows that the attribute becomes in fact an immediate (direct) descendant (child) node of the root node. As for the duplication of sibling nodes, it is removed while keeping the children attached to a single occurrence of the duplicated node; the nodes c, which were originally cousin nodes in (i), have thereby become siblings, and their duplication is eliminated as well. However, as recommended, duplications of nested nodes are maintained.

Fig. 3 Representation approach of an XML document

Our algorithm for extracting the tree summary is two-step. The first step is based on the SAX (Simple API for XML) API (Application Programming Interface), which returns all the tags and attributes encountered in an XML document. These tags (or attributes) are intercepted, filtered and then transformed by our parser into an intermediate form, as shown through the parenthesized expression in Fig. 4. In the second step, this intermediate parenthesized expression is transformed by another parser into the corresponding tree summary, according to the principles of our approach, i.e., by eliminating duplication of sibling nodes and considering each attribute as an immediate descendant of the element (tag) to which it is attached in the XML document. Most of the extraction task is performed during this second step. In fact, three essential operations are performed at this level: passage from the linear form of the XML document to its hierarchical representation; removal of repetitions of sibling nodes; transformation of possible attributes into immediate descendants of the elements to which they are attached in the XML document.

Thus, instead of the original trees to represent XML documents, we use their tree structural summaries, but without loss of information, since we remove only the repetitions of sibling nodes. This allows, on the one hand, performing the matching of these trees more quickly and easily and, on the other hand, providing a high-quality clustering.
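The extraction rules of this subsection can be illustrated with a minimal sketch. It is not the two-step SAX parser described in the text: as a simplifying assumption it uses Python's standard `xml.etree.ElementTree` to obtain the hierarchy directly, and the function names `summarize` and `merge_siblings`, the nested-tuple representation and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_siblings(nodes):
    """Merge sibling nodes that share a label, keeping first-occurrence order."""
    merged = {}  # label -> children gathered from every duplicate sibling
    for label, kids in nodes:
        merged.setdefault(label, []).extend(kids)
    # Merging can turn former cousins into siblings, so deduplicate them too.
    return [(label, merge_siblings(kids)) for label, kids in merged.items()]


def summarize(element):
    """Return the tree summary of `element` as a nested (label, children) tuple."""
    # Attributes are treated as mere tags: each one becomes a child leaf.
    children = [(name, []) for name in element.attrib]
    # Child elements are summarized recursively; text and comments are ignored.
    children.extend(summarize(child) for child in element)
    return (element.tag, merge_siblings(children))


# Hypothetical document: three sibling <b> elements, two of them with attributes.
doc = "<a><b>x<c>y</c></b><b t='1'>z</b><b t='1' s='2'>q</b></a>"
print(summarize(ET.fromstring(doc)))
# ('a', [('b', [('c', []), ('t', []), ('s', [])])])
```

Only nodes with the same label under the same parent are merged, so a tag nested inside a tag of the same name is preserved, in line with the rule that nested duplications are maintained.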

Fig. 4 An XML document and its corresponding tree summary

3.2 XML documents clustering

3.2.1 Overview of the clustering technique used

For clustering XML documents based on structural similarity, we use well-known techniques of hierarchical agglomerative clustering (although any form of clustering could be used). Hierarchical methods perform mergers between data sets; the pairs of elements (or clusters) are successively merged until there is only one large set containing all elements. The end result can be schematically represented as a tree of clusters named a dendrogram, as shown in Fig. 5.

Fig. 5 Dendrogram of the ascending hierarchical classification

The dendrogram shows the clusters that were merged together, and the minimum similarity between these merged clusters. There are several methods of hierarchical ascending classification. They are all based on the following idea:
a) Initially, each element of the data set to be classified is regarded as a cluster.
b) Clusters separated by a minimum distance (i.e., maximum similarity) are grouped together. The distances between the remaining clusters and the new cluster are recalculated.
c) If there is more than one cluster, or the minimum distance (or maximum similarity) threshold has not yet been reached, go to step b.

With some methods, the distance between two clusters X and Y is defined as the minimum distance (maximum similarity) between all the pairs of elements (x, y) such that x is in X and y in Y. With other methods, it is the average distance (average similarity) which is considered as the parameter of the separation of clusters. We chose a clustering method which is based on the minimum distance (i.e., the maximum similarity). We then used the single link clustering algorithm, using Prim's algorithm [0] to calculate the MST (Minimum Spanning Tree or shortest path) of a graph.
Given a graph G = (N, A) with a set of weighted edges A and a set of nodes N, the minimum spanning tree (MST) of the graph is an acyclic subset $T \subseteq A$ that connects all nodes and whose total weight (cost, distance, value, etc.), denoted W(T) (the sum of the weights of T's edges), is minimized. It was shown in [] that the MST contains all the information required to implement the single link clustering. Given a set of rooted labeled ordered trees representing XML documents, we form a complete graph G with n nodes N and weighted edges A. The weight of an edge is the structural distance between the nodes it connects. Nodes represent XML trees in our case. For example, the single link clustering for a threshold l can be carried out by removing from the MST of the graph G all the edges having weight $\geq l$. The connected nodes of the remaining graph are the single link clusters. Fig. 6 shows a graph with 7 nodes (corresponding to 7 XML documents) and 10 edges.

Fig. 6 Graphical representation of the distances between XML trees
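As a rough illustration of single link clustering through the MST, the sketch below builds the MST of a complete graph with an adjacency-matrix version of Prim's algorithm, cuts the edges whose weight reaches the threshold, and returns the connected components. The function names, the union-find helper and the use of a plain list-of-lists distance matrix are assumptions made for this example, not the implementation used in the experiments.

```python
def prim_mst(dist):
    """Return the MST edges (u, v, weight) of a complete graph given as a distance matrix."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0               # arbitrary starting node
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it (linear scan).
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v], parent[v] = dist[u][v], u
    return edges


def single_link_clusters(dist, threshold):
    """Cut MST edges with weight >= threshold; connected components are the clusters."""
    n = len(dist)
    root = list(range(n))       # union-find forest over the documents

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for u, v, w in prim_mst(dist):
        if w < threshold:       # this edge survives the cut
            root[find(u)] = find(v)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With a hypothetical distance matrix over four documents, `[[0, 0.1, 0.9, 0.8], [0.1, 0, 0.7, 0.9], [0.9, 0.7, 0, 0.2], [0.8, 0.9, 0.2, 0]]`, a threshold of 0.4 keeps the MST edges of weight 0.1 and 0.2 and yields the clusters `[[0, 1], [2, 3]]`; a document left without any surviving edge forms a single-node cluster.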

As indicated above, the weight of an edge is the structural distance between XML documents; for example, the edge between tree 1 and tree 2 carries the structural distance between these two trees. Missing edges are the additional edges which would make the graph complete; their weights are equal to 1. Fig. 7 shows the shortest path on the graph in Fig. 6. Fig. 8 shows the parts of the graph remaining after deleting all edges with weight $\geq 0.4$.

Fig. 7 The shortest path in the graph of Fig. 6
Table 1: Matrix associated with the graph of Fig. 6
Table 2: MSP matrix (graph of Fig. 7)
Table 3: Matrix after applying the threshold
Fig. 8 Resulting graph after deleting all edges having weight $\geq 0.4$

Two new components are formed, containing the nodes (1, 2, 3, 6) and the nodes (5, 7), respectively. This indicates the presence of two new clusters, namely cluster 1 with (1, 2, 3, 6) as members and cluster 2 with (5, 7) as members. Nodes that are not connected to other nodes are considered as single-node clusters. The graph is represented by a matrix called the associated matrix: to the graph G = (N, A) of order n is associated a square matrix of order n. Tables 1, 2 and 3 show the matrices associated with the graphs of Figs. 6, 7 and 8, respectively. It suffices now to use the matrix obtained after applying the threshold 0.4 to deduce the remaining links between nodes (representing XML documents) and then build the corresponding clusters.

3.2.2 Overview of using Prim's algorithm

As announced above, Prim's algorithm [33] allows calculating the shortest path (or MST) in a given weighted graph G.
In an informal way, we apply the following points:

Create a tree containing a single node, chosen arbitrarily from the graph G
Create a set containing all the edges in the graph G
Loop until every edge in the set connects two nodes in the tree:
    remove from the set an edge with minimum weight that connects a node in the tree with a node not in the tree
    add that edge to the tree

Thus, the algorithm continuously increases the size of a tree, one edge at a time, starting with a tree consisting of a single node, until it spans all nodes of the initial graph G. A pseudo-code for Prim's algorithm is given in Fig. 9. To show how to apply Prim's algorithm to find a minimum spanning tree in a weighted graph, we rely on the example graph in Fig. 10. Prim's algorithm proceeds as follows. First we arbitrarily choose to start with the node d, and then we add edge {d, e}. Next,

we add edge {c, e}. Next, we add edge {d, f}. Next, we add an edge of weight 3 joining e to one of the two remaining nodes, a or b. And finally, we add edge {a, b}. This produces the minimum spanning tree given in Fig. 11.

Input: a non-empty connected weighted graph G = (N, A) (the weights can be negative)
Initializations: N_new ← {x}; A_new ← ∅ (where x is an arbitrary node (starting point) from N)
repeat
    choose an edge {u, v} with minimal weight such that u is in N_new and v is not (if there are multiple edges with the same weight, any of them may be picked)
    N_new ← N_new ∪ {v}; A_new ← A_new ∪ {{u, v}}
until N_new = N
Output: N_new and A_new describe an MST

Fig. 9 Prim's algorithm pseudo-code

Fig. 10 An example of a weighted connected graph

Fig. 11 Minimum spanning tree (MST) produced by applying Prim's algorithm on the graph in Fig. 10

We could start with any node to determine the MST. In the case of the previous example (in Fig. 10), we arbitrarily chose to start with the node d, but any node can be used to start the process with Prim's algorithm. The time complexity of the algorithm depends heavily on how the choice of the edge/node to add to the set is implemented at each stage. With a naive representation, using an adjacency matrix graph representation and searching an array of weights to find the minimum-weight edge to add requires $O(N^2)$ running time. Using a simple binary heap data structure and an adjacency list representation, Prim's algorithm can be shown to run in time $O(A \log N)$. Using a more sophisticated Fibonacci heap, this can be brought down to $O(A + N \log N)$, which is asymptotically faster when the graph is dense enough that A is $\omega(N)$, i.e., A dominates N asymptotically. However, we chose, for the purposes of our tests in this article, the adjacency matrix for the simplicity of its implementation.

At this stage, as previously announced, we focus in Subsection 3.3 on the description of the proposed structural similarity measure.

3.3 Tree structure similarity

Usually, to compare two words we use a thesaurus or a dictionary.
But when these words correspond to node names (labels) in a tree, it is necessary to take into account their respective tree relationships. The idea is that even though two nodes are represented by the same name, or by synonymous names, this does not mean that they remain necessarily similar in the context of their respective ancestors, descendants, siblings and/or cousins, which can be completely different. Thus, the similarity of two nodes depends not only on their ontological similarity (terms could be similar because they have the same string or could be semantically related using a dictionary), but also on their respective tree relationships, which play a crucial role in the similarity calculation. Most methods for clustering XML documents by structure use the edit distance for measuring the similarity between their structures. We recall that tree edit distance measures the number of elementary operations (insertions, deletions and replacements of nodes) required to transform one tree into another. On the other hand, all these methods perform a depth-first traversal to visit the nodes of a tree. We propose a novel method for calculating the similarity. Firstly, instead of performing a depth-first traversal to visit the nodes of a tree, our proposal is to perform a breadth-first traversal, also called level-by-level traversal. In other words, we explore the breadth, i.e., the full width of the tree at a given level, before going deeper. Secondly, we take into consideration the hierarchical contexts of XML tree structures. Before describing our method in detail, it is necessary to introduce some fundamental concepts.

3.3.1 Basic preliminary notions

A tree level consists of sibling and/or cousin nodes. As suggested in our approach, repetitions of sibling nodes will be eliminated, but not those of the cousin nodes. Therefore, it is possible to encounter on the same tree level several duplications of cousin nodes. It is then necessary in such a case to take them into account in the similarity calculation. To express that, we can use the concept of weight. Indeed, let $V = (v_1, v_2, \ldots, v_n)$ be a vector of $\mathbb{R}^n$; its norm (Euclidean distance) is $\|V\| = \sqrt{\sum_{i=1}^{n} v_i^2}$. The use of the norm allows exploiting efficiently the concept

of weight. We can extend its use even to objects that are not necessarily vectors of $\mathbb{R}^n$. Indeed, for example, if $L = (a, b, b, c, c, c)$ is a tree level, then the weights (or frequencies) of a, b and c are 1, 2 and 3, respectively. Therefore, if these weights are stored in a vector such as $(1, 2, 3)$, then the norm associated with L is $\sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74$. The norm will serve thereafter for the normalization of the similarity values.

Moreover, in order to fully highlight the features of our approach, we should also recall some notions on depth- and breadth-first traversals of trees. Indeed, there are essentially two different methods to visit systematically all the nodes of a tree, namely, depth-first traversal and breadth-first traversal. Certain depth-first traversal methods occur frequently enough that they are given names of their own: preorder traversal, inorder traversal and postorder traversal. To describe these concepts easily and clearly it is better to rely on concrete examples. In fact, we do not really need to dwell too long on the details of tree traversal; we give only the minimum necessary to distinguish the breadth-first traversal (which characterizes our proposed method) from the depth-first traversal that was used in most existing clustering methods. Thus, for example, given the tree in Fig. 12, a preorder traversal would visit the elements in the order: A, B, C, D, E, F, G, H, I. This type of traversal is called depth-first traversal because it tries to go deeper in the tree before exploring sibling nodes.

Fig. 12 Simple general tree

For example, the traversal visits all the descendants of B (i.e., keeps going deeper) before visiting B's sibling D (and any of D's descendants). This kind of traversal can be achieved by the simple recursive algorithm given in Fig. 13. Whereas the depth-first traversals are defined recursively, breadth-first traversal is best understood as a non-recursive traversal. The breadth-first traversal of a tree visits the nodes in the order of their depth in the tree. The breadth-first traversal algorithm first visits all the nodes at level 0 (i.e., the root), then all the nodes at level one, and so on. At each level the nodes are visited from left to right. Thus, a breadth-first traversal of the tree shown in Fig. 12 visits the nodes in the following order: A, B, D, C, E, H, F, G, I.

preorder (tree)
    if (tree not empty)
        visit root of tree
        preorder (left sub_tree)
        preorder (right sub_tree)

Fig. 13 Preorder traversal algorithm

3.3.2 Breadth-first tree traversal

To our knowledge, the breadth-first traversal algorithm has not been practically applied in existing work on clustering of XML documents. We encountered only one approach, in [9], that addressed the similarity computation according to the similarities of the levels of XML tree structures. Recall that in our approach, the representative structures of XML documents are tree structural summaries, structured as general trees, i.e., where each tree node can have any number of children. The algorithm in Fig. 14 allows exploring a general tree and retrieving its nodes, adopting the breadth-first traversal. The breadth-first traversal has a linear time complexity O(N) in the worst case, like the depth-first traversal.

breadth_traversal (n : Node)
begin
    level ← {n}
    while level ≠ ∅
    {
        depth_level ← ∅;
        for each node a ∈ level
        {
            store a in list;
            depth_level ← depth_level ∪ child_of (a);
        }
        level ← depth_level;
    }
end

Fig. 14 Breadth-first traversal algorithm

Indeed, given a tree of N nodes, the algorithm in Fig. 14 clearly shows the linearity of the time complexity. At each level, the nodes are visited from left to right, and then stored in lists that will be used thereafter for calculating similarities. The advantage of storing the nodes in lists is twofold. On the one hand, this allows an easy calculation of the basic similarities between levels of trees. On the other hand, given two tree levels belonging respectively to two trees, it is possible to know the similarities of their respective ancestor and descendant levels.
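The level-by-level traversal can be sketched as follows. The dictionary-based tree encoding, the function names and the tree shape itself are assumptions: the shape is one that is consistent with the preorder and breadth-first visiting orders quoted in the text, and labels are assumed unique here only to keep the encoding short (tree summaries may repeat labels among cousins).

```python
def levels(tree, root):
    """Breadth-first traversal: return the node labels of `tree` level by level."""
    result = []
    level = [root]
    while level:
        result.append(level)
        # The next level gathers the children of every node, from left to right.
        level = [child for node in level for child in tree.get(node, [])]
    return result


def preorder(tree, node):
    """Depth-first (preorder) traversal, shown for contrast."""
    order = [node]
    for child in tree.get(node, []):
        order.extend(preorder(tree, child))
    return order


# Hypothetical tree shape: parent label -> ordered list of child labels.
TREE = {"A": ["B", "D"], "B": ["C"], "D": ["E", "H"], "E": ["F", "G"], "H": ["I"]}

print(levels(TREE, "A"))    # [['A'], ['B', 'D'], ['C', 'E', 'H'], ['F', 'G', 'I']]
print(preorder(TREE, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
```

Each node is appended to exactly one level list and its children are read once, so the running time is linear in the number of nodes. Keeping one list per level also gives direct access to the ancestor levels (earlier lists) and descendant levels (later lists) of any level.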
As suggested above, the ancestor and descendant levels represent, in a sense, hierarchical contexts to take into account in calculating the similarity of two levels of two given trees. These levels are implicitly linked by hierarchical relationships in trees. The underlying idea is that even though two tree levels are identical, or very similar, this

does not mean that they remain necessarily similar in the context of their respective ancestor and descendant levels, which can be completely different. So, given all these characteristics, we describe and explain in Subsection 3.3.3 the structural similarity measure that we propose, taking into account the hierarchical relationships between levels in each tree.

3.3.3 Structural similarity measure based on breadth-first tree traversal

Let $T_1$ and $T_2$ be two trees representing respectively two XML documents. We propose to compute their similarity by Eq. (1), which combines the similarities $Sim(l_i, l_j)$ of the levels $l_i$ and $l_j$. The levels $l_i$ and $l_j$ belong respectively to $T_1$ and $T_2$. The bounds n and m are the numbers of levels of $T_1$ and $T_2$, respectively. Given two levels $l_i$ and $l_j$, we define their similarity according to their hierarchical context as follows:

$$Sim(l_i, l_j) = w_1 S_1 + w_2 S_2 + w_3 S_3 \qquad (2)$$

where $w_1 \geq 0$, $w_2 \geq 0$ and $w_3 \geq 0$ are weights such that $w_1 + w_2 + w_3 = 1$. $S_1$ is the basic similarity of $l_i$ and $l_j$. It is expressed as follows:

$$S_1 = \frac{\sum_{k=1}^{p} \sum_{l=1}^{q} sim(e_k, e_l)}{\|N_1\| \times \|N_2\|} \qquad (3)$$

The term $sim(e_k, e_l)$ is the ontological similarity of the nodes $e_k$ and $e_l$ (obtained using a dictionary). In other words, $sim(e_k, e_l) = 1$ if $e_k = e_l$, it takes a dictionary-based value if $e_k$ and $e_l$ are synonymous, and it is 0 otherwise. The nodes $e_k$ and $e_l$ belong respectively to the levels $l_i$ and $l_j$. The bounds p and q are the numbers of nodes of $l_i$ and $l_j$, respectively. The product $\|N_1\| \times \|N_2\|$ allows normalizing the sum. The terms $N_1$ and $N_2$ are two vectors whose elements are the weights of the nodes belonging respectively to the tree levels $l_i$ and $l_j$. Thus, $S_1$ is calculated for each pair of levels $(l_i, l_j)$, and the result is the basic similarity matrix of the trees $T_1$ and $T_2$. In Subsection 3.3.4, we give an idea about the calculation of this matrix.

$S_2$ and $S_3$ in some way reflect the hierarchical context in calculating the similarity of each pair of levels $(l_i, l_j)$. $S_2$ represents the similarity of the descendant levels of $l_i$ and $l_j$, respectively. It is expressed, in Eq. (4), in terms of $S_1(d_k, d_l)$, the basic similarity of the levels $d_k$ and $d_l$. The levels $d_k$ and $d_l$ belong respectively to $desc_1$ and $desc_2$. The terms $desc_1$ and $desc_2$ are the sets of descendant levels of $l_i$ and $l_j$, respectively.
The bounds $r$ and $s$ are the numbers of levels of $desc_1$ and $desc_2$, respectively. $S_3$ is the similarity of the ancestor levels of $l_i$ and $l_j$. It is expressed as follows:

$S_3(l_i, l_j) =$ […] (5)

The summed term represents the basic similarity of the levels $a_k$ and $a_l$. The levels $a_k$ and $a_l$ belong respectively to $anc_1$ and $anc_2$. The terms $anc_1$ and $anc_2$ represent the sets of ancestor levels of $l_i$ and $l_j$, respectively. The bounds $t$ and $u$ are the numbers of levels of $anc_1$ and $anc_2$, respectively.

3.3.4 Illustrative example

This example shows the different steps followed in computing the similarity of the two trees $T_1$ and $T_2$ in Fig. 5 using the proposed structural similarity measure based on breadth-first tree traversal. The first step is to use Eq. (3) to calculate the similarity matrix of the levels of $T_1$ and $T_2$. As there are three levels in each tree, we obtain a 3 × 3 matrix. The calculation gives the following matrix:

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0.8 \end{pmatrix}$

We note that the similarity between the last levels of $T_1$ and $T_2$ is equal to 0.8, while it is equal to 1 between the other levels of the same rank. It is equal to 0 everywhere else.

[Fig. 5: Comparison of two XML trees using the calculation of the structural similarity based on the breadth-first traversal]

Before calculating $S_2$ and $S_3$, it is appropriate to define how to use the weights $w_1$, $w_2$ and $w_3$. If we ignore the hierarchical contexts (descendant levels and ancestor levels), it is not necessary to calculate $S_2$ and $S_3$; in this case we take $w_1 = 1$ with $w_2 = 0$ and $w_3 = 0$. Otherwise, in particular in the case of XML documents, it is more natural to give $S_1$, $S_2$ and $S_3$ the weights $w_1 = 1/3$, $w_2 = 1/3$ and $w_3 = 1/3$, respectively. Thus, in the first case, where we do not consider the hierarchical contexts, the similarity between two levels of two trees is defined by $S_1$ alone.

Indeed, with ($w_1 = 1$, $w_2 = 0$ and $w_3 = 0$), we have $Sim(l_i, l_j) = S_1(l_i, l_j)$, because $w_1 = 1$. So the final similarity calculation of the two trees $T_1$ and $T_2$ becomes easy and requires only applying formula (1) to the 3 × 3 matrix calculated above. We therefore obtain $Sim(T_1, T_2) =$ […] $= 0.94$, which is a relatively good similarity value, but one that we could equally get by comparing two vectors, so it does not reflect the tree view of XML documents.

If we now consider the case where $w_1 = 1/3$, $w_2 = 1/3$ and $w_3 = 1/3$, the calculation is obviously more complicated, but in principle it yields a more reliable similarity. The elements of the new similarity matrix, before calculating the final similarity, are calculated using Eq. (2) with these weights, so that $Sim(l_i, l_j) = \frac{1}{3}\,(S_1(l_i, l_j) + S_2(l_i, l_j) + S_3(l_i, l_j))$. In this case, we must also calculate $S_2$ and $S_3$ using Eqs. (4) and (5). To go faster, we calculate $S_2$ and $S_3$ only for the non-zero entries of the matrix calculated above. The elements concerned are those of the main diagonal of the matrix, namely (1, 1), (2, 2) and (3, 3), which have the values 1, 1 and 0.8, respectively. Moreover, some elements of the matrix are not concerned by the calculation of $S_2$ or $S_3$, for example those of the last row (no descendant levels) or those of the first row (no ancestor levels). So as not to distort the similarity computation, we assign the value 1 to $S_2$ and $S_3$ in the case of the last row and the first row of the matrix, respectively. Thus, for each non-zero element of the matrix, we calculate the values of $S_2$ and $S_3$ as follows:

The element (1, 1) has no ancestor levels, i.e., $S_3 = 1$, but it has two descendant levels, namely (2, 2) and ([…], 3), with basic similarities 1 and 0 respectively, that is to say $S_2 = 0.5$. So, by applying Eq. (2), we have $Sim(l_1, l_1) =$ […].

The element (2, 2) has only one ancestor level and one descendant level, corresponding respectively to (1, 1) and (3, 3), which gives $S_1(l_1, l_1) = 1$ and $S_1(l_3, l_3) = 0.8$, i.e., […]. So, by applying Eq. (2), we have $Sim(l_2, l_2) =$ […].

The last case concerns the element (3, 3), which has no descendant levels but has four ancestor levels, namely (1, 1), (1, 2), (2, 1) and (2, 2). Regarding the descendant levels, we assign the value 1, as expected, to $S_2$, i.e., $S_2 = 1$.
The other values are calculated as follows: $S_1(l_1, l_1) = 1$, $S_1(l_1, l_2) = 0$, $S_1(l_2, l_1) = 0$ and $S_1(l_2, l_2) = 1$. Thus, we have $S_3 =$ […]. Finally, we obtain $Sim(l_3, l_3) =$ […]. The final matrix formed by the elements $Sim(l_i, l_j)$, before calculating the similarity of the two trees $T_1$ and $T_2$, is given as follows:

[…]

Applying Eq. (1), we obtain $Sim(T_1, T_2) =$ […]. Unlike the first result (namely 0.94), obtained without taking into account the hierarchical contexts of the trees $T_1$ and $T_2$, i.e., with $w_1 = 1$, $w_2 = 0$ and $w_3 = 0$, the latter result (namely 0.905) seems to better reflect the reality of the tree structure of XML documents. This example gives an idea of how to calculate the similarity according to our approach; to validate our proposal, we run several tests in the experimental part of this paper.

3.3.5 Complexity of the structural similarity calculation

Consider a general tree of $M$ nodes and height $h$, the latter being equal to the number of tree levels. A tree level other than that of the root contains on average roughly $M/h$ nodes. Therefore, given two trees whose levels contain respectively $M_1/h_1$ and $M_2/h_2$ nodes, the calculation of their basic similarity matrix is achieved on average in $M_1 M_2$ operations, since they have respectively $h_1$ and $h_2$ levels. In other words, it requires a time complexity of order $O(M_1 M_2)$, which is the same as that of calculating the similarity based on edit distance. However, in our approach, unlike approaches based on edit distance, we extend the similarity calculation by taking into account the tree relationships between nodes. It is therefore necessary to add the calculation of the descendant and ancestor level similarities. Based on the basic similarity matrix $S_1[1..h_1, 1..h_2]$, the worst-case time complexity of this additional calculation is of the order […]. Note, however, that the heights $h_1$ and $h_2$ are usually much smaller than the tree sizes (numbers of nodes) $M_1$ and $M_2$ respectively. We thus obtain a time complexity slightly higher than that of the edit distance, but this is acceptable given the relevance of the proposed similarity measure, which takes into account the hierarchical relationships of nodes.
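The level-based measure described in Subsection 3.3.3 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: trees are written as `(label, children)` pairs, `basic_similarity` is a stand-in for $S_1$ that counts exact label matches only (no synonym dictionary, no node weights), and `context_similarity` is a stand-in for $S_2$ and $S_3$ that averages the basic similarity over all pairs of context levels and returns 1 when a side has no such levels. All function names are hypothetical.

```python
from collections import deque
from itertools import product


def levels_bfs(tree):
    """Return the node labels of `tree` grouped by level, root level first.

    A tree is a (label, children) pair, where children is a list of trees.
    """
    levels = []
    queue = deque([(tree, 0)])
    while queue:
        (label, children), depth = queue.popleft()
        if depth == len(levels):  # first node met at this depth
            levels.append([])
        levels[depth].append(label)
        queue.extend((child, depth + 1) for child in children)
    return levels


def basic_similarity(level_a, level_b):
    """Stand-in for S1 (assumption): cosine of the two label sets,
    counting exact label matches only."""
    a, b = set(level_a), set(level_b)
    return len(a & b) / (len(a) * len(b)) ** 0.5


def context_similarity(levels_a, levels_b):
    """Stand-in for S2 and S3 (assumption): mean basic similarity over all
    pairs of context levels, or 1.0 when a side has no such levels."""
    if not levels_a or not levels_b:
        return 1.0
    pairs = list(product(levels_a, levels_b))
    return sum(basic_similarity(a, b) for a, b in pairs) / len(pairs)


def level_similarity(levels_1, i, levels_2, j, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination w1*S1 + w2*S2 + w3*S3 for the levels i and j."""
    w1, w2, w3 = weights
    s1 = basic_similarity(levels_1[i], levels_2[j])
    s2 = context_similarity(levels_1[i + 1:], levels_2[j + 1:])  # descendants
    s3 = context_similarity(levels_1[:i], levels_2[:j])          # ancestors
    return w1 * s1 + w2 * s2 + w3 * s3


# Two small hypothetical trees (not the trees of Fig. 5).
t1 = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
t2 = ("a", [("b", [("d", [])]), ("c", [])])
levels_1, levels_2 = levels_bfs(t1), levels_bfs(t2)
print(levels_1)
print(level_similarity(levels_1, 0, levels_2, 0, weights=(1, 0, 0)))
print(level_similarity(levels_1, 0, levels_2, 0))
```

With the weights (1, 0, 0) the root levels are judged identical, whereas equal weights lower the score because the levels below the roots differ; this is the effect of the hierarchical context that the example above illustrates.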
Remark: we have proposed a similarity measure rather than a distance for clustering XML documents, whereas we rely on Prim's algorithm, which computes the minimum spanning tree (MST) of a graph; the MST is then exploited for clustering XML documents based on their structural distances (each node of the MST symbolizes the structure of an XML document). It is therefore necessary to adapt our similarity measure. To do this, it suffices to replace the similarity value calculated with the proposed similarity measure by the distance value given by the following Eq. (6).

[…] (6)

In the next section, we evaluate the effectiveness and efficiency of our approach. To this end, we conducted our experiments relying on several different tests.

4. Experiment and results

4.1 Implementation of the clustering system

We developed a first program in Java, under the JCreator environment. The developed program consists of two modules: the first one is based on SAX and carries out the first parsing, as announced in Subsection 3.[…]. For each processed XML document, this module provides an intermediate file that is read by a second module to finalize the extraction of the corresponding tree summary. To cluster the XML summary trees obtained with this parsing program (i.e., the tree structural summaries extraction program), we wrote a second program in C++ that uses the files (representing the tree summaries) generated by the first program.

4.2 Experimental framework

Our experiments were carried out on a Lenovo machine with an Intel Core Duo […] GHz CPU and […].99 GB of RAM. For the data set, the experiments were carried out on both real (ACM SIGMOD Record) and synthetic XML collections. The ACM SIGMOD Record corpus concerns scientific articles published by the ACM SIGMOD conference and is composed of approximately […],000 XML documents sharing 5 DTDs, namely HomePage, IndexTermsPage, OrdinaryIssuePage, ProceedingsPage, and SigmodRecord. These DTDs can, in fact, be considered as target classes against which we can assess our clustering approach. This corpus is distributed as shown in Table 4.

Table 4: ACM SIGMOD Record corpus distribution

DTD                  Number of XML documents
IndexTermsPage       90
OrdinaryIssuePage    30
ProceedingsPage      6
SigmodRecord         […]
HomePage             […]

4.3 Evaluation metrics

The evaluation verifies to what extent the clustering finds clusters in agreement with the classes of the labeled corpus, which are considered as target classes. To validate our approach, we used the Precision, Recall and F-measure, which are commonly used metrics to assess clustering results.
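The three metrics can be sketched as follows, assuming the usual cluster-versus-class definitions: for a cluster C and a target class (DTD) d, Precision is the share of the documents of C that belong to d, Recall is the share of the documents of d that were placed in C, and the F-measure is their harmonic mean. The function name and the sample counts are hypothetical and do not come from the experiments.

```python
def precision_recall_f(n_c, n_d, x_d):
    """Return (P, R, F) for one cluster C and one target class d.

    n_c: number of documents in cluster C
    n_d: number of documents in the target class d
    x_d: number of documents of class d assigned to cluster C
    """
    precision = x_d / n_c if n_c else 0.0
    recall = x_d / n_d if n_d else 0.0
    total = precision + recall
    # Harmonic mean of Precision and Recall; 0 when both are 0.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts: a cluster of 40 documents, 24 of which come from a
# target class that contains 30 documents in total.
p, r, f = precision_recall_f(n_c=40, n_d=30, x_d=24)
print(p, r, f)
```

Averaging these per-class values over all target classes gives corpus-level figures of the kind reported later in the comparison tables.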
F (F-measure) [6] is a combination of Precision and Recall. It measures the balance between P (Precision) and R (Recall), expressed respectively by the following Eqs. (7) and (8).

$P = X_d / N_c$ (7)

$R = X_d / N_d$ (8)

$N_c$ is the number of documents in the cluster C, $N_d$ is the number of documents in the target class (DTD) and $X_d$ is the number of documents of the target class assigned to cluster C. We recall that each DTD is considered as a target class against which we can evaluate our clustering. So we know these classes a priori, i.e., we know their number and the names of the documents they contain. The F-measure F, in turn, is expressed by Eq. (9), which is the harmonic mean of Precision and Recall.

$F = 2PR / (P + R)$ (9)

4.4 Evaluation and discussion

In this phase, we first derive the corresponding tree summaries from the previous XML collection, and we then proceed to their clustering. The first test compares the proposed similarity measure with the similarity measure based on tree edit distance and with the similarity measure proposed in [3]. The second test compares some of our results with those of existing approaches. Finally, the third test confirms the asymptotic time complexity of our similarity measure.

4.4.1 Proposed similarity measure versus other similarity measures

In the first test, we compared the proposed similarity measure with two other measures, namely the edit distance and the similarity measure proposed in [3]. We chose to compare our similarity measure with the edit distance because the latter has been widely used in many clustering approaches. The comparison with the work presented in [3] is justified by the fact that we use exactly the same model for representing XML documents, in this case structural tree summaries. This comparison test is motivated by the response time of our clustering on the one hand and, on the other hand, by the reliability of our similarity measure. For this, we first replaced, in our clustering algorithm, the similarity measure described by Eq. (6) by the edit distance and by the similarity measure proposed in [3], respectively. We then performed three series of tests with the same distance threshold values in the interval […]. However, given the recurring results (especially for the 90 documents corresponding to IndexTermsPage, which are almost identical and thus structurally very similar), we used only […] of the corpus, namely 348 XML documents, i.e., ([…]) corresponding respectively to IndexTermsPage, OrdinaryIssuePage, ProceedingsPage, SigmodRecord, and HomePage. To give a clear idea of the performance and reliability of our tests, it is appropriate to report the comparative results (for the same type of test) published in [3]. These results are given in Table 5. Some abbreviations are used in Table 5: NC is the number of clusters. The time unit in the column named Time is the second. The abbreviations SM and ED denote respectively Similarity Measure and Edit Distance. Finally, the abbreviation TH represents the similarity threshold.

Table 5: [3] similarity measure versus edit distance

TH    [3] SM (NC, Time)    ED (NC, Time)
[…]

Note that the clustering algorithm in [3] is completely different from the clustering algorithm we propose in this article. In this regard, we recall that our clustering is based on conventional agglomerative hierarchical classification, while that of the approach in [3] is an incremental clustering. As announced above, after completing the first test with our similarity measure (based on Eq. 6), we replace it, in our clustering algorithm, successively by the edit distance and by the [3] similarity measure. We then perform two new series of tests whose results are collected in Table 6. The other abbreviations concerning Table 6 are DT and OSM; they denote respectively the distance threshold and our similarity measure (based on Eq. 6).
As can be seen in Table 6, in all cases, i.e., with the proposed similarity measure (OSM) or with the other measures (ED and [3] SM), the clustering time remains practically the same when the similarity threshold changes (increases or decreases). Indeed, our clustering uses the minimum distance as the aggregation criterion. In other words, the number of comparisons is practically the same for each distance threshold value. As for differences, there is a lag in response times and there are differences between the similarity values obtained. The difference in response times is expected, given the differences between the equations used by the three measures tested. The time parameter is not very restrictive and should not weigh heavily on the feasibility of such applications (clustering is not an interactive application where time is always a critical parameter). Note, however, that the differences in the similarity values are crucial, since it is on the basis of similarity that a document is or is not assigned to a cluster. Moreover, these differences have a direct impact on the number of clusters (NC) obtained in each test. Indeed, with these thresholds, some documents are structurally too distant to stay together in the same cluster. This is due to the [3] similarity measure and to our similarity measure, which take into account the ancestor and descendant context of nodes, so that only documents having very close hierarchical structures end up in the same cluster. Thus, XML documents that do not satisfy this condition, i.e., that are not sufficiently structurally similar, migrate to other newly created clusters. In fact, these new clusters are considered as not corresponding to any DTD. We recall that each DTD is considered as a target class against which we can assess our clustering.

Table 6: Our similarity measure versus [3] measure and edit distance measure

DT    [3] SM (NC, Time)    ED (NC, Time)    OSM (NC, Time)
[…]

When we compare the results in Tables 5 and 6, there are some differences.
Indeed, if we consider the column named [3] SM in the two tables in question, we find that there is a clear difference in the clustering time. This is most likely due to our clustering algorithm, which is faster than the algorithm of the study in [3], which is relatively slow. The number of clusters NC changes less rapidly with the distance threshold in Table 5 than in Table 6. This is due to the clustering algorithms, which are different. The clustering algorithm used in this article is a simple algorithm based on conventional agglomerative hierarchical classification, while the clustering approach in [3] is an incremental clustering. Our clustering algorithm uses only the minimum distance as the aggregation criterion, while the clustering approach in [3] is characterized by the mobility of the centroid representing each cluster. Each time an XML document must be added, its representative is systematically compared with all existing centroids and with all the trees in the cluster to which it is assigned. During the comparison process, we can either have a new centroid, which is systematically assigned to a newly created cluster, or an existing centroid can be replaced by another, more representative tree among those of the same cluster. This explains the differences between the two approaches, particularly regarding the variation of the clustering time (SM column) in Table 5. If we consider the column OSM in Table 6, representing our approach, we see that it is somewhat close to the result of the ED column in Table 5, in terms of clustering time and number of clusters NC. But it is somewhat far from the result of the ED column in Table 6, particularly in terms of clustering time. We conclude that our method is better because it is reliable in terms of both clustering time and clustering quality.

4.4.2 Comparison of some of our results with those of other clustering methods

The second test compares some of our results with those obtained by the approaches in [3, 6, 3, 3, 38] on a portion of the ACM SIGMOD collection. To this end, we used the sample of XML documents in Table 7.

Table 7: Distribution of the ACM SIGMOD Record subset

Name of the DTD      Number of XML documents
IndexTermsPage       5
OrdinaryIssuePage    30
ProceedingsPage      6
SigmodRecord         […]
HomePage             […]

We chose to compare our method with those developed in [3, 6, 3, 3, 38] for several reasons. First, like our approach, these approaches use a very close representation, namely tree structures, to represent XML documents structurally. Second, they use the same data set, namely the ACM SIGMOD corpus. Third, their clustering methods are all based on an edit distance or a similarity that is different from our similarity measure.
Recall that these results do not depend only on the similarity measure, but also, and especially, on the model (original XML tree structure or XML tree structure summary) used to represent the structures of XML documents. Table 8 shows the results of this comparison. Note that the results of [3, 6, 3, 3, 38] were reported in [3, 38]. These values represent the average Precision, Recall, and F-measure, in the interval [0, 1]. The results in Table 8 show that our clustering has a slightly lower Precision than those of the approaches in [6, 3, 38], but it is very close to those of the approaches in [3, 3]. It nevertheless has a better Recall than the majority of the other approaches, with the exception of the approach in [3]. Finally, the F-measure obtained by our clustering is also higher than all the others, with the exception of that of the approach in [3]. However, our clustering is better overall, since it has a better Precision than the approach in [3].

Table 8: Comparison of our results with those of other approaches

Approach        Precision    Recall    F-measure
[3]             […]          […]       […]
[6]             […]          […]       […]
[3]             […]          […]       […]
[3]             […]          […]       […]
[38]            […]          […]       […]
Our approach    […]          […]       […]

4.4.3 Time needed to calculate the structural similarity between two XML documents

Finally, in this third test, we conduct experiments to determine the time required to calculate the structural similarity between two XML documents. To conduct these experiments, we generated a set of […]0 synthetic XML documents whose numbers of nodes vary from 50 to 500. We conducted two sets of tests with this group of XML documents:

The first one was conducted by setting the weights $w_1$, $w_2$ and $w_3$ to 1/3 in Eq. (2). As we have already noted, it is more natural, in the case of XML documents, to give $S_1$, $S_2$ and $S_3$ the same weight, namely $w_1 = 1/3$, $w_2 = 1/3$ and $w_3 = 1/3$. Recall that $S_2$ and $S_3$ represent respectively the descendant and ancestor contexts. For more details, see the equations for calculating the similarity.

The second one was conducted by setting the weights to $w_1 = 1$, $w_2 = w_3 = 0$ in the same equation.
In other words, we ignore the hierarchical contexts (descendant levels and ancestor levels). Therefore, it is not necessary to calculate $S_2$ and $S_3$. In this case our similarity measure behaves like the edit distance. Thus, the time complexity of calculating the similarity between two trees $T_1$ and $T_2$ is in the worst case $O(N^2)$. What matters in this test is not the quality of the clustering, but the time required to compare the structures of two XML documents. Therefore, it is not necessary to have summaries of the XML trees. For this, we slightly modified our parser so as not to remove the repetitions of sibling nodes, and thus to obtain the original XML tree structures (the whole structure of each document). For more details about this question, see Subsection 3.[…].
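The quadratic bound can be checked with a short count. The derivation below is a sketch under one assumption, namely that the basic similarity compares every node of each level of $T_1$ with every node of each level of $T_2$; the symbols $p_i$ and $q_j$ (numbers of nodes of level $i$ of $T_1$ and of level $j$ of $T_2$) are introduced here for illustration only.

```latex
\sum_{i=1}^{h_1} \sum_{j=1}^{h_2} p_i \, q_j
  = \Bigl( \sum_{i=1}^{h_1} p_i \Bigr) \Bigl( \sum_{j=1}^{h_2} q_j \Bigr)
  = M_1 M_2
```

Under this assumption the count does not depend on how the nodes are distributed over the levels, and two trees of $N$ nodes each require $N^2$ node comparisons.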
