Approximate XML Structure Validation based on Document-Grammar Tree Similarity


Joe Tekli 1, Richard Chbeir 2 *, Agma J.M. Traina 3, Caetano Traina Jr. 3, Renato Fileto 4

1 Dept. of Elec. and Compt. Eng., SOE, Lebanese American University (LAU), 36 Byblos, Lebanon
2 LIUPPA Laboratory, University of Pau and Adour Countries (UPPA), 64, Anglet, France
3 ICMC, University of Sao Paulo (USP), São Carlos, SP, Brazil
4 Federal University of Santa Catarina (UFSC), Florianopolis, SC, Brazil

Abstract. Comparing XML documents with XML grammars, also known as XML document and grammar validation, is useful in various applications such as: XML document classification, document transformation, grammar evolution, XML retrieval, and the selective dissemination of information. While exact (Boolean) XML validation has been extensively investigated in the literature, the more general problem of approximate (similarity-based) XML validation, i.e., document-grammar similarity evaluation, has not yet received strong attention. In this paper, we propose an original method for measuring the structural similarity between an XML document and an XML grammar (DTD or XSD), considering their most common operators that designate constraints on the existence, repeatability and alternativeness of XML elements/attributes (e.g., ?, *, MinOccurs, MaxOccurs, etc.). Our approach exploits the concept of tree edit distance, introducing a novel edit distance recurrence and dedicated algorithms to effectively compare XML documents and grammar structures, modeled as ordered labeled trees. Our method also inherently performs exact validation by imposing a maximum similarity threshold (minimum edit distance) on the returned results. We implemented a prototype and conducted several experiments on large sets of real and synthetic XML documents and grammars. Results underline our approach's effectiveness in classifying similar documents with respect to predefined grammars, accurately detecting document and/or grammar modifications, and performing document and grammar relevance ranking. Time and space analysis were also conducted.
Keywords: XML; Semi-structured data; XML Grammar; Structural Similarity; Tree Edit Distance; Classification, Relevance Ranking.

1. Introduction

The structural and self-describing nature of XML promotes a number of emerging techniques ranging from XML version control, intelligent Web search, and data integration, to message translation and clustering/classification, requiring, in one way or another, some notion of XML structural similarity. In XML similarity-related research, most work has focused on estimating similarity at the XML data layer (comparing XML documents, e.g., [6, 33, 48]), while quite a few studies have targeted the XML type layer (comparing XML grammars, e.g., [5, 8, 6]). Nonetheless, few efforts have been dedicated to similarity evaluation in-between the XML data and type (document/grammar) layers. Traditionally, most studies related to XML document/grammar comparison have targeted XML validation [7, 8, 49], i.e., a specific case of Boolean XML comparison, designed to verify whether an XML document is valid (or not) with respect to (w.r.t.) a given XML grammar (DTD [6] or XSD [3]). Yet with the proliferation of heterogeneous XML data on the Web (i.e., documents originating from different data-sources and not conforming to the same grammar, or documents lacking predefined grammars), there is an increasing need to perform ranked XML document/grammar comparison, which we refer to as approximate XML validation: identifying those documents which are not necessarily valid w.r.t. the user grammar, but which share a certain amount of similarity with the grammar, ranked following their similarity scores.
Evaluating the similarity between heterogeneous documents and grammars can be exploited in various application scenarios requiring accurate and ranked detection of XML structural similarities [, 6], ranging over: XML document classification against a set of grammars declared in an XML database [, 8] (just as DB schemas are necessary in traditional DBMS for the provision of efficient storage, retrieval and indexing facilities, the same is true for DTDs and/or XSDs in XML repositories); XML ranked document retrieval via structural queries [3, 55] (a structural query being represented as a DTD/XSD in which additional constraints on content can be defined); the selective dissemination of XML documents [] (user profiles being expressed as DTDs/XSDs against which the incoming XML data stream is matched); as well as Web service matching and SOAP processing (searching and ranking services which best match WSDL service requests, and comparing outgoing SOAP messages to sender-side WSDLs, processing only those parts of the messages which differ from the WSDL descriptions in order to avoid unnecessary overhead, and thus reduce processing cost in SOAP parsing [74], serialization [], and communications [7, 78]).

* Corresponding author. E-mail: richard.chbeir@univ-pau.fr

Note: Web Service Description Language (WSDL) is a special XML grammar structure that supports the machine-readable description of a Web service's interface and the operations it supports, including corresponding SOAP message formats.

In this study, we focus on the problem of evaluating the structural similarity between an XML document and an XML grammar, i.e., comparing the structural arrangement and ordering of XML elements/attributes in the XML document and the XML grammar. Different from previous approaches which are either generic (disregarding XML grammar constraints, e.g., the Or operator, ?, *, +, etc.) [3, 5, 75], developed for the DTD language and do not consider more complex and expressive XSD-based constraints (e.g., MinOccurs and MaxOccurs) [9, ], or restricted to Boolean results (i.e., traditional XML validation methods [7, 8, 49]), we aim at providing a method which is: i) fine-grained in detecting and identifying the structural similarities and disparities between XML documents and grammars, in comparison with current generic [3, 75] and alternative [9, ] approaches, ii) considering the more expressive XSD grammar constraints (namely MinOccurs and MaxOccurs), in comparison with less expressive DTD-based constraints (e.g., ?, *, +) handled in existing methods [9, ], and iii) producing a ranked similarity result, in comparison with existing Boolean (validation) methods, e.g., [7, 8, 49].

To achieve these goals, we provide a new approach that extends well-known dynamic programming techniques for finding the edit distance between tree structures, XML documents and grammars being modeled as Rooted Ordered Labeled Trees. Our approach consists of two main phases: i) XML document/grammar tree representation, and ii) XML document/grammar tree comparison (cf. overall architecture in Fig. 1). While XML documents can be naturally represented as labeled trees, XML grammars are usually more intricate, due to the various types of constraints on the existence, repeatability and alternativeness of XML nodes (e.g., ?, *, + operators in DTDs, MinOccurs, MaxOccurs cardinality operators in XSD, as well as the And sequence operator and Or alternativeness operator). These would have to be considered to obtain an accurate similarity measure.
Hence, we address the problem of comparing an XML document with an XML grammar as that of: producing a tree representation for the XML grammar (comparable to the XML document tree representation) with additional components to describe cardinality constraints (namely the MinOccurs and MaxOccurs operators), and then applying a tree-to-tree edit distance function to compute document-to-grammar structural similarity, taking into account XML grammar constraints. We introduce dedicated grammar transformation rules to simplify grammar expressions (while preserving their expressiveness), representing each grammar as a single tree or a set of trees following its disjunctive normal form (i.e., a set of grammars free of the Or operator, e.g., declaration (a | (b, c)) is split into two declarations: a and (b, c), each being represented as a separate tree). Then, we introduce a Tree (Edit Distance) Comparison approach to compute (concurrently, using multi-thread processing) the cost of transforming the XML document tree so that it becomes valid w.r.t. the (set of) XML grammar tree(s). Minimum Tree edit Operations Costs, computed via a TOC module, are fed to a Tree Edit Distance (TED) algorithm, which identifies the minimum distance (maximum similarity) value. We build on TED as an effective and efficient means to compare semi-structured data, e.g., XML documents [8, 6, 48], which has been proven optimal in structural similarity evaluation, w.r.t. less accurate methods [7]. Also, note that our XML grammar tree model considers complex declarations, including: i) repeatable sequence expressions, ii) repeatable alternative expressions, and iii) recursive expressions, which have been disregarded in most existing studies, e.g., [9, 3, 57]. In addition, our grammar tree model is not limited to context-free (DTD-like) grammar declarations, where the definition of an element is unique and independent of its position in the grammar, but can be used with context-sensitive (XSD-based) declarations, where identically labeled elements can have multiple definitions in different contexts in the grammar.
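The split of a declaration such as (a | (b, c)) into Or-free declarations can be sketched as follows. This is only an illustration under stated assumptions: the nested-tuple encoding and the function name `conjunctive_forms` are hypothetical (they are not taken from the paper), and cardinality operators and element nesting are deliberately ignored.

```python
from itertools import product

# Hypothetical encoding (an assumption, not the paper's data model):
#   a leaf is a string such as "a";
#   ("seq", e1, e2, ...) is an And sequence; ("or", e1, e2, ...) is an Or choice.

def conjunctive_forms(expr):
    """Return the Or-free sequences (tuples of leaf labels) covered by expr."""
    if isinstance(expr, str):
        return [(expr,)]
    op, *children = expr
    child_forms = [conjunctive_forms(child) for child in children]
    if op == "or":
        # A choice contributes the alternatives of each branch separately.
        return [form for forms in child_forms for form in forms]
    if op == "seq":
        # A sequence concatenates one alternative per child, in order.
        return [sum(combo, ()) for combo in product(*child_forms)]
    raise ValueError(f"unknown operator: {op}")

# The declaration (a | (b, c)) from the text.
print(conjunctive_forms(("or", "a", ("seq", "b", "c"))))  # [('a',), ('b', 'c')]
```

Each returned tuple would correspond to one conjunctive grammar to be represented as a separate tree; the number of tuples can grow multiplicatively when several choices appear in the same sequence.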
[Figure 1: activity diagram. The XML document D and the XML grammar G are each given a tree representation. The grammar tree representation applies transformation rules, the one-to-one representation, and the disjunctive normal form, yielding a set of conjunctive grammar trees {C}_G. The Tree (Edit Distance) Comparison component runs the TOC module (edit operations costs) and the TED algorithm, with multi-thread processing for each C_i ∈ {C}_G, and returns Sim(D, G) = Max{Sim(D, C_i)}.]

Fig. 1. Simplified activity diagram describing our XML document/grammar comparison framework.

A prototype system called XS3 (XML Structure & Semantic Similarity) has been developed to evaluate and validate our approach, conducting a large battery of experiments on large XML datasets, covering: One to One (comparing one document to one grammar), One to Many (comparing one XML document to a set of grammars and vice-versa) and Set comparison (enabling XML document/grammar classification and ranked retrieval). Results highlight fine-grained (accurate) similarity scores, produced in typical case polynomial time.

The remainder of the paper is organized as follows. Section 2 presents preliminary notions. Section 3 describes our XML grammar tree representation model. Our XML document-grammar structure comparison algorithms are developed in Section 4. Section 5 presents the experimental tests. Section 6 briefly reviews the state of the art in XML document/grammar similarity approaches and related problems. Section 7 concludes the paper.

2. Preliminaries

2.1. XML Document Representation Model

Following the Document Object Model (DOM) [77], XML documents represent hierarchically structured information and can be represented as rooted ordered labeled trees.

Definition 1 - Rooted Ordered Labeled Tree: It is a rooted tree in which the nodes are labeled and ordered. We denote by T[i] the i-th node of T in preorder traversal, T[i].ℓ its label, T[i].d its depth, and T[i].Deg its out-degree (i.e., the node's fan-out). R(T) = T[0] designates the root node of tree T. In the remainder of this paper, the terms tree and rooted ordered labeled tree are used interchangeably.

Definition 2 - XML Document Tree: It is a rooted ordered labeled tree in which the nodes represent XML elements/attributes, labeled following element/attribute tag names. Element nodes are ordered following their order of appearance in the XML document. Attribute nodes appear as children of their encompassing element nodes, sorted left-to-right by attribute name, and appearing before sub-element siblings [48, 83].

Note that the order of attributes (unlike elements) is irrelevant in native XML [], yet in the context of XML structure comparison and processing, attribute nodes are usually ordered (as described above) so as to reduce the complexity of the similarity evaluation process [48, 83]. Element/attribute values can be disregarded (structure-only) or considered (structure-and-content) in the comparison process following the application scenario (e.g., structure-only comparison is usually performed when processing heterogeneous documents for clustering/classifying [6, 48], whereas data values are generally considered in XML change management and data integration [5, 4]). In this paper, we address heterogeneous XML document-grammar comparison, and thus target element/attribute tag names (structure-only comparison) rather than data values. A sample XML document structure is depicted in Fig. 2.a.

a. Sample XML document Paper.xml, and corresponding tree representation (the dots indicate the place where XML element/attribute data values reside):

    <?xml?>
    <Paper title="...">
      <Publisher>
        <FirstName> ... </FirstName>
        <LastName> ... </LastName>
      </Publisher>
      <Version> ... </Version>
      <Length> ... </Length>
      <url>
        <Paper> ... </Paper>
        <Download>
          <url>
            <Paper> ... </Paper>
            <Download> ... </Download>
          </url>
        </Download>
      </url>
    </Paper>

    XML tree representation D of Paper.xml:
    Paper(Title, Publisher(FirstName, LastName), Version, Length, url(Paper, Download(url(Paper, Download))))

b. Sample XML grammars, in DTD and XSD syntaxes:

    XML grammar in DTD syntax:

    <!DOCTYPE [
      <!ELEMENT Paper ((Author* | Publisher), Version, Length?, url?)>
      <!ELEMENT Publisher (FirstName?, LastName)>
      <!ELEMENT url (Homepage, Download+)>
      <!ELEMENT Download (url?)>
      <!ELEMENT Author (#PCDATA)>
      <!ELEMENT Version (#PCDATA)>
      <!ELEMENT Length (#PCDATA)>
      <!ELEMENT FirstName (#PCDATA)>
      <!ELEMENT LastName (#PCDATA)>
      <!ELEMENT Homepage (#PCDATA)>
    ]>

    Production rules (structural model definitions) describing the structure of the DTD grammar above in formal language:

    Paper → (Author* | Publisher), Version, Length?, url?
    Publisher → FirstName?, LastName
    url → Homepage, Download+
    Download → url?

    XML grammar in simplified XSD syntax (allowing a higher degree of expressiveness in defining structural constraints and data-types):

    <?XML?>
    <schema>
      <element name="Paper">
        <sequence>
          <choice>
            <element name="Author" minoccurs="0" maxoccurs="unbounded"/>
            <element name="Publisher">
              <sequence>
                <element name="FirstName" minoccurs="0" type="String"/>
                <element name="LastName" type="String"/>
              </sequence>
            </element>
          </choice>
          <element name="Version" type="Decimal"/>
          <element name="Length" minoccurs="0" type="Decimal"/>
          <element name="url" minoccurs="0">
            <sequence>
              <element name="Homepage" type="URI"/>
              <element name="Download" maxoccurs="unbounded" type="URI">
                <element ref="url" minoccurs="0"/>
              </element>
            </sequence>
          </element>
        </sequence>
      </element>
    </schema>

Fig. 2. Sample XML document and XML grammars.

Note that hyper-links in XML documents (e.g., XLinks and IDREFs) and other types of nodes such as entities, comments and notations are usually disregarded in most existing structure comparison methods, e.g., [8, 6, 3, 33, 48], since they are not considered part of the core structure of XML documents.
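As a minimal sketch of the document tree model defined above, the snippet below lists the nodes of an XML document in preorder, with attribute nodes sorted by name and placed before sub-element siblings, and with data values ignored. The function name `preorder_nodes`, the (label, depth, out-degree) tuple layout, and the choice of depth 0 for the root are illustrative assumptions, not part of the paper.

```python
import xml.etree.ElementTree as ET

def preorder_nodes(xml_text):
    """Return [(label, depth, out_degree), ...] in preorder (structure only)."""
    root = ET.fromstring(xml_text)
    nodes = []

    def visit(elem, depth):
        attrs = sorted(elem.attrib)      # attribute nodes, sorted by name
        children = list(elem)            # sub-elements, in document order
        nodes.append((elem.tag, depth, len(attrs) + len(children)))
        for name in attrs:               # attributes precede sub-elements
            nodes.append((name, depth + 1, 0))
        for child in children:
            visit(child, depth + 1)

    visit(root, 0)                       # assumption: the root has depth 0
    return nodes

# Hypothetical fragment loosely modeled on the sample document.
doc = ('<Paper title="x"><Publisher><FirstName/><LastName/></Publisher>'
       '<Version/></Paper>')
for node in preorder_nodes(doc):
    print(node)
```

The position of a tuple in the returned list plays the role of the preorder index i in T[i]; the three fields correspond to the label, depth and out-degree of the node.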

2.2. XML Grammar Representation Model

An XML grammar (e.g., DTD [6] or XSD [3]) is an entity consisting of a set of expressions describing XML element/attribute structural positions and data-types, and defining the rules elements/attributes adhere to in corresponding document instances (cf. Fig. 2.b). The structural properties of XML grammars are basically captured by regular tree languages [46], XML grammars being viewed as special regular tree grammars [, 46, 47]. In formal language theory [34], a regular tree grammar consists of a set of production rules to transform trees. Formally:

Definition 3 - Regular Tree Grammar: It is represented as a tuple G = (N, T, R, p) where N is a set of non-terminal symbols, T is a set of terminal symbols, R is a set of regular expressions over N ∪ T, and p is a function p: N → R that associates a non-terminal symbol n ∈ N with a regular expression r_n ∈ R, producing a set of production rules of the form n → r_n. The language L(G), defined based on grammar G, consists of all the possible trees that can be generated following the set of symbols and production rules defined in G [46].

Definition 4 - XML Grammar: It can be viewed as a special regular tree grammar [, 46, 47], where each symbol underlines an element e, such that non-terminal symbols underline composite XML element labels, terminal symbols underline simple (leaf node) element labels or attribute labels, and where the right hand sides of their production rules e → r_e are made of special regular expressions r_e which we identify as structural models (or structural expressions), defined using combinations of XML grammar constraint operators (instead of traditional regular expression operators). XML grammar constraint operators specify rules on the existence and repeatability of elements/attributes, namely: cardinality constraints, i.e., ?, *, + in DTDs, MinOccurs and MaxOccurs in XSDs, and alternativeness constraints: And (sequence) and Or (choice) operators. In addition, special production rules are introduced in XML grammar languages (which do not exist in traditional tree languages [34]) to encode XML element data-type content models (e.g., #PCDATA, String, Decimal, gYear, cf. Fig. 2.b).
Note that the DTD language [6] allows context-free grammars (local tree grammars) [46], which means that the structural model associated to a given element is independent of its position (i.e., context) in the document, the element being identified by its label (i.e., for an element e in grammar G, there exists only one possible production rule e → r_e, i.e., only one possible structural model r_e). In contrast, XSD [3] allows context-sensitive grammars (single type tree grammars) [34] where the structural model associated to an element depends on its position in the document (e.g., one might have more than one production rule sharing the same element e in the grammar, e.g., e → r_e and e → r_e', following the element's structural position). For further details, a study highlighting the correlation between XML grammar languages (namely DTD and XSD) and regular tree languages can be found in [46].

2.3. XML Document/Grammar Structural Similarity

We identify two kinds of XML document/grammar structure similarity: i) Boolean comparison, referring to traditional XML structure validation, and ii) ranked comparison, which we refer to as approximate XML structure validation.

Definition 5 - XML Structure Validation (Boolean Comparison): denoted G ⊨ D, an XML document (tree) D is deemed valid w.r.t. an XML (regular tree) grammar G (i.e., D conforms to G), if all element (attribute) tags in D match the element (attribute) structural models defined in G, considering structural model constraint operators. In other words, the result of the validation operation would be a Boolean value (true or false) indicating whether the document is valid (or not) w.r.t. the grammar, which comes down to checking whether the document tree is included in the language defined by the grammar, i.e., if D ∈ L(G).

Definition 6 - Approximate XML Structure Validation (Ranked Comparison): denoted G ≈_Sim D, approximate XML structure validation between an XML document (tree) D and an XML (regular tree) grammar G, with similarity (relevance) score Sim ∈ [0, 1] (i.e., D approximately conforms to G with similarity score Sim), is defined as the structural comparison (matching) between the element/attribute tag names in D and the structure models in G, in order to determine the best matches possible. Corresponding (best) matching scores are compiled into an overall similarity (relevance) score, highlighting the structural relatedness between D and G. In other words, the similarity score underlines the degree of membership of D w.r.t. the grammar (regular tree) language L(G).

Note that in the remainder of the paper, we sometimes use the simple notation G ≈ D to designate that document D approximately validates grammar G (omitting the similarity score for ease of presentation). Also note that we adopt the concept of similarity as the inverse of a distance function, i.e., a smaller distance value underlining a higher similarity between the XML document and grammar being compared, and vice-versa.

Note: In language theory, terminal symbols are those not assigned to production rules, and thus cannot be broken down to smaller units.
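The relation between Boolean and ranked comparison can be sketched as follows, assuming that an edit distance between the document tree and each conjunctive grammar tree has already been computed. The mapping Sim = 1 / (1 + Dist) is one common way of inverting a distance and is an assumption here, not necessarily the formula used by the authors; the maximum over conjunctive trees follows the Max{Sim(D, C_i)} aggregation named in the framework diagram. All function names are hypothetical.

```python
def similarity(distance):
    """Assumed mapping from an edit distance in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)

def grammar_similarity(distances):
    """Sim(D, G): best score over the conjunctive grammar trees of G."""
    return max(similarity(d) for d in distances)

def is_valid(distances):
    """Exact (Boolean) validation: some conjunctive tree is at distance 0."""
    return min(distances) == 0

def rank(grammars):
    """Rank grammar names by decreasing similarity to a single document."""
    scored = {name: grammar_similarity(ds) for name, ds in grammars.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Made-up distances from one document to the conjunctive trees of three grammars.
distances = {"G1": [3, 1], "G2": [0, 4], "G3": [7]}
print(rank(distances))
print({name: is_valid(ds) for name, ds in distances.items()})
```

Under this assumed mapping, exact validation is the special case where the best score equals 1 (distance 0), which mirrors the idea of imposing a maximum similarity threshold on the returned results.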

3. XML Grammar Tree Representation

The main idea is to compute a tree representation of the XML grammar in order to apply tree-to-tree edit distance for computing the document/grammar similarity. To do so, we aim to unfold the XML (regular tree) grammar G structural expressions into a set of conjunctive grammar trees {C} of equivalent expressiveness, such that comparing document tree D with grammar G would come down to comparing D with {C}. Here, the main difficulties in XML document/grammar tree-to-tree comparison lie within the disparities in the representation and processing of: i) repeatable expressions defined via the And operator (cf. Fig. 3.a), ii) alternative declarations defined via the Or operator (cf. Fig. 3.b), and iii) recursive declarations (which could induce infinite loops of elements, cf. Fig. 3.c).

[Figure 3: three panels, each pairing a DTD declaration and its DTD tree representation following [] with a sample XML document tree conforming to the declaration. a. Repeatable expression. b. Alternative declaration. c. Recursive declaration (recursive declarations are not considered in the DTD tree representation model proposed in []).]

Fig. 3. Disparities in tree representations between XML document and grammar structures, following the grammar (DTD) tree representation model in [] (one of the central methods in the literature).

Intuitively, the higher the disparities in document and grammar tree representations, the more complicated it becomes to perform the tree comparison (matching) task. Hence, we need to have expressive, yet simplified (flattened) XML grammar trees, which are (more easily) comparable to XML document trees. To do so, we proceed in three phases:

XML Grammar Transformation Rules: First, we introduce a number of transformation rules to flatten XML grammar declarations, considering the most common XML grammar constraints.
One-to-One Document/Grammar Representation: Second, we extend transformation rules to further simplify repeatable and recursive declarations in the grammar w.r.t. each document tree being compared (one-to-one).

XML Grammar Tree Model based on the Disjunctive Normal Form: Third, the resulting simplified (flattened) grammar is represented as a set of conjunctive grammars made solely of sequence declarations (i.e., elements connected via an And operator), eliminating alternative declarations (i.e., elements connected via the Or operator), producing grammar tree structures which are (more easily) comparable to document trees.

The remainder of this section develops each of the phases mentioned above, and provides examples.

3.1. XML Grammar Transformation Rules and Properties

An XML grammar transformation rule can be viewed as a binary function that transforms an XML structural expression into another, thus transforming one grammar into another. Formally:

Definition 7 - XML Grammar Transformation: Let Ω denote the domain of XML grammar structural expressions (the set of all grammar structural expressions allowed in our study, cf. Definition 4). A transformation rule R is defined as a function R: Ω → Ω, associating an input structural expression r_e ∈ Ω with an output structural expression r_e' ∈ Ω, such that r_e' results from the application of transformation rule R on r_e, denoted r_e →R r_e'. When applied to all the structural expressions in an XML grammar G, i.e., ∀ r_e ∈ G, r_e →R r_e', we say that R is applied to G, and transforms it into an output grammar G' made of the transformed expressions r_e' ∈ G', denoted G →R G'.

Definition 8 - Information Structure Preserving (ISP) property: Given an XML grammar (structural expression in) G and a grammar transformation rule R applied to G, resulting in G', i.e., G →R G', rule R is deemed information structure preserving if any XML document tree D that conforms to G also conforms to G' and vice-versa, i.e., ∀ D, G ⊨ D ⇔ G' ⊨ D. In other words, the original and the transformed grammar (structural expressions in) G and G' have the same structural expressiveness, denoted G ≡ G'.

The transformation rules we provide in our study (cf.
Table 1) verify the ISP property in most practical cases (with one exception discussed subsequently), i.e., they maintain the expressiveness of the input grammar's structural models. They can be grouped in three main categories: simple expression flattening (Rule 1), repeatable sequence expression flattening (Rule 2) and repeatable alternative expression flattening (Rule 3). Hereunder, we utilize DTD-like syntax (even when presenting XSD operators) to ease the presentation. We introduce a simplified notation for MinOccurs and MaxOccurs, such that an element (expression) e that is associated with cardinality constraints MinOccurs = x ∧ MaxOccurs = y is noted $e_x^y$. Note that an element (expression) e with no associated cardinality constraints is identified as having a null constraint, which is equivalent to $e_1^1$, i.e., it appears exactly once in the XML document. We also highlight the notion of empty structural model (utilized in defining our transformation rules): given an XML grammar G, an element e ∈ G has an empty structural model, noted e → ∅ (i.e., r_e = ∅), when e does not encompass any sub-elements, and corresponds to a leaf node in the XML document tree instance. Recall that we only target XML structure in our current study, and hence do not discuss element content data-types and values. Thus, elements with basic content models (e.g., PCDATA, String, Integer, etc.) will be processed as empty structural models (e.g., <!ELEMENT dummy (#PCDATA)> will be processed as production rule: dummy → ∅).

Table 1. Outline of our XML grammar transformation rules. Note that A and B designate XML grammar structural expressions.

Rule 1. Type: simple expression flattening. ISP property: special case.
    $(A_x^y)_u^v \xrightarrow{R1} A_{x \times u}^{y \times v}$ (general rule, handling both MinOccurs and MaxOccurs).

Rule 2. Type: repeatable sequence expression flattening (And). ISP property: verified.
    Rule 2.a, simplified version of Rule 2 handling the MinOccurs constraint:
    $(A, B)_x \xrightarrow{R2.a} (A, B), \ldots, (A, B)$, where (A, B) is repeated x times.
    Rule 2.b, simplified version of Rule 2 handling the MaxOccurs constraint:
    $(A, B)^y \xrightarrow{R2.b} ((A, B)_0^1), \ldots, ((A, B)_0^1)$, where $((A, B)_0^1)$ is repeated y times.
    Rule 2:
    $(A, B)_x^y \xrightarrow{R2} (A, B), \ldots, (A, B), ((A, B)_0^1), \ldots, ((A, B)_0^1)$, where (A, B) is repeated x times, and $((A, B)_0^1)$ is repeated z = y − x times.

Rule 3. Type: repeatable alternative expression flattening (Or). ISP property: verified.
    Rule 3.a, simplified version of Rule 3 handling the MinOccurs constraint:
    $(A | B)_x \xrightarrow{R3.a} (A | B), \ldots, (A | B)$, where (A | B) is repeated x times.
    Rule 3.b, simplified version of Rule 3 handling the MaxOccurs constraint:
    $(A | B)^y \xrightarrow{R3.b} (A_0^1 | B_0^1), \ldots, (A_0^1 | B_0^1)$, where $(A_0^1 | B_0^1)$ is repeated y times.
    Note that $(A_0^1 | B_0^1)$ underlines that either A or B can occur, or nothing at all, which is different from the optional sequence ((A, B) | ε) used in Rule 2.b, underlining that A and B must occur together, or nothing at all.
    Rule 3:
    $(A | B)_x^y \xrightarrow{R3} (A | B), \ldots, (A | B), (A_0^1 | B_0^1), \ldots, (A_0^1 | B_0^1)$, where (A | B) is repeated x times, and $(A_0^1 | B_0^1)$ is repeated z = y − x times.

The transformation rules in Table 1 verify the ISP property (cf. proofs in [73]), to the exception of Rule 1, which verifies the ISP property in some practical cases, but not in the general case:

Lemma 1 - Given an XML grammar expression of the form $(A_x^y)_u^v$, transformation Rule 1 complies with the ISP property when any of the following conditions holds:
    Condition 1: (x = y = 1) or (u = v = 1) (i.e., $(A_1^1)_u^v \xrightarrow{R1} A_u^v$; $(A_x^y)_1^1 \xrightarrow{R1} A_x^y$).
    Condition 2: (x = u = 0) and (y = 1 or v = 1) (i.e., $(A_0^1)_0^v \xrightarrow{R1} A_0^v$; $(A_0^y)_0^1 \xrightarrow{R1} A_0^y$).
    Condition 3: (x = y) and (u = v) (i.e., $(A_x^x)_u^u \xrightarrow{R1} A_{x \times u}^{x \times u}$).
Rule 1 may not comply with the ISP property otherwise.

For instance, ISP holds when transforming DTD expressions such as: (A*)?, which is equivalent to $(A_0^{\infty})_0^1 \xrightarrow{R1} A_0^{\infty}$; (A+)?, which is equivalent to $(A_1^{\infty})_0^1 \xrightarrow{R1} A_0^{\infty}$; and (A+)*, which is equivalent to $(A_1^{\infty})_0^{\infty} \xrightarrow{R1} A_0^{\infty}$, since one of the conditions of Lemma 1 holds.

Note: The special case of MaxOccurs = unbounded is covered in the following section.
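Whether flattening a nested cardinality expression preserves expressiveness can be checked by brute force for finite bounds. The sketch below assumes the reading that an expression A with MinOccurs = x and MaxOccurs = y, nested in a group repeated between u and v times, is flattened into A with MinOccurs = x × u and MaxOccurs = y × v; it then compares the sets of accepted occurrence counts. Function names are hypothetical, and unbounded (infinite) cardinalities are not handled.

```python
def occurrence_counts(x, y, u, v):
    """Occurrence counts of A accepted by A{x..y} nested in a group {u..v}."""
    counts = set()
    for k in range(u, v + 1):                   # number of outer repetitions
        counts.update(range(k * x, k * y + 1))  # k inner groups, each x..y
    return counts

def flattening_preserves_structure(x, y, u, v):
    """True if A{x*u..y*v} accepts exactly the same occurrence counts."""
    flattened = set(range(x * u, y * v + 1))
    return occurrence_counts(x, y, u, v) == flattened

# x = y and u = v: a single admissible count on both sides.
print(flattening_preserves_structure(3, 3, 2, 2))
# Inner {2..2} nested in outer {1..2}: the original accepts 2 or 4 occurrences
# only, whereas the flattened form would also accept 3.
print(flattening_preserves_structure(2, 2, 1, 2))
```

This only compares occurrence counts of a single expression A, which is enough to expose the gaps that make the flattened expression more permissive than the original one.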

Likewise, when transforming expression $(A_3^3)_2^2 \xrightarrow{R1} A_6^6$, ISP holds since Condition 3 of Lemma 1 holds. However, consider an expression of the form $(A_2^2)_1^2 \xrightarrow{R1} A_2^4$. Here, the ISP property does not hold, since the resulting expression does not preserve the structural expressiveness of its original counterpart (none of Lemma 1's conditions holds): the transformed expression underlines that expression A can occur a minimum of 2 times and a maximum of 4 times (i.e., it accepts 2-to-4 occurrences of element dummy), whereas the original expression underlines that expression A can occur either 2 times, or 4 times only.

In addition, we extend the repeatable (sequence/alternative) expression flattening rules in Table 1 to handle the special cases of MaxOccurs = Unbounded (infinite repetitions) and recursive declarations, as shown in the following.

3.2. One-to-One Document/Grammar Representation

3.2.1. Handling Repeatable Expressions

As described in Table 1, the simplification of repeatable sequence and alternative expressions, of the form $(A, B)_x^{\infty}$ and $(A | B)_x^{\infty}$, following Rule 2 and Rule 3, requires the infinite repetition of flattened expressions $((A, B)_0^1)$ and $(A_0^1 | B_0^1)$ respectively, in order to verify the ISP property. Yet since our method is one-to-one in comparing one single XML document to one single XML grammar, repeatable sequence/alternative expressions can be further simplified without loss of expressiveness in the context of the XML document at hand. This can be done by repeating the flattened expressions, i.e., $((A, B)_0^1)$ w.r.t. Rule 2, and $(A_0^1 | B_0^1)$ w.r.t. Rule 3, only a finite number of times in the transformed grammar, necessary to cover all possible structural configurations of the concerned XML elements/expressions following the original grammar expression. Hence, we propose extensions of transformation Rule 2 (namely Rule 2.b) and Rule 3 (namely Rule 3.b) in Table 2. We note by $(G \equiv G')_D$ a grammar G and a transformed grammar G' having the same structural expressiveness w.r.t. a particular document tree D. We say that the rule transforming G into G' verifies the ISP property w.r.t. document tree D.

Table 2. Outline of one-to-one document/grammar transformation rules (given an XML document tree D). Note that A and B designate XML grammar structural expressions.

Rule 2+. Type: repeatable sequence expression flattening.
    Simplified version of Rule 2+ handling the MaxOccurs constraint (MinOccurs is handled the same as in Rule 2.a):
    $(A, B)^{\infty} \xrightarrow{R2.b+} ((A, B)_0^1), \ldots, ((A, B)_0^1)$, where $((A, B)_0^1)$ is repeated $ceil(MaxDeg(D) / |E|)$ times, such as $E = (A, B)^{\infty}$ and |E| denotes the expression's cardinality w.r.t. the main And sequence operator (e.g., |E| = 2 for E = (A, B)*, |E| = 3 for E = (A, B, C)*).
    Rule 2+:
    $(A, B)_x^{\infty} \xrightarrow{R2+} (A, B), \ldots, (A, B), ((A, B)_0^1), \ldots, ((A, B)_0^1)$, where (A, B) is repeated x times, whereas $((A, B)_0^1)$ is repeated $z = ceil(MaxDeg(D) / |E|) - x$ times.

Rule 3+. Type: repeatable alternative expression flattening.
    Simplified version of Rule 3+ handling the MaxOccurs constraint (MinOccurs is handled the same as in Rule 3.a):
    $(A | B)^{\infty} \xrightarrow{R3.b+} (A_0^1 | B_0^1), \ldots, (A_0^1 | B_0^1)$, where $(A_0^1 | B_0^1)$ is repeated MaxDeg(D) times.
    Rule 3+:
    $(A | B)_x^{\infty} \xrightarrow{R3+} (A | B), \ldots, (A | B), (A_0^1 | B_0^1), \ldots, (A_0^1 | B_0^1)$, where (A | B) is repeated x times, and $(A_0^1 | B_0^1)$ is repeated z = MaxDeg(D) − x times.

Rule 4. Type: recursive expression flattening.
    A strong-linear recursive expression defined on element e, denoted $e_{ref} \leadsto e_{rec}$:
    $e_{ref} \leadsto e_{rec} \xrightarrow{R4} e_1 \rightarrow e_2 \rightarrow \ldots \rightarrow e_{n-1} \rightarrow e_n$, where $e_i$ denotes the i-th nested occurrence of element e, repeated $n = ceil((Depth(D) + 1) / NestDepth(e_{ref}, e_{rec})_G)$ times.

ISP property: Given an XML grammar (structural expression in) G, and an XML document tree D to be compared with G, the application of Rule 2+ and/or Rule 3+ to grammar (structural expressions in) G, considering document tree D, verifies the ISP property w.r.t. D (cf. proofs in [73]). More importantly, Rule 2+ and Rule 3+ allow simplifying (possibly) infinite size expressions (originally required to preserve grammar expressiveness in the general case) into expressions whose sizes vary linearly w.r.t. the maximum out-degree of the document tree, MaxDeg(D) (cf. example in Fig. 4.a).

Note: The function ceil(x) returns the smallest integer value that is not less than x. The formulas' proofs are provided in [73].
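The document-dependent repetition count used when flattening an unbounded sequence can be sketched as below: mandatory copies first, then enough optional copies to reach ceil(MaxDeg(D) / |E|) in total. The helper names, the string rendering of groups (with "?" standing for an optional copy), and the sample document are illustrative assumptions; attributes are counted as child nodes, in line with the document tree model.

```python
import math
import xml.etree.ElementTree as ET

def max_degree(elem):
    """Largest out-degree in the tree rooted at elem (attributes count as children)."""
    own = len(elem.attrib) + len(elem)
    return max([own] + [max_degree(child) for child in elem])

def flatten_unbounded_sequence(members, min_occurs, doc_root):
    """Flatten an unbounded repeatable sequence w.r.t. one document tree."""
    total = math.ceil(max_degree(doc_root) / len(members))
    optional = max(total - min_occurs, 0)
    group = "(" + ", ".join(members) + ")"
    # Mandatory copies come first, followed by optional (0..1) copies.
    return [group] * min_occurs + [group + "?"] * optional

# Hypothetical document whose root has five children, compared against (a, b)*.
doc = ET.fromstring("<Root><a/><b><c/></b><a/><b/><a/></Root>")
print(flatten_unbounded_sequence(["a", "b"], 0, doc))
```

With a maximum out-degree of 5 and a sequence of two members, the sketch yields three optional copies, so the size of the flattened expression depends on the document at hand rather than on an infinite repetition.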

[Figure 4: three panels.

a. Flattening repeatable sequence expressions via the application of Rule 2+. A sample XML tree D (rooted at Root) is compared with a grammar G declaring Root as the repeatable sequence (a, b)*. Grammar G' is obtained from G following Rule 2+ w.r.t. XML document tree D, such that expression $((a, b)_0^1)$ is repeated ceil(MaxDeg(D) / 2) = 3 times: <!ELEMENT Root (((a, b)_0^1), ((a, b)_0^1), ((a, b)_0^1))>. This simplifies expression (a, b)* while preserving expressiveness w.r.t. XML document tree D. Note that c* need not be simplified and will be handled by the edit distance algorithm.

b. Sample recursive XML grammar declarations (represented in DTD syntax for ease of presentation):
    (1) <!ELEMENT Paper (Title, Paper, (Download | Paper*))>   multiple occurrences of recursive declarations in the structural model (non-linear recursive declaration)
    (2) <!ELEMENT Paper (Title, Paper, (download | url))>   strong-linear recursive declaration
    (3) <!ELEMENT Paper (Title, (download | Paper?))>   strong-linear recursive declaration

c. Flattening recursive XML declarations. An XML tree D containing three nested occurrences of Root is compared with the recursive grammar G (given in both DTD and XSD syntaxes):
    <!ELEMENT Root (a, b)>    // Root ref
    <!ELEMENT b (Root?)>      // Root rec
    <!ELEMENT a (#PCDATA)>
Following Rule 4, G is flattened into a non-recursive grammar G' in which the declaration of Root is nested three times (1st, 2nd and 3rd occurrence), while preserving grammar expressiveness w.r.t. document tree D (DTD syntax cannot be used here since the grammar is context-sensitive).]

Fig. 4. Flattening repeatable and recursive XML declarations.

3.2.2. Handling Recursive Declarations
The problem of handling recursive declarations is comparable to that of handling repeatable expressions, in that the recursive expression would have to be repeated (in certain cases) an infinite number of times in the flattened XML grammar to preserve structural expressiveness. Yet, since our method is one-to-one, comparing one XML document to one XML grammar, recursive grammar expressions can be simplified without loss of expressiveness in the context of the XML document at hand. To do so, we propose to repeat the recursive nesting only a finite number of times, necessary to cover all possible structural configurations of the concerned XML elements (following the original recursive declaration) in the XML document tree being compared. Note that in our current study, we only consider strong-linear recursive declarations, which are the most common in practice [, 3], and which are inherently easier to process than non-linear recursive expressions. Formally:

Definition 9 Recursive XML Grammar: An XML grammar G is recursive if it contains an element e reachable from itself, denoted by $e_{ref} \rightarrow e_{rec}$, where $e_{ref}$ underlines the original element declaration and $e_{rec}$ its recursive reference in the grammar (e.g., in DTD expression <!ELEMENT dummy (a, b, dummy)>, the first dummy element is denoted $dummy_{ref}$ whereas the second is denoted $dummy_{rec}$).

Definition 10 Strong-Linear Recursive XML Grammar: An XML grammar G is strong-linear recursive if it is recursive, and if for each recursive element $e_{ref} \rightarrow e_{rec}$ in G, $e_{rec}$ occurs at most once in the structural expression of its containing element, such that $e_{rec}$ is not repeatable, and where every other element reachable from $e_{ref}$ ($\forall e \in G$ such that $e_{ref} \rightarrow e$) is non-recursive [, 3] (cf. Fig. 4.b).
Following Rule 4, a strong-linear recursive declaration $e_{ref} \rightarrow e_{rec}$ is transformed into a chain of non-recursive nestings consisting of the elements comprised within $e_{Ref}$ and $e_{Rec}$, repeated a finite number of times which is linear in the XML document tree depth (Depth(D)) and the nesting depth of the recursive declaration ($NestDepth(e_{Ref}, e_{Rec}) = |e_{Ref}.d - e_{Rec}.d|$), in order to preserve structural expressiveness w.r.t. the XML document tree. Hence, given a grammar G and a document tree D to be compared with G, the flattening of strong-linear recursive declarations in G, following Rule 4, produces a transformed grammar G' which verifies the ISP property w.r.t. D, i.e., G and G' have the same structural expressiveness w.r.t. D (cf. proof in [73]).
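As a small illustration of Rule 4, the number of nested copies can be computed directly from the two depths. The sketch below assumes the count is n = ceil((Depth(D) + 1) / NestDepth(e_ref, e_rec)); the function name and argument names are hypothetical.

```python
import math


def recursive_nesting_count(doc_depth, nest_depth):
    """Number of times a strong-linear recursive declaration is unfolded.

    doc_depth  -- Depth(D), the depth of the document tree
    nest_depth -- NestDepth(e_ref, e_rec), the depth difference between the
                  original declaration and its recursive reference
    """
    if nest_depth <= 0:
        raise ValueError("a recursive reference must lie below its declaration")
    return math.ceil((doc_depth + 1) / nest_depth)


if __name__ == "__main__":
    # A document tree of depth 5 and a recursion nested 2 levels deep.
    print(recursive_nesting_count(5, 2))
```

Because the count depends only on the document at hand, the unfolded grammar stays finite even though the original recursive declaration describes unboundedly deep documents.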

A simple example is presented in Fig. 4.c. Here, the recursive declaration defined on node Root in grammar G is transformed into a chain of non-recursive nestings consisting of the elements comprised within $Root_{Ref}$ and $Root_{Rec}$, repeated $\lceil (5 + 1) / 2 \rceil = 3$ times w.r.t. the XML document tree depth (= 5) and the nesting depth of the recursive declaration (= 2).

Applying Transformation Rules

As discussed above, most of our transformation rules verify the ISP property, with the exception of one rule which only conditionally complies with ISP. As a result, the transformation rules (with the exception of that rule) verify the Church-Rosser property [35, 76] and can be applied to an input XML grammar, in any sequence order, producing an output grammar having the same structural expressiveness as the input grammar:

Definition 13 Extended Church-Rosser (ECR) property: Let Ω be the domain of (structural expressions in) XML grammars, and ρ = {R_i, R_j, R_k} be the set of XML grammar (expression) transformation rules defined on Ω. ρ has the extended Church-Rosser property w.r.t. Ω if
$$\forall G_1, G_2, G_3 \in \Omega,\ \forall R_i, R_j \in \rho:\ \big[(G_1 \xrightarrow{R_i} G_2) \wedge (G_1 \xrightarrow{R_j} G_3)\big] \Rightarrow \exists\, G_4, G_4' \in \Omega,\ \exists\, R_k, R_l \in \rho \text{ such that } \big[(G_2 \xrightarrow{R_k} G_4) \wedge (G_3 \xrightarrow{R_l} G_4')\big] \text{ where } G_4 \equiv G_4'$$
i.e., the resulting (structural expressions in) grammars $G_4$ and $G_4'$ have the same structural expressiveness (cf. ECR diagram in Fig. 5.a).

In other words, given an input XML grammar G, the transformation rules (except the conditional case noted above) can be applied to G in any sequence order, always resulting in a transformed grammar G' having the same structural expressiveness as its original counterpart ($G \equiv G'$). The same carries over to our one-to-one document/grammar transformation rules (cf. proof in [73]). Consider the example in Fig. 5.b. Input grammar declaration (b, ((c, d, e)+ | f)?) is transformed, without any loss of expressiveness, to (b, ((((c, d, e) | ε), ((c, d, e) | ε), ...) | f)), via the application of two different sequences of transformation rules.
In the resulting grammar declaration, all repeatable expressions have been flattened, cardinality constraints being uniquely associated to single elements (i.e., b and f) without loss of expressiveness.

Fig. 5. Visual description and sample application of XML grammar transformations preserving the ECR property.
a. Visual description of the ECR (Extended Church-Rosser) property, w.r.t. XML grammar transformation.
b. XML grammar transformation example, using two different sequences of XML grammar transformation rules. Both sequences start from the input grammar declaration G1: <!ELEMENT a (b, ((c, d, e)+ | f)?)>. The first sequence applies Rule 3 first, the second applies Rule 3 in second position, and both yield the same result G3: <!ELEMENT a (b, ((((c, d, e) | ε), ((c, d, e) | ε), ...) | f))>.

Note that the ISP and ECR properties reflect the correctness and completeness of the transformation rules. Minimality also seems intuitive, since the transformation rules have been specifically defined to deal with exactly each of the common grammar constraints considered in our study, and cannot be further reduced. Nonetheless, the nature of the transformation rules applied during a grammar simplification task might differ depending on the nature of the grammar expressions (as shown in the example of Fig. 5.b). Hence, despite yielding the same end result (the same transformed grammar), the minimality of the number of transformations applied might not always be guaranteed.

3.4. XML Grammar Tree Model Based on the Disjunctive Normal Form

To produce simple grammar tree structures comparable to document trees, we propose to unfold an XML grammar into a single tree or a set of trees, depending on the occurrences of the And (sequence) and Or (choice) operators. To do so, we introduce the disjunctive normal form of an XML grammar as a set of conjunctive grammars:

Definition 14 Conjunctive XML Grammar: A grammar G is conjunctive if all structural expressions in G are sequence expressions, i.e., $\forall e \rightarrow r_e$ in G, $r_e$ is made of elements/expressions connected via the And operator.

Definition 15 XML Grammar Disjunctive Normal Form (DNF): The disjunctive normal form (DNF) of an XML grammar G is the set of conjunctive grammars, $DNF(G) = \{C\}_G$, which are equivalent in their expressiveness to G, taking into account the alternative structural expressions in G, i.e., $\exists\, r_e \in G$ such that $e \rightarrow r_e$, where $r_e$ is made of elements/expressions connected via the Or operator (cf. Fig. 6).

The disjunctive normal form of an XML grammar verifies by definition the ISP property, such that the grammar's expressiveness (language) is distributed among its constituent conjunctive grammars, $L(G) = \bigcup_i L(C_i)$, denoted as $G \equiv DNF(G)$. In other words, given an XML grammar G, and its representation in disjunctive normal form $DNF(G) = \{C\}_G$, any XML document tree D that conforms to G will conform to at least one of its conjunctive grammars, i.e., $G \models D \Rightarrow \exists\, C \in DNF(G)$ such that $C \models D$ (the proof carries directly from the definition of DNF).

Fig. 6. Representing an XML grammar in its disjunctive normal form.
a. Sample grammar G:
    <!ELEMENT Root (a, (b | c))>
    <!ELEMENT b (d | e)>
b. Disjunctive normal form of G, DNF(G) = {C_I, C_II, C_III}:
    Conjunctive grammar C_I:    <!ELEMENT Root (a, b)>  <!ELEMENT b (d)>
    Conjunctive grammar C_II:   <!ELEMENT Root (a, b)>  <!ELEMENT b (e)>
    Conjunctive grammar C_III:  <!ELEMENT Root (a, c)>

Note that the number of conjunctive grammars resulting from the DNF of an input XML grammar depends on the number and configurations of Or operators in the input grammar expressions.
This may generate a proliferation of conjunctive grammars, depending on the expressiveness of the grammar declarations. In this context, we have conducted a mathematical analysis covering some of the most common configurations of alternative (Or) expressions, based on surveys of real DTDs and XSDs in [, 3, 38] (some statistics are provided in Section 5.6). Results show that the most common alternative expressions generate a number of conjunctive grammars linear in the number of Or operators involved (e.g., in Fig. 6, |DNF(G)| = 1 + number of Or operators in G = 3), while only certain specific cases (of usually mixed And-Or expressions) yield polynomial and/or exponential sized DNF representations (mathematical details are provided in Appendix I). As a result, we model each resulting conjunctive grammar $C \in DNF(G)$ as a special rooted ordered labeled tree:

Definition 16 Conjunctive XML Grammar Tree: It is a rooted ordered labeled tree in which nodes represent XML elements/attributes, labeled following element/attribute tag names, and such that each node is assigned the corresponding element/attribute (MinOccurs/MaxOccurs) cardinality operator. Element nodes are ordered following their order of appearance in the XML grammar declarations. Attribute nodes appear as children of their encompassing element nodes, sorted left-to-right by attribute name, and appearing before all sub-element siblings (similarly to XML document trees).

Formally, we model a conjunctive XML grammar tree as $C = (N_C, E_C, L_C, CC_C, g_C)$, where:
- $N_C$ is the set of nodes (i.e., vertices) in C,
- $E_C \subseteq N_C \times N_C$ is the set of edges underlining the XML element/attribute containment relation,
- $L_C$ is the set of labels corresponding to the nodes of C,
- $CC_C$ is the set of cardinality constraints associated to the nodes of C,
- $g_C$ is a function $g_C : N_C \rightarrow L_C \times CC_C$ that associates a label $l \in L_C$ and a cardinality constraint $cc \in CC_C$ to each node $n \in N_C$, following element/attribute ordering as described above.
We denote by C[i] the i-th node of C in preorder traversal, represented as a doublet (l, cc) where $l \in L_C$ and $cc \in CC_C$ are respectively its label and cardinality constraint, referred to as C[i].l and C[i].cc (since cardinality constraints amount to MinOccurs and MaxOccurs, we refer to the latter as C[i].MinOccurs and C[i].MaxOccurs, cf. Fig. 7.b). C[i].d represents the node's depth in the tree, and C[i].Deg its out-degree. R(C) = C[0] designates the root of tree C.

Recall that while the order of attribute children is irrelevant in XML [], we nonetheless represent the latter as ordered tree nodes in both our document and grammar tree models (as described above) in order to reduce the complexity of the similarity computation process, without affecting the accuracy of the similarity results (since attribute nodes, in both XML document and grammar trees, are ordered in the same way). For instance, most unordered tree distance algorithms are of exponential time, compared to average polynomial ordered tree methods []. The algorithm for transforming an XML grammar into its tree representation is provided in Appendix II.
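The DNF unfolding of Section 3.4 can be sketched in Python as follows. This is a minimal illustration, assuming a grammar whose Or operators have been inlined into one nested expression and ignoring cardinality constraints altogether; the tuple encoding (`"elem"`, `"and"`, `"or"`) and the function names `expand` and `dnf` are hypothetical and are not taken from the paper's Appendix II algorithm.

```python
from itertools import product


def expand(expr):
    """Return every conjunctive alternative of expr as a list of trees.

    expr is ("elem", label, content_or_None), ("and", [exprs]) or
    ("or", [exprs]); a tree is (label, [child trees]).
    """
    kind = expr[0]
    if kind == "elem":
        _, label, content = expr
        if content is None:
            return [[(label, [])]]
        # One copy of the element per alternative of its content model.
        return [[(label, children)] for children in expand(content)]
    if kind == "and":
        parts = [expand(e) for e in expr[1]]
        # Sequence: cross product of the operands' alternatives.
        return [sum(combo, []) for combo in product(*parts)]
    if kind == "or":
        # Choice: union of the operands' alternatives.
        return [alt for e in expr[1] for alt in expand(e)]
    raise ValueError("unknown expression kind: %r" % (kind,))


def dnf(root):
    """Conjunctive grammar trees of a grammar rooted at a single element."""
    return [alternative[0] for alternative in expand(root)]


# Grammar of Fig. 6: Root (a, (b | c)), b (d | e)
FIG6 = ("elem", "Root", ("and", [
    ("elem", "a", None),
    ("or", [
        ("elem", "b", ("or", [("elem", "d", None), ("elem", "e", None)])),
        ("elem", "c", None),
    ]),
]))

if __name__ == "__main__":
    for tree in dnf(FIG6):
        print(tree)
```

The And case takes a cross product, which is where the polynomial or exponential growth mentioned above comes from when And and Or operators are mixed; nested Or operators alone only add alternatives.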

3.5. Running Example: Sample XML Grammar and Corresponding Tree Representation

Consider XSD grammar Paper.xsd in Fig. 7.a, to be compared with the sample XML document tree D introduced earlier.

Fig. 7. Sample conjunctive grammar tree representations.
a. Sample XML grammar Paper.xsd, designated as G (cardinality values were lost in extraction and are left blank):
    <element name="Paper">
      <sequence>
        <choice>
          <element name="Author" minOccurs= maxOccurs= >
            <sequence>
              <element name="FirstName" type="String"/>
              <element name="MidName" minOccurs= type="String"/>
              <element name="LastName" type="String"/>
            </sequence>
          </element>
          <element name="Publisher">
            <sequence minOccurs= >
              <element name="FirstName" type="String"/>
              <element name="MidName" minOccurs= type="String"/>
              <element name="LastName" type="String"/>
            </sequence>
          </element>
        </choice>
        <element name="Version" type="Decimal"/>
        <element name="Length" minOccurs= type="Decimal"/>
        <element name="url" minOccurs= >
          <sequence>
            <element name="Paper" type="String"/>
            <element name="Download" maxOccurs= />
            <element ref="url" minOccurs= />
          </sequence>
        </element>
      </sequence>
      <attribute name="Title" type="String"/>
      <attribute name="Category" use="Implied" type="String"/>
    </element>
b. Graphical representation(s) of conjunctive grammar tree node n, showing C[i].l together with C[i].cc, i.e., C[i].MinOcc = x ∧ C[i].MaxOcc = y. Default MinOccurs = 1 and MaxOccurs = 1 values represent a null cardinality constraint, and can be omitted.
c. Conjunctive grammar trees corresponding to the disjunctive normal form of grammar Paper.xsd, designated as DNF(G) = {C_I, C_II, C_III}.

The DNF form of Paper.xsd, unfolded in a set of conjunctive trees, is shown in Fig. 7.c. Grammar Paper.xsd is first run through our transformation rules (cf. Fig. 8). Then, the resulting flattened grammar, encompassing Or operators, is represented as three conjunctive grammar trees (Fig. 7.b and c) underlining the three structural configurations that can be obtained following the different combinations of the Or operators.
Fig. 8. Flattening sequence and recursive declarations in grammar Paper.xsd of Fig. 7.a.
a. Flattening a sequence expression in Paper.xsd of Fig. 7.a. Original declaration: element Publisher containing an optional sequence (FirstName, MidName, LastName). Flattened declaration (transformation rule for optional expressions): element Publisher containing a choice between the sequence (FirstName, MidName, LastName) and the empty sequence.
b. Simplifying recursive node declaration url in grammar Paper.xsd of Fig. 7.a, w.r.t. the XML document tree D. Original recursive declaration: element url containing the sequence (Homepage, Download, reference to url). Flattened declaration (transformation Rule 4): the recursive nesting of url is re-inserted a second time, the innermost url containing only (Homepage, Download).

Elsevier Information Sciences Journal

Note that in addition to handling more expressive XSD constraints (MinOccurs and MaxOccurs), as well as alternative expressions and recursive declarations, our XML grammar tree model also straightforwardly handles context-sensitive XSD declarations, where identically labeled elements can have multiple definitions in different contexts in the grammar (as opposed to context-free DTD declarations). For instance, grammar Paper.xsd contains two elements having the same label Paper: i) the grammar root node element, and ii) the first child of element url. Hence, with XML documents and grammars represented as rooted ordered labeled trees, the problem of XML document/grammar structural comparison now comes down to comparing the corresponding trees.

4. XML Document and Grammar Tree Comparison

As mentioned previously, our approach consists of two main phases: i) Tree Representation of documents and grammars as rooted ordered labeled trees (described in the previous section), and ii) Tree Edit Distance Comparison for computing the similarity between document and grammar tree structures. The overall algorithm is presented in Fig. 9. After transforming the XML document and XML grammar into their tree representations (Fig. 9, lines 1-2), the edit distance between the document tree and each conjunctive grammar tree is computed (lines 3-10).

Definition 17 Tree Edit Distance (TED): The edit distance between two trees A and B is defined as the minimum cost of all edit scripts that transform A to B, $TED(A, B, Cost_{Op}) = Min\{Cost_{ES}\}$, noted as TED(A, B) [, 8].

Hence, the problem of comparing two trees A and B, i.e., evaluating the structural similarity between A and B, is viewed as the problem of computing the corresponding tree edit distance, i.e., the minimum cost edit script [8].
In this context, the notion of edit distance can be adapted to our study as follows:

Definition 18 TED XDoc_XGram: Given an XML document tree D, a conjunctive grammar tree C, as well as the corresponding tree edit operations costs, denoted Cost InsTree/DelTree, and based on the traditional definition of tree edit distance, we define TED XDoc_XGram(D, C, Cost InsTree/DelTree), noted simply as TED XDoc_XGram(D, C), as the minimum cost of all edit scripts transforming D into a document tree D' which is valid w.r.t. C, i.e., $C \models D'$.

Algorithm XDoc_XGram_Comparison
Input:  D     // XML document
        G     // XML grammar
        {R}   // Set of transformation rules (cf. Tables 1 and 2)
Output: Sim(D, G)   // Structural similarity value between D and G, in [0, 1]
Begin
    Begin Tree Representation phase
 1      D Tree = XDoc_to_Tree(D)                  // Document tree representation
 2      G Tree Set = XGram_to_Tree(G, D, {R})     // Grammar tree representation
    End Tree Representation phase
    Begin Tree Edit Distance phase
 3      Dist[] = new [|G Tree Set|]               // G Tree Set ≡ {C}_G, number of conjunctive
 4                                                // grammar trees representing G
 5      Multi-thread (i = 1; i ≤ |G Tree Set|; i++)   // Tree edit distance multi-threading
 6      {
 7          {Cost InsTree/DelTree} = TOC XDoc(D Tree) ∪ TOC XGram(G Tree Set[i])             // Tree operations costs
 8          Dist[i] = TED XDoc_XGram(D Tree, G Tree Set[i], {Cost InsTree/DelTree})          // Tree edit distance
 9      }
10      Return Sim XDoc_XGram(D, G) = Max over i ≤ |G Tree Set| of 1 / (1 + Dist[i])         // Structural similarity
    End Tree Edit Distance phase
End
Fig. 9. Pseudo-code of the overall XML document/grammar comparison algorithm.

The comparison is undertaken using concurrent computing, i.e., multi-threading, evaluating the similarity between the document tree and each of the conjunctive grammar trees simultaneously (lines 5-9), since they constitute a forest of separate tree structures (corresponding to the input grammar) following our grammar tree model, neither interfering with nor relying on each other's results.
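The final step of the comparison algorithm, which turns per-tree edit distances into one similarity score, can be sketched as below. The sketch assumes the similarity is the maximum of 1 / (1 + Dist[i]) over all conjunctive grammar trees, and uses Python's standard thread pool as a stand-in for the multi-threading step; the function name `similarity` and the pluggable `distance` callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def similarity(document_tree, grammar_trees, distance):
    """Structural similarity in (0, 1] between a document and a grammar.

    grammar_trees -- the conjunctive grammar trees of DNF(G), non-empty
    distance      -- callable(document_tree, grammar_tree) -> edit distance >= 0
    """
    # Each conjunctive tree is compared independently of the others.
    with ThreadPoolExecutor() as pool:
        dists = list(pool.map(lambda c: distance(document_tree, c), grammar_trees))
    # Maximum similarity corresponds to minimum distance.
    return max(1.0 / (1.0 + d) for d in dists)


if __name__ == "__main__":
    # Stub distance for demonstration: each "grammar tree" is its own distance.
    print(similarity(None, [3, 1, 4], lambda d, c: c))
```

A distance of 0 to any conjunctive tree gives similarity 1, which corresponds to exact validation; any positive minimum distance gives a value strictly below 1.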
Tree edit operations costs are computed (via algorithms TOC XDoc and TOC XGram, line 7, mentioned in the following section), and are consequently provided as input to the main tree edit distance algorithm (TED XDoc_XGram, line 8). Then, the maximum similarity (minimum distance) between the document tree and each conjunctive grammar tree is evaluated as the overall document/grammar structural similarity value (line 10).

An atomic edit operation on a tree is either the insertion (addition) of an inner node, the insertion of a leaf node, the deletion (removal) of an inner node, the deletion of a leaf node, or the replacement (i.e., update) of a node by another one. A complex tree edit operation is a sequence of atomic tree edit operations, treated as one single operation, such as the insertion/deletion of a whole sub-tree, or the relocation (moving) of a sub-tree from one position into another in its containing tree. A sequence of edit operations, called an edit script, $ES = \langle op_1, \ldots, op_k \rangle$, can be applied on a tree T, producing a resulting tree T' by applying the edit operations $op_1, \ldots, op_k$ in ES to T, following their order of appearance in the script. By associating costs, $Cost_{Op}$, with edit operations, the cost $Cost_{ES}$ of an edit script is defined as the sum of the costs of its component operations [, 8]:
$$Cost_{ES} = \sum_{i=1}^{|ES|} Cost_{Op_i}$$
Algorithms TOC XGram and TED XDoc_XGram are developed in the following sections. Algorithm TOC XDoc is provided in [73].

One can realize, based on the definition of TED XDoc_XGram, that our approach allows both:
- Exact document/grammar structure validation (Definition 5), where $TED_{XDoc\_XGram}(D, C) = 0 \Leftrightarrow C \models D$,
- Approximate document/grammar validation (cf. Definition 6), where $TED_{XDoc\_XGram}(D, C) > 0 \Leftrightarrow C \approx D$, such that $\approx$ designates a similarity value which is inversely proportional to TED XDoc_XGram(D, C): the lesser the similarity, the larger the distance TED XDoc_XGram(D, C), i.e., the larger the edit script cost needed to transform D into a document tree D' such that $C \models D'$.

As for the method to compute TED XDoc_XGram(D, C), we build on a dynamic programming formulation similar to a central tree edit distance algorithm by Nierman and Jagadish in [48], mainly in terms of the edit operations utilized. However, we introduce novel recurrences specifically designed to handle XML grammar cardinality constraints (namely MinOccurs and MaxOccurs). In the remainder, we first discuss tree edit operations costs in Section 4.1, and subsequently develop the main algorithm and similarity measure in Sections 4.2 and 4.3. Section 4.4 presents computation examples.
Time and space complexity analyses are also provided.

4.1. Tree Edit Operations Costs: TOC XDoc & TOC XGram

Our tree edit distance algorithm employs five edit operations: i) leaf node insertion, ii) leaf node deletion, iii) node update, iv) tree insertion and v) tree deletion, adopted from [8, 48] (formal definitions are provided in Appendix III). However, a central issue in most edit distance approaches is how to determine edit operations cost values, in order to consequently determine the edit distance value (i.e., the minimum cost of all possible edit scripts). An intuitive way would be to assign identical unit costs to single node operations:
$$Cost_{Ins}(x) = Cost_{Del}(x) = 1; \qquad Cost_{Upd}(x, y) = Cost_{Upd}(x.l, y.l) = \begin{cases} 1 & \text{when } x.l \neq y.l \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
where $Cost_{Upd} = 0$ underlines that no changes are to be made to the label of node x.

As for tree deletion (insertion) operations, they can be naturally evaluated as the sum of the costs of deleting (inserting) all individual nodes in the considered sub-tree [6], such that:
$$Cost_{DelTree/InsTree}(T) = \sum_{\text{all nodes } x_i \in T} Cost_{Del/Ins}(x_i) \qquad (2)$$

Following our approach, computing TED XDoc_XGram(D, C) comes down to transforming document tree D into D' to obtain $C \models D'$. To do so, node/tree deletion operations will be applied on the document tree D to remove those nodes which do not conform to the grammar, whereas node/tree insertion operations will add grammar nodes to the document tree D in order to obtain $C \models D'$. Yet, recall that nodes in grammar trees are associated with cardinality constraints, MinOccurs and MaxOccurs, specifying the allowed number of occurrences corresponding to (the sub-tree rooted at) a given node. Hence, grammar tree insertion operations costs are updated accordingly in order to evaluate TED XDoc_XGram:

Case 1 - Optional Grammar Nodes: An optional grammar tree node $x_i \in C$ such that $x_i.MinOccurs = 0$ ($\forall\, x_i.MaxOccurs$), along with its sub-tree $C_i$ (i.e., the sub-tree rooted at $x_i = R(C_i)$), does not affect the costs of tree insertion operations applied on C.
In other words, node/sub-tree $x_i$/$C_i$ does not have to be inserted in the document tree D to obtain $C \models D'$, and hence should not affect the edit distance cost.

Case 2 - Mandatory Grammar Nodes: A mandatory and/or repeatable grammar tree node $x_i \in C$ such that $x_i.MinOccurs \geq 1$ ($\forall\, x_i.MaxOccurs$), along with its sub-tree $C_i$ (where $R(C_i) = x_i$), affects the costs of tree insertion operations applied on C, considering the minimum number of occurrences required for $x_i$ ($C_i$), i.e., $Min\{x_i.MinOccurs, x_i.MaxOccurs\} = x_i.MinOccurs$. In other words, when $x_i$/$C_i$ is mandatory/repeatable, then it should occur (or should be inserted) in the document tree D a minimum number of times ($x_i.MinOccurs$) necessary to obtain $C \models D'$, thus affecting tree insertion operations costs accordingly.

Other complex operations such as sub-tree copying and gluing have been considered [9]. These are similar to tree insertions/deletions respectively, but are defined in the context of unordered tree comparison. Thus, they won't be further investigated hereunder.

Formally, given a conjunctive XML grammar tree C (with root node R(C) of degree k), and its first-level sub-trees $C_1, \ldots, C_k$ (i.e., the sub-trees rooted at the children nodes of R(C)), we compute the corresponding tree operations costs as:

$$Cost_{InsTree}(C) = Cost_{Ins}(R(C)) + \sum_{\text{all first-level sub-trees } C_i \text{ of } C} Cost_{InsTree}(C_i) \times R(C_i).MinOccurs \qquad (3)$$
where $R(C_i).MinOccurs$ underlines the MinOccurs constraint associated to the root of $C_i$.

The algorithm for computing insert tree operations costs is provided in Fig. 10. Here, we only develop XML grammar tree processing, TOC XGram, and omit the pseudo-code for XML document tree processing, TOC XDoc (computing tree deletion operations costs), since the latter is straightforward following Formula (2). Algorithm TOC XGram goes through all sub-trees of the conjunctive XML grammar tree, computing grammar sub-tree insertion operations costs following Formula (3) (Fig. 10, lines 9-11), taking into account the corresponding sub-tree root node MinOccurs.

Algorithm TOC XGram
Input:  C                      // XML conjunctive grammar tree
Output: {Cost InsTree}_C       // Tree insertion operations costs, for all sub-trees in C
Begin
 1  M = Degree(C)                            // The number of first-level sub-trees in conjunctive grammar tree C
 2  Cost InsTree(C) = Cost Ins(R(C))         // Initializing grammar sub-tree costs
                                             // with the cost of the corresponding sub-tree root node
 3  If (M = 0)
 4  {
 5      Return Cost Ins(R(C))                // Leaf node operations are assigned unit costs
 6  }                                        // in our approach (basic cost model)
 7  Else
 8  {
 9      For (i = 1; i ≤ M; i++)              // Going through the first-level sub-trees of C, C_i / i = 1...M
10      {
11          Cost InsTree(C) = Cost InsTree(C) + (TOC XGram(C_i) × R(C_i).MinOccurs)
12      }
13  }
14  Return {Cost InsTree}_C                  // Tree insertion operations costs
End
Fig. 10. Algorithm TOC XGram for computing XML grammar sub-tree operations costs.

Fig. 11. Sample conjunctive grammar tree C (extracted from the first grammar tree in Fig. 7.c). Root node Paper has two first-level sub-trees: C1, rooted at Author (x1) with children FirstName (x2), MiddleName (x3) and LastName (x4), and C2, rooted at url (x5) with children Homepage (x6) and Download (x7).

Consider sample grammar tree C in Fig. 11.
Tree insertion operations costs following TOC XGram are computed as:
$$Cost_{InsTree}(C_1) = Cost_{Ins}(R(C_1)) + [\,Cost_{Ins}(x_2) + 0 \times Cost_{Ins}(x_3) + Cost_{Ins}(x_4)\,] = 3$$
$$Cost_{InsTree}(C_2) = Cost_{Ins}(R(C_2)) + [\,Cost_{Ins}(x_6) + Cost_{Ins}(x_7)\,] = 3$$
$$Cost_{InsTree}(C) = Cost_{Ins}(R(C)) + [\,2 \times Cost_{InsTree}(C_1) + 0 \times Cost_{InsTree}(C_2)\,] = 7$$
Here, the cost of inserting sub-tree C1, rooted at the node of label Author (x1), is equal to 3. This is because node x3.MinOcc = 0, which means that the occurrence of node MiddleName is not required (in the transformed document tree for it to conform to the grammar). In turn, the cost of inserting tree C (as a whole) is equal to 7, since its first-level sub-tree C1 is required to appear a minimum number of 2 times in the transformed document tree (R(C1).MinOccurs = x1.MinOccurs = 2, yielding 2 × Cost InsTree(C1)), whereas sub-tree C2 is optional (yielding 0 × Cost InsTree(C2)). Note that both MinOccurs and MaxOccurs constraints are used in our main tree edit distance algorithm, to verify whether the minimum/maximum allowed numbers of occurrences for a given grammar node (sub-tree) are violated in the document tree being compared, so as to allocate tree edit operations accordingly (described in the following section).

In this study, we restrict our presentation to the basic cost schemes above, since we focus on the structural properties of XML documents and grammars (i.e., considering parent/child relationships and ordering among XML elements, identified by their labels). The investigation of alternative tree operations cost models (considering for instance the semantic relatedness between document and grammar node labels given a semantic reference such as WordNet [45], Wikipedia [84], or Google [37]) will be addressed in a dedicated upcoming study.

4.2. Tree Edit Distance (TED) Algorithm: TED XDoc_XGram

As briefly mentioned previously, we propose a novel tree edit distance method to consider the structural properties of XML document trees and conjunctive grammar trees (inspired by existing tree edit distance proposals, namely [6, 48]).
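The grammar sub-tree insertion cost of Formula (3), together with the worked example of Section 4.1, can be reproduced with a short recursive sketch. It assumes the basic unit cost model and the cardinalities implied by the example (Author with MinOccurs = 2, MiddleName and url with MinOccurs = 0); the `GNode` class and the `insertion_cost` function are hypothetical names, not the authors' code.

```python
class GNode:
    """A conjunctive grammar tree node carrying its MinOccurs constraint."""

    def __init__(self, label, min_occurs=1, children=()):
        self.label = label
        self.min_occurs = min_occurs
        self.children = list(children)


def insertion_cost(node, unit_cost=1):
    """Cost of inserting the sub-tree rooted at node, following Formula (3).

    Each child sub-tree is weighted by its own root's MinOccurs, so optional
    children (MinOccurs = 0) contribute nothing.
    """
    return unit_cost + sum(
        insertion_cost(child, unit_cost) * child.min_occurs
        for child in node.children
    )


# Sample grammar tree of the worked example.
author = GNode("Author", 2, [
    GNode("FirstName"), GNode("MiddleName", 0), GNode("LastName"),
])
url = GNode("url", 0, [GNode("Homepage"), GNode("Download")])
paper = GNode("Paper", 1, [author, url])

if __name__ == "__main__":
    print(insertion_cost(author), insertion_cost(url), insertion_cost(paper))
```

Note that a node's own MinOccurs is applied by its parent, not by the node itself, which is why the cost of the url sub-tree is 3 even though it contributes 0 to the cost of Paper.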
Hereunder, we first describe the overall process of our main algorithm. Then, we present the Traditional TED Recurrence formulation as the basic foundation of our algorithm, and introduce our Extended TED Recurrence formulation taking into account the MinOccurs and MaxOccurs constraints. We then develop computation examples.

4.2.1. Main Algorithm

The overall algorithm TED XDoc_XGram for computing the edit distance between an XML document tree D and a conjunctive grammar tree C is shown in Fig. 12. It builds on an Extended TED Recurrence to identify the minimum cost edit script (i.e., the minimum distance, thus maximum similarity) transforming D into D' to obtain $C \models D'$. In short, algorithm TED XDoc_XGram recursively goes through the sub-trees of both XML document and XML grammar tree structures, combining node update, tree insertion and tree deletion operations so as to identify the sequence of operations (edit script) of minimal cost. The insertion/deletion of single nodes is undertaken via tree insertion and tree deletion operations applied on leaf node sub-trees. In other words, leaf node insertion/deletion operations do not contribute directly to the edit distance algorithm, but are utilized in computing tree insertion and tree deletion operations costs (cf. TOC XGram in Fig. 10).

First, the update operation is applied to the roots of the sub-trees being compared (relabeling sub-tree root nodes, Fig. 12, line 6). Then, tree deletion operations are applied to the corresponding document first-level sub-trees (line 7), and tree insertion operations are applied on grammar first-level sub-trees taking into account the MinOccurs constraint (line 8), in order to consider the minimum number of occurrences required in the document tree so as to conform to the grammar tree (as discussed with sub-tree operations costs in Section 4.1). Consequently, the edit distance process TED XDoc_XGram is recursively called once for each pair of sub-trees $D_i$ and $C_j$ occurring at the same structural level (depth) in the document and grammar trees being compared. This is undertaken following our Extended TED Recurrence formula (lines 10-24), described in detail in the following section.
The minimum distance between all sub-trees (first-level, second-level, and so on) of the document tree D and grammar tree C is finally returned (line 28). When the grammar tree is free of constraint operators (i.e., when all elements in C are associated with the default constraints MinOccurs = MaxOccurs = 1), our algorithm simplifies to a classical TED process (namely the algorithm in [48]).

Algorithm TED XDoc_XGram
Input:  D                          // XML document tree
        C                          // Conjunctive grammar tree
        {Cost InsTree/DelTree}     // Tree operations costs computed via TOC XDoc and TOC XGram
Output: TED XDoc_XGram(D, C)       // Edit distance between D and C
Begin
 1  M = Degree(D)                                  // The number of first-level sub-trees in D
 2  N = Degree(C)                                  // The number of first-level sub-trees in C
 3  Dist[ , ] = new [0...M, 0...N]
 4  NOcc[ ] = new [1...N]                          // Keeping track of the number of occurrences for each grammar sub-tree,
 5  NOcc[1...N] = 0                                // in order to handle the corresponding MinOccurs and MaxOccurs constraints
 6  Dist[0, 0] = Cost Upd(R(D).l, R(C).l)          // R(D).l and R(C).l are the labels of the roots of trees D and C
 7  For (i = 1; i ≤ M; i++) { Dist[i, 0] = Dist[i-1, 0] + Cost DelTree(D_i) }
 8  For (j = 1; j ≤ N; j++) { Dist[0, j] = Dist[0, j-1] + Cost InsTree(C_j) × R(C_j).MinOccurs }
 9  For (j = 1; j ≤ N; j++)
10  {
11      For (i = 1; i ≤ M; i++)
12      {   NOcc[j]++
13          α = Dist[i-1, j] + Cost DelTree(D_i)                           // Considering sub-tree deletion costs
14          β = Dist[i, j-1] + Cost InsTree(C_j) × R(C_j).MinOccurs        // Considering sub-tree insertion costs
15          If (NOcc[j] < R(C_j).MaxOccurs)
16          {
17              If (NOcc[j] < R(C_j).MinOccurs)                            // Considering the MinOccurs constraint
18              {   γ = Dist[i - NOcc[j], j-1]
                        + Σ_{n=0}^{NOcc[j]-1} TED XDoc_XGram(D_{i-n}, C_j, {Cost InsTree/DelTree})
                        + Cost InsTree(C_j) × (R(C_j).MinOccurs - NOcc[j]) }
19              Else                                                       // Considering the MaxOccurs constraint
20              {   γ = Dist[i - NOcc[j], j-1]
                        + Σ_{n=0}^{NOcc[j]-1} TED XDoc_XGram(D_{i-n}, C_j, {Cost InsTree/DelTree}) }
21          }
22          Else
23          {   γ = Dist[i-1, j-1] + TED XDoc_XGram(D_i, C_j, {Cost InsTree/DelTree}) }     // Classic edit distance formula
24          Dist[i, j] = min{α, β, γ}                                      // Identifying the minimum distance
25          If (Dist[i, j] = α ∨ Dist[i, j] = β) { NOcc[j] = 0 }           // Updating the NOcc value corresponding to sub-tree C_j
26      } // End For i
27  } // End For j
28  Return Dist[M, N]                              // Edit distance value
End
Fig. 12. Algorithm TED XDoc_XGram for comparing an XML document tree and a conjunctive grammar tree.
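The special case mentioned above, where every grammar node carries the default constraints MinOccurs = MaxOccurs = 1, reduces to the classic three-way recurrence over first-level sub-trees. The Python sketch below illustrates only that special case under unit costs; it is not the extended algorithm of Fig. 12 and it omits any restriction on which sub-trees may be inserted or deleted. Trees are encoded as `(label, [children])` tuples, and the names `tree_size` and `classic_ted` are hypothetical.

```python
def tree_size(tree):
    """Number of nodes, i.e. the unit cost of inserting or deleting the tree."""
    return 1 + sum(tree_size(child) for child in tree[1])


def classic_ted(doc, gram):
    """Edit distance between ordered labeled trees under unit costs."""
    d_kids, g_kids = doc[1], gram[1]
    m, n = len(d_kids), len(g_kids)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    # Root update: free when labels match, unit cost otherwise.
    dist[0][0] = 0 if doc[0] == gram[0] else 1
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + tree_size(d_kids[i - 1])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + tree_size(g_kids[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            alpha = dist[i - 1][j] + tree_size(d_kids[i - 1])   # delete D_i
            beta = dist[i][j - 1] + tree_size(g_kids[j - 1])    # insert C_j
            gamma = dist[i - 1][j - 1] + classic_ted(d_kids[i - 1], g_kids[j - 1])
            dist[i][j] = min(alpha, beta, gamma)
    return dist[m][n]


if __name__ == "__main__":
    document = ("a", [("b", []), ("c", [])])
    grammar = ("a", [("b", [])])
    print(classic_ted(document, grammar))
```

The extended recurrence differs from this sketch in lines 14 to 25 of Fig. 12, where insertion costs are weighted by MinOccurs and consecutive document sub-trees may be matched against the same grammar sub-tree.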

TED Recurrences

TED_XDoc_XGram(D, C) computes, as sub-routines, the edit distances between pairs of first-level sub-trees D_i ∈ D and C_j ∈ C (i.e., the sub-trees rooted at the children nodes of R(D) and R(C) respectively), noted TED_XDoc_XGram(D_i, C_j). We use left-to-right numbering to identify first-level sub-tree order. We denote by Dist[i, j] the edit distance matrix keeping track of the edit distance between document tree D with only its first i first-level sub-trees, identified as partial document tree D<i>, and grammar tree C with only its first j first-level sub-trees, identified as partial grammar tree C<j>. Hence, the traditional tree edit recurrence adopted from existing approaches [6, 48] can be represented as follows.

Traditional TED Recurrence. Consider the pair of first-level sub-trees D_i ∈ D and C_j ∈ C. Then:

$$
Dist[i, j] = \min
\begin{cases}
\alpha = Dist[i-1, j] + Cost_{DelTree}(D_i) \\
\beta = Dist[i, j-1] + Cost_{InsTree}(C_j) \\
\gamma = Dist[i-1, j-1] + TED_{XDoc\_XGram}(D_i, C_j)
\end{cases}
$$

Proof. Our goal is to find the minimum cost script transforming D<i> into a partial document tree D'<i> that is valid w.r.t. C<j>. This can be computed in three ways:
- Having Dist[i-1, j], we spend α = Dist[i-1, j] + Cost_DelTree(D_i), deleting D_i from D<i>.
- Having Dist[i, j-1], we spend β = Dist[i, j-1] + Cost_InsTree(C_j), inserting one occurrence of C_j into D<i>.
- Having Dist[i-1, j-1], we spend γ = Dist[i-1, j-1] + TED_XDoc_XGram(D_i, C_j), transforming D_i into D'_i such that D'_i is valid w.r.t. C_j.
Since these three cases express all possible tree edit paths yielding Dist[i, j], we keep the minimum of these costs.

Proof description. The Traditional TED Recurrence carries over from Nierman & Jagadish's approach [48]. The minimum cost script transforming partial tree D<i> into D'<i>, so that D'<i> is valid w.r.t. C<j>, can be computed in three ways:
- Having Dist[i-1, j] and the cost of deleting sub-tree D_i, we need to spend at least Dist[i-1, j] + Cost_DelTree(D_i) to transform partial document tree D<i> into D'<i>.
- Having Dist[i, j-1] and the cost of inserting sub-tree C_j, we need to spend at least Dist[i, j-1] + Cost_InsTree(C_j) to transform D<i> into D'<i>.
- Having Dist[i-1, j-1], we need to spend at least Dist[i-1, j-1] + TED_XDoc_XGram(D_i, C_j) to transform D<i> into D'<i>.
Since the three cases above express all possible tree edit paths following the set of edit operations considered in our approach, we keep in Dist[i, j] the minimum of the three costs α, β, and γ, i.e., the minimum attainable distance between D<i> and C<j>, allowing us to transform D<i> into a partial tree D'<i> that is valid w.r.t. C<j>.

TED_XDoc_XGram is recursively applied on all sub-trees in D and C (first-level, second-level, and so on, following [48]), hence identifying the minimum cost script transforming D into a document tree D' that is valid w.r.t. C.

In short, the Traditional TED Recurrence underlines the most basic case where the conjunctive grammar tree C is virtually free of cardinality constraint operators (i.e., when all elements in C are associated with the default constraints MinOccurs = MaxOccurs = 1), such that all elements (sub-trees) are mandatory and should appear exactly once. Hence, we propose an Extended TED Recurrence to specifically consider the MinOccurs and MaxOccurs cardinality constraints when comparing document and grammar trees. To do so, we keep track of the number of occurrences NOcc of the document sub-trees corresponding to each grammar sub-tree in the conjunctive grammar at hand (which allows us to verify whether the corresponding MinOccurs/MaxOccurs constraint has been met or violated, and if so to what extent). Sub-tree occurrences in the document tree can be exact (conforming) or approximate (similar enough) w.r.t. the grammar sub-tree.

Fig. 3. Sample document and grammar trees: conjunctive grammar tree C, and document trees D, E and F. Document tree D is valid w.r.t. C, whereas document trees E and F are similar (but not valid) w.r.t. C.

Consider the example in Fig. 3, comprising conjunctive grammar tree C and document trees D, E, and F. Here, one can see that grammar sub-tree C_2 occurs three times in document tree D, where D_1, D_2 and D_3 are exact occurrences of C_2 (NOcc(C_2) = 3). Also, grammar sub-tree C_2 occurs once in document tree E, where E_2 is an exact occurrence of C_2. Yet, sub-tree C_2 occurs 4 times in document tree F, where F_2 and F_3 are exact occurrences of C_2 whereas F_4 and F_5 are approximate occurrences (NOcc(C_2) = 4).

Note that we can computationally decide whether document sub-tree D_i is an exact or an approximate occurrence of grammar sub-tree C_j based on the corresponding edit distance score TED_XDoc_XGram(D_i, C_j): TED_XDoc_XGram(D_i, C_j) = 0 is obtained when D_i is an exact occurrence of C_j, whereas TED_XDoc_XGram(D_i, C_j) > 0 is obtained when D_i is an approximate occurrence of C_j. Hence, based on the Traditional TED Recurrence and the notion of number of occurrences NOcc, we can effectively consider the MinOccurs and MaxOccurs constraints in our tree edit distance computations as follows.

Extended TED Recurrence (TED+). Consider the pair of first-level sub-trees D_i ∈ D and C_j ∈ C. Let NOcc[j] be a special counter keeping track of the number of occurrences (exact/approximate matches) of grammar sub-tree C_j in the document tree D. Then, considering R(C_j).MinOccurs and R(C_j).MaxOccurs:

$$
Dist[i, j] = \min
\begin{cases}
\alpha = Dist[i-1, j] + Cost_{DelTree}(D_i) \\
\beta = Dist[i, j-1] + Cost_{InsTree}(C_j) \times R(C_j).MinOccurs \\
\gamma =
\begin{cases}
\gamma_1 & \text{if } NOcc[j] < R(C_j).MinOccurs \quad \text{(Condition 1: considering the MinOccurs constraint)} \\
\gamma_2 & \text{else if } NOcc[j] \le R(C_j).MaxOccurs \quad \text{(Condition 2: considering the MaxOccurs constraint)} \\
\gamma_3 & \text{else} \quad \text{(Traditional TED computation)}
\end{cases}
\end{cases}
$$

where:

$$
\gamma_1 = Dist[i - NOcc[j], j-1] + \sum_{n=0}^{NOcc[j]-1} TED_{XDoc\_XGram}(D_{i-n}, C_j) + Cost_{InsTree}(C_j) \times \big(R(C_j).MinOccurs - NOcc[j]\big)
$$

$$
\gamma_2 = Dist[i - NOcc[j], j-1] + \sum_{n=0}^{NOcc[j]-1} TED_{XDoc\_XGram}(D_{i-n}, C_j)
$$

$$
\gamma_3 = Dist[i-1, j-1] + TED_{XDoc\_XGram}(D_i, C_j)
$$

Proof. Our goal is to find the minimum cost edit script transforming D<i> into a partial tree D'<i> that is valid w.r.t. C<j>, given that grammar sub-tree C_j ∈ C<j> is required to appear a minimum number of times (R(C_j).MinOccurs) and a maximum number of times (R(C_j).MaxOccurs) in D'<i>. This can be computed in three ways:
- Having Dist[i-1, j], we spend α = Dist[i-1, j] + Cost_DelTree(D_i), deleting D_i from D<i>.
- Having Dist[i, j-1], we spend β = Dist[i, j-1] + Cost_InsTree(C_j) × R(C_j).MinOccurs, inserting sub-tree C_j into D<i> as many times as required by the corresponding R(C_j).MinOccurs constraint.
- Having Dist[i-1, j-1], we need to account for three alternative cost factors:
  - Having NOcc[j] occurrences of C_j in D<i>, such that NOcc[j] ∈ [R(C_j).MinOccurs, R(C_j).MaxOccurs] (Condition 2), we compute the edit distance between all sub-trees in D<i> which match C_j, from D_{i-NOcc[j]+1} (first match) to D_i (last match):

$$
\gamma_2 = Dist[i - NOcc[j], j-1] + \sum_{n=0}^{NOcc[j]-1} TED_{XDoc\_XGram}(D_{i-n}, C_j)
$$

  - Having NOcc[j] < R(C_j).MinOccurs (Condition 1), we add to the costs of the NOcc[j] existing occurrences of C_j (i.e., γ2) the cost of the additional sub-tree occurrences required to occur in D<i> in order to fulfill C_j's MinOccurs constraint: γ1 = γ2 + Cost_InsTree(C_j) × (R(C_j).MinOccurs − NOcc[j]).
  - Otherwise, when NOcc[j] > R(C_j).MaxOccurs, we apply the Traditional TED Recurrence factor γ3, such that every additional C_j occurrence is treated as any regular sub-tree occurrence.
Since these three cases express all possible tree edit paths yielding Dist[i, j], we keep the minimum of these costs.

Proof description. Our goal is to find the minimum cost edit script transforming partial document tree D<i> into D'<i> so that D'<i> is valid w.r.t. C<j>, considering that grammar (element) sub-tree C_j ∈ C<j> is required to appear a minimum number of times (R(C_j).MinOccurs) and a maximum number of times (R(C_j).MaxOccurs) in the transformed partial document tree D'<i> for it to conform to C<j>. This can be computed in three ways:
- Having the value Dist[i-1, j] and the cost of deleting sub-tree D_i, we need to spend at least Dist[i-1, j] + Cost_DelTree(D_i) to transform the partial document tree D<i> into D'<i>. This carries over from the Traditional TED Recurrence.
- Having the value Dist[i, j-1] and the cost of inserting sub-tree C_j, we need to spend at least Dist[i, j-1] + Cost_InsTree(C_j) × R(C_j).MinOccurs to transform D<i> into D'<i>. This requires inserting sub-tree C_j into D<i> as many times as required by the corresponding R(C_j).MinOccurs constraint.

- Having the value Dist[i-1, j-1], we need to account for two cost factors: the costs of i) existing sub-tree occurrences of C_j (sub-trees matching C_j which already appear in the partial document tree), and ii) remaining sub-tree occurrences of C_j (sub-trees which are still required to appear, and thus need to be inserted in the partial document tree) to fulfill C_j's MinOccurs and MaxOccurs constraints.

Existing sub-tree occurrences cost: It is applied on its own when NOcc[j] ∈ [R(C_j).MinOccurs, R(C_j).MaxOccurs], i.e., when the number of sub-trees matching C_j in the partial document tree remains within the margin of C_j's constraints (cf. γ2). Having NOcc[j] existing occurrences (exact/approximate match candidates) of sub-tree C_j in partial document tree D<i>, the similarity score between each of the match candidates on one hand, and C_j on the other hand, needs to be computed in order to identify the overall cost of these existing sub-tree occurrences. To do so, we start from Dist[i−NOcc[j], j−1], the distance value at the last position (in the edit distance table) preceding the first document sub-tree match candidate with C_j, and then spend at least

$$
\sum_{n=0}^{NOcc[j]-1} TED_{XDoc\_XGram}(D_{i-n}, C_j)
$$

covering the sum of the tree edit distance costs for comparing grammar sub-tree C_j with all consecutive first-level document sub-trees in D<i>, ranging from D_{i−NOcc[j]+1} (the first exact/approximate match candidate of C_j in the document tree) to D_i (the last exact/approximate match candidate of C_j in the document tree).

Example: Consider computing the edit distance between document tree F and grammar tree C from Fig. 3.
Evaluating the γ factor at Dist[4, 2], considering grammar sub-tree C_2 and having NOcc[2] = 3 (given that 3 possible candidate sub-trees matching C_2 have been identified: F_2, F_3, and F_4), yields: γ2 = Dist[1, 1] + TED_XDoc_XGram(F_2, C_2) + TED_XDoc_XGram(F_3, C_2) + TED_XDoc_XGram(F_4, C_2) = 0 + 0 + 0 + 1 = 1. This means that F_2 and F_3 are exact occurrences of C_2, such that their presence in partial tree F<3> does not entail any additional cost, whereas F_4 is an approximate occurrence of C_2 requiring a transformation of cost 1 (i.e., the insertion of node e in F_4) for partial document tree F<4> to become valid w.r.t. grammar tree C<2> (cf. graphical presentation in Fig. 4.c, and more detailed computation examples in the following section).

Remaining sub-tree occurrences cost: It is added when NOcc[j] < R(C_j).MinOccurs, i.e., when the number of sub-trees matching C_j in the partial document tree does not yet fulfill R(C_j).MinOccurs (cf. γ1). Here, in addition to the edit distance costs of the NOcc[j] existing occurrences (exact/approximate matches) of sub-tree C_j in the partial document tree D<i>, we need to account for the cost of sub-tree occurrences (corresponding to C_j) which have not yet been inserted in D<i> but which are required to occur in it to fulfill the MinOccurs constraint associated with C_j. This is mathematically concretized in the edit distance formula by adding Cost_InsTree(C_j) × (R(C_j).MinOccurs − NOcc[j]), i.e., the cost of inserting sub-tree C_j multiplied by the minimum number of occurrences needed, R(C_j).MinOccurs, minus the number of already existing occurrences NOcc[j] of exact/approximate matches of C_j in the partial document tree D<i>. The remaining sub-tree occurrences factor is (naturally) disregarded when R(C_j).MinOccurs = 0, such that no additional occurrences whatsoever are required in D<i>.
Otherwise, when the number of sub-trees matching C_j in the partial document tree surpasses C_j's (maximum) cardinality constraint, i.e., NOcc[j] > R(C_j).MaxOccurs, we simply apply the Traditional TED Recurrence factor (γ3), such that every additional C_j occurrence is treated as any regular sub-tree occurrence in the partial document tree.

Since the three cases above express all possible tree edit paths following the set of edit operations considered in our approach, we keep in Dist[i, j] the minimum of the three costs α, β, and γ, i.e., the minimum attainable distance between D<i> and C<j>, allowing us to transform D<i> into a partial tree D'<i> that is valid w.r.t. C<j>.

Integrating TED+ in the Main Algorithm

When utilized in our main TED_XDoc_XGram algorithm, the Extended TED Recurrence (TED+) is recursively called for every pair of sub-trees D_i and C_j in the document and grammar trees being compared. Here, the number of occurrences of document sub-trees evaluated as potential exact/approximate match candidates of grammar sub-tree C_j ∈ C, noted NOcc[j], is compared with the corresponding R(C_j).MinOccurs and R(C_j).MaxOccurs constraints in order to decide on the tree edit distance recurrence to execute (γ1, γ2, or γ3). Then, the minimum edit distance between partial document tree D<i> and partial grammar tree C<j>, highlighting the minimum cost script necessary to transform D<i> into a partial tree D'<i> that is valid w.r.t. C<j>, is kept in the distance matrix Dist[i, j].

Counter NOcc[j], keeping track of the number of occurrences of document sub-trees D_i matching each grammar sub-tree C_j, is incremented when processing every D_i initially considered as a potential new (exact/approximate) match candidate for C_j. Then, NOcc[j]'s new value is preserved whenever the edit distance cost Dist[D_i, C_j] = Min(α, β, γ) = γ, i.e., whenever the cost of the edit script leading to TED_XDoc_XGram(D_i, C_j) (applying the γ factor) is lesser than those of: i) deleting D_i (applying the α factor), and ii) inserting C_j (applying the β factor).
This means that the cheapest way of transforming partial tree D<i> into a partial tree that is valid w.r.t. C<j> is through computing TED_XDoc_XGram(D_i, C_j) (i.e., through the γ factor), rather than deleting D_i or inserting C_j (applying the α or β factors), which in turn means (following the logic of edit distance) that D_i and C_j match; in other words, the potential match candidate D_i is actually a confirmed match for C_j (either an exact match, when γ = 0, or an approximate match, when γ > 0). Otherwise, when the minimum distance Dist[D_i, C_j] = Min(α, β, γ) ≠ γ, NOcc[j] is reinitialized, hence ignoring D_i as a potential match for C_j.

At the end, the algorithm returns the minimum distance between all sub-trees (first-level, second-level, and so on) of the document tree D and grammar tree C, reflecting the minimum cost script necessary to transform D into a document tree D' that is valid w.r.t. C. The minimum distance value is then used to compute XML document/grammar similarity.

Similarity Measure

As indicated previously, we adopt the concept of similarity as the inverse of a distance function (a smaller distance value underlining a higher similarity degree). This minimal distance is computed using our TED_XDoc_XGram algorithm, such that our document/grammar similarity measure is defined as follows:

$$
Sim_{XDoc\_XGram}(D, C) = \frac{1}{1 + TED_{XDoc\_XGram}(D, C)} \in [0, 1] \qquad (4)
$$

When the XML grammar is represented as a set of conjunctive grammar trees G = {C}_G, the maximum similarity (i.e., minimum edit distance) between the XML document tree and the set of conjunctive grammar trees is retained:

$$
Sim_{XDoc\_XGram}(D, G) = \max_{C_i \in \{C\}_G} \frac{1}{1 + TED_{XDoc\_XGram}(D, C_i)} \in [0, 1] \qquad (5)
$$

Our similarity measure in formula (5) is consistent with the formal definition of similarity, as a (semi-)metric function satisfying (in part) the metric properties of Reflexivity, Minimality, Symmetricity and Triangular Inequality (cf. details in Appendix IV). Our measure is a semi-metric (and not a full metric) since: i) it does not allow comparing two grammars (i.e., Sim(G, G')), nor ii) using a grammar as the first parameter of the similarity measure (Sim(G, D) is not allowed, i.e., we cannot transform a grammar G into G' so that D becomes valid w.r.t. G'; we do it the other way round, transforming D into D' so that D' is valid w.r.t. G). Comparing/transforming grammars is out of the scope of this study.
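The similarity measure can be illustrated with a minimal Python sketch. The function names `similarity` and `grammar_similarity` are hypothetical (they are not part of the XS3 prototype), and the edit distances are assumed to have been computed beforehand by some tree edit distance routine.

```python
from typing import Iterable


def similarity(ted: float) -> float:
    """Map an edit distance in [0, inf) to a similarity in (0, 1] as 1 / (1 + TED)."""
    if ted < 0:
        raise ValueError("edit distance must be non-negative")
    return 1.0 / (1.0 + ted)


def grammar_similarity(teds: Iterable[float]) -> float:
    """Similarity w.r.t. a grammar given as several conjunctive grammar trees.

    `teds` holds one precomputed edit distance per conjunctive grammar tree
    (hypothetical input); the best (maximum) similarity is retained.
    """
    return max(similarity(t) for t in teds)
```

With this sketch, a distance of 0 maps to the maximal similarity 1 (exact validity), and a document at distances 4, 2 and 0 from three conjunctive grammar trees obtains an overall similarity of 1.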
Fig. 4. Sample Extended TED Recurrence (TED+) computations (to simplify, we note TED_XDoc_XGram(A, B) as TED(A, B)): a. computing Dist[1, 1] between E and C; b. computing Dist[2, 2] between E and C; c. computing Dist[4, 2] between F and C.
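The choice between the three γ variants of the Extended TED Recurrence can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `gamma_factor` and its parameters are hypothetical, the per-sub-tree edit distances are assumed to be already known, and Condition 2 is treated as the closed interval [MinOccurs, MaxOccurs], as suggested by the worked examples.

```python
from typing import Sequence


def gamma_factor(dist_before_matches: float,
                 dist_diagonal: float,
                 match_teds: Sequence[float],
                 ins_cost: float,
                 min_occurs: int,
                 max_occurs: int) -> float:
    """Return the gamma cost for one cell Dist[i, j] (hypothetical helper).

    dist_before_matches: Dist[i - NOcc[j], j - 1]
    dist_diagonal:       Dist[i - 1, j - 1]
    match_teds:          TED(D_k, C_j) for the NOcc[j] consecutive match
                         candidates, ordered from the first one up to D_i
    ins_cost:            Cost_InsTree(C_j)
    """
    nocc = len(match_teds)
    if nocc == 0:
        raise ValueError("at least the current sub-tree D_i must be a candidate")
    existing = dist_before_matches + sum(match_teds)
    if nocc < min_occurs:
        # gamma_1: existing occurrences plus the occurrences still to be inserted
        return existing + ins_cost * (min_occurs - nocc)
    if nocc <= max_occurs:
        # gamma_2: existing occurrences only (assumed closed upper bound)
        return existing
    # gamma_3: traditional recurrence, D_i handled as a regular sub-tree
    return dist_diagonal + match_teds[-1]
```

For instance, one exact candidate for a sub-tree with MinOccurs = 2, MaxOccurs = 3 and an insertion cost of 3 gives 0 + 0 + 3 × (2 − 1) = 3, whereas three candidates with distances 0, 0 and 1 fall within the constraint margin and give 1.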

4.4. Computation Examples

TED+ Computations

Consider the edit distance computations in Fig. 4. Fig. 4.a depicts the computation of Dist[1, 1] between partial document tree E<1> and partial grammar tree C<1>:
- Computing the α factor consists of computing the cost of deleting sub-tree E_1 (consisting of a single leaf node), i.e., cost = 1.
- Computing the β factor consists of inserting sub-tree C_1 (made of a single grammar node) with R(C_1).MinOccurs = 0, hence its cost = 0, indicating that C_1 is optional and is not required to appear in the partial document tree for it to be valid w.r.t. C<1>.
- Computing the γ factor consists of evaluating the edit distance between document sub-tree E_1, the (only existing) match candidate, and grammar sub-tree C_1. Since NOcc[1] = 1 ∈ [R(C_1).MinOccurs = 0, R(C_1).MaxOccurs = 1], γ2 is applied. This yields cost = 0, indicating that E_1 is an exact match of C_1.
Hence, Dist[1, 1] = Min(α, β, γ) = γ = 0, indicating that no changes need to be made to E<1> since it is already valid w.r.t. C<1>.

Similar examples in Fig. 4.b and c are discussed in detail in Appendix V. To summarize, the example in Fig. 4.b depicts the computation of Dist[2, 2] between partial document tree E<2> and C<2>, where Dist[2, 2] = Min(α, β, γ) = γ = 3, indicating that the minimum (cost) amount of change required to transform E<2> is to insert an additional occurrence of C_2 in E<2>, in order for it to become valid w.r.t. C<2>. The example in Fig. 4.c depicts the computation of Dist[4, 2] between partial document tree F<4> and partial grammar tree C<2>, where Dist[4, 2] = Min(α, β, γ) = γ = 1, indicating that the minimum (cost) amount of change required to transform F<4> is to insert node e under sub-tree F_4, in order for it to become valid w.r.t. C<2>.

Complete TED_XDoc_XGram Matrix Computations

Fig. 5 shows the complete edit distance matrixes (with all recurrences) when running our TED_XDoc_XGram algorithm to compare document trees D, E, F with grammar C of Fig. 3. For instance, the first line of the distance matrix in Fig. 5.a (likewise in Fig. 5.b and c), i.e., Dist[0][j], corresponds to the sum of the costs of inserting every node of the grammar tree C.
Likewise, the first column, Dist[i][0], underlines the sum of the costs of deleting every node of XML tree D. Consequently, the algorithm identifies the combination of tree insertion/deletion operations of minimum cost, following our Extended TED Recurrence, in populating the remainder of the matrix, such that TED_XDoc_XGram(D, C) = Dist[|D|][|C|] underlines the final distance value.

The matrix in Fig. 5.a shows the edit distance result when comparing document tree D to grammar tree C, yielding TED_XDoc_XGram(D, C) = 0, hence Sim_XDoc_XGram(D, C) = 1 / (1 + TED_XDoc_XGram(D, C)) = 1, i.e., D is valid w.r.t. C. The minimum cost edit script is highlighted in Fig. 5.a:
- Dist[0, 0] = Cost_Upd(R(D), R(C)) = 0, since the document and grammar tree roots match: R(D) = R(C).
- Dist[0, 1] = Dist[0, 0] + Cost_InsTree(C_1) × R(C_1).MinOccurs = 0, underlining that C_1 is optional and is not required to appear in the document tree.
- Dist[3, 2] = Dist[0, 1] + TED_XDoc_XGram(D_1, C_2) + TED_XDoc_XGram(D_2, C_2) + TED_XDoc_XGram(D_3, C_2) = 0, since D_1, D_2 and D_3 are exact occurrences of C_2, such that NOcc[2] = R(C_2).MaxOccurs = 3 (3 occurrences of C_2 are allowed to appear, and have actually appeared in the document tree).
- Dist[4, 3] = Dist[3, 2] + TED_XDoc_XGram(D_4, C_3) = 0, with Cost_Upd(R(D_4), R(C_3)) = 0, since D_4 is an exact occurrence of C_3, having R(C_3).MinOccurs = R(C_3).MaxOccurs = 1 (i.e., one and only one occurrence of C_3 is required to appear in the document tree).
Hence, no changes need to be made to D since it is already valid w.r.t. C.

Similar examples in Fig. 5.b and c show the edit distance result when comparing document trees E and F (respectively) with grammar tree C, and are discussed in detail in Appendix V. To summarize, Fig. 5.b shows TED_XDoc_XGram(E, C) = 3, hence Sim_XDoc_XGram(E, C) = 1 / (1 + 3) = 0.25, i.e., E is similar (but not valid) w.r.t. C, highlighting the cost of inserting one additional occurrence of sub-tree C_2 into document tree E in order to make it valid w.r.t. C. Similarly, Fig. 5.c shows TED_XDoc_XGram(F, C) = 4, hence Sim_XDoc_XGram(F, C) = 1 / (1 + 4) = 0.2, which underlines the costs of i) inserting node e in sub-tree F_4, and ii) deleting sub-tree F_5 from document tree F, in order to make it valid w.r.t. C.
This means that F requires more costly transformations to become valid w.r.t. grammar tree C, and thus is less similar to grammar tree C in comparison with document tree E.

Running Example

To sum up, we present the result of comparing sample XML document Paper.xml with XML grammar Paper.xsd in Fig. 6 (both reported from earlier figures for ease of presentation). Paper.xml is represented as XML document tree D following our XML data tree model (Fig. 6.a), and Paper.xsd, designated as G, is represented as a set of conjunctive grammar trees {C_I, C_II, C_III}. TED_XDoc_XGram computations between D and {C_I, C_II, C_III} yield:
- TED_XDoc_XGram(D, C_I) = 1 + 3 = 4, which comes down to: i) the cost of transforming the document sub-tree rooted at Publisher, as an approximate match candidate of grammar sub-tree C_I3 (Cost_Upd = 1, updating node label Publisher into Author), and ii) the cost of inserting one additional occurrence of C_I3 (made of nodes Author, FirstName and LastName, i.e., Cost_InsTree(C_I3) = 3), since the Publisher sub-tree is the only match of C_I3 in D (NOcc[3] = 1) whereas the minimum number of occurrences of C_I3 required to appear in document tree D is R(C_I3).MinOccurs = 2, transforming D into a document tree D' that is valid w.r.t. C_I.

- TED_XDoc_XGram(D, C_II) = 2, i.e., the sum of two Cost_DelTree terms, which corresponds to the costs of deleting the (sub-tree) nodes labeled FirstName and LastName from document tree D, to make it valid w.r.t. C_II.
- TED_XDoc_XGram(D, C_III) = 0, underlining that no changes need to be made to document D, since it is already valid w.r.t. C_III (edit distance matrixes when comparing D with C_I, C_II, and C_III can be found in Appendix V).

Fig. 5. Computing edit distance between XML documents D, E and F and grammar tree C in Fig. 3: a. comparing document tree D and grammar tree C; b. comparing document tree E and grammar tree C; c. comparing document tree F and grammar tree C.

Hence, the structural similarity between XML document Paper.xml and XML grammar Paper.xsd is computed as:

$$
Sim_{XDoc\_XGram}(D, G) = \max_{C_i \in \{C_I, C_{II}, C_{III}\}} \frac{1}{1 + TED_{XDoc\_XGram}(D, C_i)} = \frac{1}{1 + TED_{XDoc\_XGram}(D, C_{III})} = 1
$$

In other words, the maximum similarity value (1 or 100%) indicates that XML document Paper.xml is structurally valid w.r.t. grammar Paper.xsd, and that no transformations need to be applied to the corresponding document tree since it already conforms to the grammar tree representation.

Complexity Analysis

Let |D| be the cardinality of the XML document tree D considered, |G| the number of nodes (elements/attributes) in the XML grammar, N_G = |{C}_G| the number of conjunctive grammars making up the disjunctive normal form of G, and |C_G| the cardinality of the largest conjunctive grammar tree corresponding to G. Our XML document and grammar structure comparison approach is of O(|D| + |G| + (N_G × |D| × |C_G|)) time.
It simplifies to O(|D| × |G|) in the typical (practical) case, and to O(N_G × |D| × |G|) in the worst case.

Time Complexity

Tree Construction: The XML document tree and XML grammar tree construction processes (including algorithm XGram_to_Tree) are of typical linear complexity and simplify to O(|D| + |G|). Algorithm XGram_to_Tree processes XML grammar simplification rules using dedicated index tables to monitor each simplification rule (i.e., detecting whether the grammar expression is of the form targeted by a given transformation rule). This proved computationally efficient in practice, requiring typical O(|G|) time (since the number of simplification rules, and thus the size of the index tables, is constant). On the other hand, document tree construction requires one single traversal over the document, hence O(|D|) time.

Tree Edit Operations Costs: Computing document tree and conjunctive grammar tree edit operations costs requires O(|D| + |C_G|) time: i) algorithm TOC_XGram for computing XML grammar tree edit operations costs is of O(|C_G|) time; ii) likewise, algorithm TOC_XDoc (developed in the Technical Report [73]) for computing document tree operations costs requires O(|D|) time.

Core Tree Edit Distance Algorithm: The TED_XDoc_XGram algorithm for computing the edit distance between the XML document tree and a conjunctive grammar tree is of worst-case O(|D| × |C_G|) complexity. The algorithm recursively goes through the sub-trees of both the XML document and conjunctive grammar trees, combining edit operations so as to identify those of minimal cost. Its main recursive procedure is called once for each pair of sub-trees occurring at the same structural level (depth) in the document and conjunctive grammar trees being compared, thus reflecting a linear dependency on the size of each tree, and thus a quadratic dependency on the sizes of both trees.

Fig. 6. XML document tree and conjunctive grammar trees (both reported from earlier figures): a. XML document tree D representing Paper.xml; b. conjunctive grammar trees C_I, C_II and C_III corresponding to the disjunctive normal form of grammar Paper.xsd, designated as DNF(G) = {C_I, C_II, C_III}.

XML Document/Grammar Comparison: Algorithms TOC_XGram and TED_XDoc_XGram are executed for all conjunctive grammars C_i ∈ {C}_G in order to compute the overall document/grammar edit distance similarity, thus requiring O(N_G × |D| × |C_G|) time. Here, recall that the number of conjunctive grammars N_G resulting from the disjunctive normal form expansion of an input XML grammar G depends on the number and configurations of Or (choice) operators in the input grammar expressions. This may generate a proliferation of conjunctive grammars depending on the expressiveness of the grammar declarations. However, in our approach, the XML document tree and each of the conjunctive grammar trees are compared concurrently (i.e., in parallel) using multi-thread processing (cf. algorithm in Fig. 9). Hence, regardless of the (possibly limited) processing capabilities of the computer system being used, the complexity of the edit distance phase is not (theoretically) affected by the number of conjunctive grammars N_G, and comes down to O(|D| × |C_G|). In addition, most common alternative expressions found in real XML grammars [, 3, 38] generate a number of conjunctive grammars linear in the number of Or operators involved (cf. mathematical analysis in Appendix I). Hence, based on i) the algorithmic design of our approach (Fig. 9), and ii) the relatively simple nature of real XML grammar expressions, the overall complexity of our approach, O(N_G × |D| × |C_G|), typically simplifies to O(|D| × |C_G|), which in turn simplifies to O(|D| × |G|), since |C_G| is linear in the size of G (cf. Section 3).
In the worst case, when the number of conjunctive grammars N_G is explosive (and cannot even be handled using multi-threading), the term N_G cannot be simplified from the equation, and complexity becomes O(N_G × |D| × |G|).

Space Complexity

As for memory consumption, our approach requires O(|D| + N_G × |C_G|) space to store the XML document tree and conjunctive XML grammar trees being compared, in addition to O(N_G × |D| × |C_G|) space to store the corresponding distance matrixes. Yet practically, space complexity simplifies to O(|D| + |G|) + O(|D| × |G|) = O(|D| × |G|), since conjunctive grammar trees consist of references (pointers) to the elements/attributes in the source grammar, and thus require limited space in comparison with the actual document and grammar sizes (even when the child structures of elements in different conjunctive grammar trees are different, represented by their respective pointers). Experimental time and space analyses are provided in the experimental evaluation.

5. Experimental Evaluation

We first describe our prototype and experimental scenarios, and then present and assess empirical results.

5.1. Prototype

We have implemented our XML document/grammar comparison approach in the existing XS3 prototype (XML Structural and Semantic Similarity). Implemented using C# .NET, the XS3 prototype system includes: i) a parser component verifying the integrity of XML documents and grammars, ii) a tree representation component, for transforming XML documents and grammars into their tree representations, and iii) a tree edit distance component for computing document/grammar similarity. An adaptation of the IBM XML documents generator was implemented to produce sets of XML documents and grammars based on specific user input requirements (e.g., a MaxRepeats variability parameter for document generation, the number of And/Or operators and operator positions in synthetic grammars, etc.). In addition, we have implemented an XML document/grammar modification generator. It accepts as input an XML document or an XML grammar, a ModifType value designating the kind of modification to be induced to the document/grammar at hand (i.e., element/attribute insertions, deletions or label updates, cf. Section 5.4), as well as a Modif% value indicating the amount of modifications to be produced w.r.t. document/grammar size (i.e., cardinality). Built upon the main XS3 components are different modules for similarity evaluation: One to One, One to Many (comparing one XML document to a set of grammars and vice-versa, allowing similarity ranking), and Set comparison (enabling XML document/grammar classification). A detailed description of the prototype system is available online.

5.2. Experimental Scenarios

How to experimentally evaluate the quality of an XML similarity method remains a debatable issue, especially in information retrieval.
To our knowledge, the definition of standardized XML similarity evaluation metrics remains a hot topic in the INEX evaluation campaigns. While a few similarity evaluation techniques have been proposed in the context of XML document comparison (e.g., inter- and intra-cluster similarity coefficients [3], the mis-clustering coefficient [48], and cluster-precision and -recall metrics [6]) and grammar comparison (e.g., the overall measure to quantify user effort in grammar matching [7, 44]), to our knowledge none have been proposed for XML document/grammar similarity evaluation, which is probably due to the novelty of the issue. Hence, in the following, we introduce experimental evaluation methods based on the most common applications of XML document/grammar comparison, i.e., document classification and ranked retrieval. We demonstrate our method's effectiveness in classifying similar documents w.r.t. predefined grammars in Section 5.3, and in ranking relevant XML documents (grammars) w.r.t. their resemblances to the grammars (documents) in Section 5.4. In Section 5.5, we perform a hybrid experimental analysis, combining both document classification and grammar transformation, to assess our method's intelligent (noise resistant) behavior in comparing non-conforming yet related documents/grammars, i.e., given a set of grammars, recognizing documents which are similar but are not written exactly in those grammars. A qualitative comparative study is presented in Section 5.6. An experimental complexity analysis is also presented.

5.3. XML Document Classification Experiments

The scenario adopted in our document classification experiments comprises a number of heterogeneous XML databases that exchange documents among each other, each database storing and indexing the local documents according to a set of local grammars. Consequently, XML documents introduced in a given database are matched, via an XML structural similarity method, against the local grammars. In such an application, a similarity threshold is identified, underlining the minimal degree of similarity required to bind an XML document to a grammar.
The XML grammar for which the similarity degree is highest, and above the specified threshold, is selected. Thus, the XML document is accepted as approximately valid for that grammar (documents are exactly valid when similarity is maximal, i.e., Sim_XDoc_XGram = 1). Note that when the similarity score is below the threshold for all grammars in the XML database, the XML document is considered unclassified and is stored separately.

Note on the document generator: a greater MaxRepeats value underlines greater size and variability in generating XML documents, when repeatable elements (associated with *, + in DTDs, or MaxOccurs in XSDs) are encountered.

Evaluation Metrics

Owing to the proficient use of their traditional predecessors in classic information retrieval evaluation [43], and their recent exploitation in XML document clustering (e.g., [6]), we adapt the precision metric (PR, highlighting the fraction of selected entities that are relevant) and the recall metric (R, highlighting the fraction of relevant entities that are selected) in information retrieval to our XML classification scenario, and propose a new method for their usage in order to obtain consistent experimental results. For an extracted class K_i corresponding to a given grammar G_i:
- a_i is the number of XML documents in K_i that indeed correspond to G_i (correctly classified documents, i.e., those that conform to grammar G_i).
- b_i is the number of documents in K_i that do not correspond to G_i (misclassified).
- c_i is the number of XML documents not in K_i, although they correspond to G_i (documents that conform to G_i and that should have been classified in K_i).

Hence, setting n as the total number of classes, which corresponds to the total number of grammars considered for the classification task, we have:

$$
PR = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i}, \qquad
R = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} a_i + \sum_{i=1}^{n} c_i}, \qquad
F\text{-}Value = \frac{2 \times PR \times R}{PR + R} \qquad (6)
$$

High precision denotes that the classification task achieved high accuracy in grouping together documents that actually correspond to the grammars considered. On the other hand, high recall means that very few XML documents are missing from the class where they should have been. In addition to comparing one approach's precision improvement to another's recall, it is also common practice to consider their harmonic mean: the F-value measure. Hence, as with classic information retrieval, high precision and recall, and thus a high F-value (indicating in our case high classification quality), characterize a good (XML document/grammar) similarity method.

Multi-level Classification

In our experiments, we undertook a series of multi-level classification tasks, varying the classification threshold in the [0, 1] interval. In other words, we construct a dendrogram-like structure (Fig. 7) such that:
- For the starting threshold s_1 = 0, all XML documents appear in all classes.
For the final classification threshold s_n = 1 (with n the number of classification levels, i.e., classification sets in the dendrogram), each class will only contain the XML documents which actually conform (i.e., which are exactly valid with respect) to the grammar identifying the class.
Intermediate classification sets will be identified for thresholds s_i / s_1 < s_i < s_n.
Then, we compute precision, recall and F-value for each classification set identified in the dendrogram, thus constructing PR, R and F-value graphs that describe the system's evolution throughout the classification process.

Experimental Results

We conducted experiments on both real and synthetic XML documents to test our XML document/grammar structural comparison method. For real XML data, we utilized the online XML version of the ACM SIGMOD Record, and the University of Wisconsin's Niagara XML document collection, including a large XML data set extracted from the Internet Movie Database IMDB (footnote 3). We performed two main classification experiments to test the effectiveness of our method in comparing: i) related XML documents (i.e., documents sharing identical tag names and related structures), and ii) heterogeneous XML documents (describing different kinds of information, using different structures). The first experiment considers the SIGMOD Record documents, which correspond to three main grammars: OrdinaryIssuePage.dtd, ProceedingsPage.dtd and SigmodRecord.dtd (footnote 4), describing scientific publications. The second experiment considers all three SIGMOD Record, Niagara and IMDB data sets, combining heterogeneous XML data describing different kinds of information, ranging over scientific publications, company profiles, personnel descriptions, movie credentials, and actor descriptions. The characteristics of each document collection and corresponding grammar definitions are shown in Table 3. Grammar statistics are shown in Table 5. We also generated two sets of XML documents from real-case and synthetic grammars (using the synthetic XML document and XML grammar generators implemented in the XS3 prototype).
The first set of documents was created with MaxRepeats = 5, the second with a larger MaxRepeats value, the latter set underlining XML documents with greater size and variability (i.e., greater heterogeneity) w.r.t. the former, when optional and repeatable elements are encountered. The characteristics of synthetic XML datasets are summarized in Table 4 and Table 5.

Footnote 3: XML data extracted using a dedicated wrapper generator.
Footnote 4: We were able to find only one XML file conforming to SigmodRecord.dtd: SigmodRecord.xml. However, due to its relatively large size (479KB) w.r.t. the XML documents corresponding to the other grammars (which are much smaller on average), we carefully decomposed SigmodRecord.xml into several documents, creating a set of documents conforming to SigmodRecord.dtd.
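The evaluation metrics and the multilevel classification described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype's implementation: the names `classify_at_threshold` and `precision_recall_f`, the toy similarity scores, and the rule "a document joins class K_i when its similarity to G_i reaches the threshold" are all hypothetical choices made to match the behaviour described at the two extreme thresholds.

```python
from typing import Dict, Set, Tuple

# Hypothetical input format: document -> grammar -> similarity in [0, 1].
Similarities = Dict[str, Dict[str, float]]


def classify_at_threshold(sim: Similarities, threshold: float) -> Dict[str, Set[str]]:
    """Build one class per grammar, holding every document whose
    similarity to that grammar reaches the threshold (assumed rule)."""
    classes: Dict[str, Set[str]] = {}
    for doc, per_grammar in sim.items():
        for grammar, score in per_grammar.items():
            members = classes.setdefault(grammar, set())
            if score >= threshold:
                members.add(doc)
    return classes


def precision_recall_f(classes: Dict[str, Set[str]],
                       truth: Dict[str, str]) -> Tuple[float, float, float]:
    """Compute PR, R and F-value from the summed a_i, b_i, c_i counts."""
    a = b = c = 0
    for grammar, members in classes.items():
        conforming = {doc for doc, g in truth.items() if g == grammar}
        a += len(members & conforming)   # a_i: correctly classified
        b += len(members - conforming)   # b_i: misclassified
        c += len(conforming - members)   # c_i: missing from the class
    pr = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f = 2 * pr * r / (pr + r) if pr + r else 0.0
    return pr, r, f


# Toy data (invented for illustration): three documents, two grammars.
sim = {
    "d1": {"G1": 1.0, "G2": 0.2},
    "d2": {"G1": 0.4, "G2": 1.0},
    "d3": {"G1": 1.0, "G2": 0.6},
}
truth = {"d1": "G1", "d2": "G2", "d3": "G1"}  # grammar each document conforms to

for s in (0.0, 0.5, 1.0):
    print(s, precision_recall_f(classify_at_threshold(sim, s), truth))
```

With this rule, every document lands in every class at threshold 0, and only exactly valid documents remain at threshold 1, which is the dendrogram behaviour the text describes.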

Table 3. Characteristics of SIGMOD Record, Niagara, and IMDB document sets.
Columns (for each collection): Grammar; N# of Docs; Avg Node Depth (per doc); N# of nodes (per grammar); Avg N# of nodes (per doc).
Grammars (SIGMOD): OrdinaryIssuePage, ProceedingsPage, SigmodRecord
Grammars (Niagara): Profile, Personnel, Club, Bib
Grammars (IMDB): Movies, Actors

Table 4. Characteristics of synthetic document sets.
Columns: Document set; N# of grammars; Number of documents; N# of documents (per grammar); Average Node Depth (per doc); Average Number of Nodes (per doc).
Rows: one per synthetic document set, identified by its MaxRepeats value.

Table 5. The percentage and number of structure model expressions in both sets of real and synthetic grammars.
Grammar set         Sequence exp. (And)   Alternative exp. (Or)   Mixed exp. (And & Or)   Single element exp.   Empty structural model (ε) exp.
Real grammars       9. % (5)              . % (3)                 .87 % (5)               4.98 % (4)            6.9 % (68)
Synthetic grammars  5.59 % (8)            6.99 % ()               9.79 % (4)              .48 % (5)             67.3 % (96)

Note that grammar statistics in Table 5 fairly concur with the empirical analyses in [, 3, 38], highlighting the fact that real-world XML grammars are usually made of simple structural models (e.g., sequence expressions, single element declarations, or basic content models such as PCDATA, String, etc.). In other words, few grammar expressions contain alternative declarations, i.e., Or operators (e.g., less than 7% of all grammar expressions surveyed in [], and less than 6% of those surveyed in [3], contain Or operators - cf. for preliminary statistics). In addition, note that all real and synthetic grammars considered in our experiments are fairly different and do not produce identical documents. In other words, we made certain that a given document cannot conform to two grammars simultaneously, so as to prevent any confusion in computing the precision and recall metrics.

Precision and recall graphs are presented in Fig. 7. One can clearly see that recall (R) is always equal to 1.
This reflects the fact that our XML document/grammar comparison approach constantly identifies, in the grammar classes, the XML documents that actually conform to the grammars considered (i.e., documents having Sim_XDoc_XGram = 1), regardless of the classification threshold as well as the nature of the document collection (related and/or heterogeneous). On the other hand, precision (PR), and consequently F-value (note that F-value follows PR in this experiment, since R is always equal to 1), gradually increases toward 1 while varying the classification threshold from 0 to 1:
When the classification threshold is equal to 0, all documents in the XML repository are considered in each and every class corresponding to the grammars at hand (Fig. 7.a, initial level). That is underlined by minimum PR.
Then, as the classification threshold increases, inconsistent documents are gradually filtered from the XML grammar classes, ultimately yielding classes that only encompass documents conforming to the considered grammars (cf. Fig. 7.a, final level).
In summary, precision and F-value results in Fig. 7 show that our method yields high classification quality with both related and heterogeneous document collections, obtaining optimal classes at a very early stage of the multilevel classification process (with thresholds <.5).

Similarity Ranking Experiments

In addition to XML document classification, we ran a series of experiments to evaluate the ranking capabilities of our document/grammar comparison method.

Experimental Scenario

The approach consists in gradually transforming real XML documents/grammars, and consequently evaluating how closely the obtained similarity results correspond to the induced changes. Here, we exploit two complementary criteria for ranking evaluation: i) an internal criterion, consisting of the amount of modification (transformation) in documents/grammars, and ii) an external criterion, consisting of user predefined rankings.
On one hand, we consider as an internal evaluation criterion the correspondence between the amount of changes and document/grammar similarity values (i.e., similarity decreasing proportionally with the increase in changes, and vice-versa), such that a straight correspondence would underline high ranking quality. On the other hand, we also exploit user-predefined rankings, necessary to highlight the user's perception of document/grammar similarity w.r.t. document/grammar modifications. Note that all DTD grammars were transformed into XSD definitions, replacing DTD cardinality constraints (namely: ?, *, +) with their more expressive XSD counterparts (i.e., MinOccurs and MaxOccurs).
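The replacement of DTD cardinality operators by XSD occurrence bounds mentioned above follows the standard correspondence between the two languages. The sketch below illustrates it; the helper names `DTD_TO_XSD` and `xsd_occurs_attributes` are hypothetical and are not taken from the prototype. In XSD the attributes are spelled `minOccurs` and `maxOccurs`, and both default to 1.

```python
from typing import Dict, Optional, Tuple

# DTD operator -> (minOccurs, maxOccurs); None stands for XSD's "unbounded".
# The empty string denotes an element declared without any cardinality operator.
DTD_TO_XSD: Dict[str, Tuple[int, Optional[int]]] = {
    "": (1, 1),       # exactly once
    "?": (0, 1),      # optional
    "*": (0, None),   # zero or more
    "+": (1, None),   # one or more
}


def xsd_occurs_attributes(dtd_operator: str) -> str:
    """Render the XSD occurrence attributes equivalent to a DTD operator."""
    if dtd_operator not in DTD_TO_XSD:
        raise ValueError(f"unknown DTD cardinality operator: {dtd_operator!r}")
    lower, upper = DTD_TO_XSD[dtd_operator]
    upper_text = "unbounded" if upper is None else str(upper)
    return f'minOccurs="{lower}" maxOccurs="{upper_text}"'


for op in ("", "?", "*", "+"):
    print(repr(op), "->", xsd_occurs_attributes(op))
```

The XSD form is more expressive because the bounds may be any non-negative integers (for instance `minOccurs="2" maxOccurs="5"`), which the three DTD operators cannot express directly.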

[Figure 7 panels: a. Dendrogram obtained when classifying XML documents sampled from the ACM SIGMOD Record; b. Classifying all XML documents of the SIGMOD Record; c. Classifying real XML document sets of SIGMOD, Niagara and IMDB; d. Classifying documents of synthetic set 1; e. Classifying documents of synthetic set 2. Panels b to e plot PR, R and F-value against the classification threshold.]

Fig. 7. XML document classification: dendrogram, and PR, R, F-value graphs.

To produce changes to XML documents/grammars (cf. Fig. 8), we utilize our prototype's modification generator:
For the starting phase of the transformation process, the modification threshold Modif% is set to 0, underlining the original document/grammar structure.
For the final phase, Modif% = 100: the amount of changes in the resulting modified document/grammar amounts to 100% of its original size.
Intermediate transformation phases correspond to 0 < Modif% < 100.

In addition, for each similarity ranking experiment, the modified documents/grammars were manually evaluated, identifying corresponding user-relevant rankings. Thirty graduate students were involved in the experiments. Each subject was given a set of initially conforming documents/grammars and their transformed (modified) versions, and was asked to rank the transformed documents/grammars w.r.t. the original versions (assigning scores ranging from A to F, such as A = Conforming, B = Very Similar, ..., F = Least Similar). Manual answers were then correlated against the system-generated ones in order to identify the statistical dependence between system-generated similarity scores and the user's perception of similarity.
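The correlation of manual answers against system-generated scores can be illustrated with a short sketch. This excerpt does not state which correlation coefficient was used, so the Pearson coefficient below is an assumption, as are the numeric mapping of the grades A to F (`GRADE_TO_SCORE`) and the toy data.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical numeric encoding of user grades: A (Conforming) ... F (Least Similar).
GRADE_TO_SCORE: Dict[str, int] = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "F": 0}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def correlate_with_users(system_sims: Sequence[float], user_grades: Sequence[str]) -> float:
    """Correlate system similarity scores with user-assigned grades."""
    user_scores: List[float] = [float(GRADE_TO_SCORE[g]) for g in user_grades]
    return pearson(system_sims, user_scores)


# Toy data (invented): similarity of each modified version to the original,
# and the grade one subject assigned to the same version.
system_sims = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
user_grades = ["A", "B", "C", "D", "E", "F"]
print(correlate_with_users(system_sims, user_grades))
```

A coefficient close to 1 would indicate that system scores decrease in step with the users' perceived similarity. A rank-based coefficient such as Spearman's would be a reasonable alternative, given that the grades are ordinal.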
