HCMX: AN EFFICIENT HYBRID CLUSTERING APPROACH FOR MULTI-VERSION XML DOCUMENTS

VIJAY SONAWANE 1, D. RAJESWARA RAO 2

1 Research Scholar, Department of CSE, K.L. University, Green Fields, Guntur, Andhra Pradesh
2 Professor, Department of CSE, K.L. University, Green Fields, Guntur, Andhra Pradesh
E-mail: 1 vijaysonawane11@gmail.com, 2 rajeshduvvada@kluniversity.in

ABSTRACT

To retrieve useful information from the large and growing number of XML documents on the web, effective management of XML documents is essential. One solution is to cluster XML documents to find knowledge that promotes effective information management and maintenance. In the real world, however, XML documents are dynamic. In contrast to static XML documents, the changes from one version of an XML document to the next cannot be predicted, so clustering techniques designed for static XML documents cannot be used to cluster multiple versions of XML documents. For multi-version XML documents, the preliminary clustering solution is no longer valid once new document versions appear. XML documents are also self-descriptive, which results in large document sizes. Finding a new clustering solution after a change by comparing all documents is therefore not a viable solution. In this paper we propose a hybrid clustering approach to cluster multi-version XML documents. The approach improves clustering speed by limiting the growing size of XML documents with a homomorphic compression scheme and by using distance information from the preliminary clustering solution together with the changes recorded in a compressed delta.

Keywords: HCMX, Hybrid Clustering, Cluster Re-evaluation, Multiversion, PCP, CSRP, Compressed Delta.

1. INTRODUCTION

With the huge growth of the Internet, XML has become a universal standard for information representation and exchange over the Internet [1]. Owing to the simple and flexible nature of XML, various applications, from scientific writing and technical papers to news summaries [2], use XML for information exchange and representation.
XML is also used to represent the web-based free-content encyclopedia Wikipedia, which has more than 3.4 million XML documents. XML offers many features for online business functions such as content integration and intelligence. The growing popularity of XML has raised many concerns about how to efficiently maintain and manage XML data and how to retrieve XML documents from large collections. One feasible way to handle a large XML collection is to group similar XML documents into clusters. Clustering similar XML documents is regarded as one of the more effective ways to improve document handling, since it facilitates better information retrieval, data indexing, data integration and query processing [3].

But there are several challenges in clustering XML documents. Unlike the clustering of text documents or normal data, clustering XML documents is a complex process [4], and as a result the most commonly used text clustering methods cannot be replicated for these documents. This is because changes in XML documents are application specific. In the real world the content of XML documents is dynamic, and the changes to it are limitless and unpredictable [5]. Every change to an original XML document gives birth to a new version of it. Dynamic XML documents are applicable in many fields of information maintenance and management, which creates the demand for multi-version support [6]. It is therefore also necessary to store the different versions of XML documents. Despite its potential, storing all versions of an XML document increases redundancy and makes searching and querying harder on a growing document collection. In a clustering solution for multi-version XML documents, a new document version is not available in advance, nor is it a completely new document.
A new version differs only to a certain degree from its predecessor. XML is self-describing, which gives it enormous flexibility but also introduces the problem of verbosity, which results in huge document sizes [7] and poor response times.

In this paper we introduce a hybrid clustering approach named HCMX to find a clustering solution for multi-version XML documents that change dynamically. We judge how much of a document is affected, rather than treating each incoming document as a new version. To find a new clustering solution after changes to the preliminary one, instead of comparing all members of the clustering solution we use the distance information from the preliminary clustering phase together with the changes that produced the document version. To improve clustering speed and response time, a homomorphic compression scheme is used, which retains the original structure of the documents.

2. MOTIVATION

With the increasing number of XML documents on the web, it has become necessary to organise these documents properly in order to retrieve useful information. Without such an effective organisation, a search has to scan the whole collection of XML documents, which results in poor response times. To maintain an XML document collection efficiently, it is therefore crucial to cluster these documents based on their similarity. Along with information retrieval, clustering can also be used to discover knowledge for web mining and query execution [12]. Clustering XML documents is not as straightforward as clustering text documents [10]. The nature of XML documents poses several challenges:

1) The structure of an XML document defines hierarchical relationships among its elements, and these relationships should be preserved.
2) In the real world, the difference between consecutive versions of an XML document varies, so static XML document clustering techniques cannot be used [11].

3. METHODOLOGY

A clustering solution S over a given set of XML documents is represented as a complete graph with $|S|(|S|-1)/2$ weighted edges, one per document pair.
To find the single-link clusters of level L, the edges with weight w ≥ L are removed; the remaining connected edges give the resulting clusters [8]. If the correspondence between two XML documents is used as the measure for finding the clustering solution, the weight of the edge connecting two documents represents the distance between them.

Preliminary Clustering Phase (PCP) (one-time execution): Figure-1(a) shows the preliminary clustering phase, in which we first:

i) Compress the input XML documents using a homomorphic compression technique (XGrind in our case).
ii) Find clusters using any distance-based clustering algorithm.
iii) Save the distance matrix between all XML document pairs, with the set of operations corresponding to each minimum distance.

[Figure-1. a) Preliminary Clustering Phase: XML Document Set -> Homomorphic Compression -> Calculate Structural Distance -> Structural Distance Matrix -> Preliminary Clustering. b) Expensive Re-Clustering Method: XML Document Set -> Apply Version Creation Program -> Versions of XML Documents -> Homomorphic Compression -> Re-calculate Structural Distance -> Structural Distance Matrix -> New Clustering Solution.]

Distance: To obtain the distance between two XML documents, the edit operations Op (insert, delete, update) with the least cost that transform one document into the other are used:

$$d(D_1, D_2) = \min\big(dist(D_1 \to D_2),\ dist(D_2 \to D_1)\big)$$

A smaller distance between two documents indicates a greater resemblance between them. Two documents are said to be equal when the total cost of the operations is zero. When any document in the preliminary clustering solution changes, its distance to the rest of the documents in the cluster also changes. The number and type of these changes determine whether the document stays in the same cluster or forms its own cluster. To find the new clustering solution after documents in the preliminary clustering change, the modified distances between all XML document pairs must be reassessed.

Figure-1(b) shows one possible solution. As shown, the new clustering solution can be obtained after changes by comparing all document pairs again (full comparison). But this is not a cost-effective solution because:

i) It incurs redundancy by recalculating the distance between each pair of documents through a full comparison.
ii) It does not consider the degree of change in a document; most of the time a new version is not modified at all or carries only a small amount of change, so most of the operations are needlessly repeated.

Our proposed HCMX approach is depicted in Figure-2. HCMX is divided into two stages:

1) Preliminary Clustering Phase (PCP): This phase forms the base for HCMX.
2) Clustering Solution Re-evaluation Phase (CSRP): This phase is repeated whenever documents in the preliminary clustering solution change. It proceeds as follows:
   1) Use the set of changes between the preliminary clustering phase and the current timestamp, recorded in the compressed delta.
   2) Read the distance matrix saved during the preliminary clustering phase.
   3) Re-evaluate the distances between documents based on the distances calculated during PCP, the changes responsible for the document versions recorded in the compressed delta, and the least-cost direction.

The output of this phase forms the base for the next iterative run of CSRP. The most important part of CSRP is re-evaluating the distances based on the distances calculated during PCP and the set of changes recorded in the compressed delta. The next section presents the method to achieve this.

4. DISTANCE MEASUREMENT

The hierarchical relationship between the elements of an XML document makes it easy to perform operations on documents.
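As an illustration of such operations, the following minimal sketch applies insert, update and delete operations to a small XML tree. It is written in Python with the standard `xml.etree.ElementTree` module purely for illustration: the helper `apply_op`, the sample document and the use of an uncompressed tree are assumptions, whereas the approach itself works on XGrind-compressed documents and is implemented in Java.

```python
import xml.etree.ElementTree as ET


def apply_op(root, kind, parent_path, tag, text=None):
    """Apply one edit operation to the child `tag` of the element at `parent_path`.

    `apply_op` is a hypothetical helper for this sketch, not part of HCMX.
    """
    parent = root.find(parent_path)
    if parent is None:
        raise ValueError(f"no element at {parent_path!r}")
    if kind == "insert":
        # A new child is appended below the parent element.
        ET.SubElement(parent, tag).text = text
        return
    target = parent.find(tag)
    if target is None:
        raise ValueError(f"no child {tag!r} under {parent_path!r}")
    if kind == "delete":
        parent.remove(target)
    elif kind == "update":
        target.text = text
    else:
        raise ValueError(f"unknown operation {kind!r}")


# A document D and three operations that turn it into a version D*.
doc = ET.fromstring("<lib><book><title>XML</title><year>2015</year></book></lib>")
apply_op(doc, "insert", "book", "author", "Rao")
apply_op(doc, "update", "book", "year", "2016")
apply_op(doc, "delete", "book", "title")

result = ET.tostring(doc, encoding="unicode")
print(result)  # <lib><book><year>2016</year><author>Rao</author></book></lib>
```

Each operation is addressed through the parent element, which is what the hierarchical structure makes straightforward; the list of operations applied here plays the role of a delta between the two versions.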
The homomorphic compression scheme used maintains the original document structure at a reduced size. In multi-version XML documents, new versions are obtained by applying insert, update and delete operations in combination on document version nodes, and the sum of these operations is stored in the compressed delta.

[Figure-2. HCMX clustering approach: XML Document Set -> Compressed Document Version -> Find Compressed Delta -> Use Previous Structural Distance with Delta -> New Structural Distance Matrix -> Final Clustering Composition -> New Clustering Solution.]

Compressed Delta ($C_\Delta$): Given a dynamic XML document $D$ with its version $D^*$, the compressed delta records the changes from one state of the document to another. It consists of a set of operations Op (insert, delete, update) whose execution on document $D$ returns the document in state $D^*$.

Cost of Compressed Delta ($CC_\Delta$): The sum of the costs of the operations Op (insert, delete, update) recorded in the compressed delta that convert $D$ into $D^*$.

Inverted Operation ($O^{-1}$): Given a dynamic XML document $D$ with its version $D^*$, if executing an operation Op (insert, delete, update) on document $D$ returns its version $D^*$, then executing the inverted operation on version $D^*$ returns the original document $D$; i.e., an insert is the inverted operation of the corresponding delete.

If the preliminary clustering solution S contains $D_1$ and $D_2$, with $d(D_1, D_2)$ the distance between them, $(D_1 \to D_2)$ the least-cost direction, and the set of changes responsible for document versioning saved in the compressed delta, then the new distance $d^*$ can be defined as:

1. If the set of changes stored in $C_\Delta$ transforms $D_1$ into $D_1^*$, the new distance between $D_1^*$ and $D_2$ can be defined as:

$$d^*(D_1^*, D_2) = \big(d(D_1, D_2) \cup C_\Delta(D_1, D_1^*)\big) - \big(d(D_1, D_2) \cap C_\Delta(D_1, D_1^*)\big)$$

i.e., when $D_1$ changes to $D_1^*$, the operations common to the transformation of $D_1$ into $D_1^*$ and to the least distance are deducted, as their effect is equal, so only the unique operations need to be considered.

2. If the set of changes stored in $C_\Delta$ transforms $D_2$ into $D_2^*$, the new distance between $D_1$ and $D_2^*$ can be defined as:

$$d^*(D_1, D_2^*) = d(D_1, D_2) + C_\Delta(D_2, D_2^*) - \sum_{i=1}^{q} \big(O_i + O_i^{-1}\big)$$

Here $O_i$ are the $q$ operations from $d(D_1, D_2)$ that have corresponding inverted operations $O_i^{-1}$ in $C_\Delta(D_2, D_2^*)$, $1 \le i \le q$; i.e., the operations that give the least distance between $D_1$ and $D_2$ and were subsequently inverted during the transformation of $D_2$ into $D_2^*$ are deducted when calculating the distance between $D_1$ and $D_2^*$, as their combined effect is null, whereas only the distinct non-inverted operations need to be considered.

3. If the set of changes stored in $C_{\Delta 1}$ transforms $D_1$ into $D_1^*$ and the set of changes stored in $C_{\Delta 2}$ transforms $D_2$ into $D_2^*$, the new distance between $D_1^*$ and $D_2^*$ can be defined as:

$$d^*(D_1^*, D_2^*) = \Big[\big(d(D_1, D_2) \cup C_{\Delta 1}(D_1, D_1^*)\big) - \big(d(D_1, D_2) \cap C_{\Delta 1}(D_1, D_1^*)\big)\Big] + C_{\Delta 2}(D_2, D_2^*) - \sum_{i=1}^{p} \big(O_i + O_i^{-1}\big)$$

Here $O_i$ are the $p$ residual operations from $d(D_1, D_2)$, after removing the operations repeated in $C_{\Delta 1}$, that have corresponding inverted operations $O_i^{-1}$ in $C_{\Delta 2}(D_2, D_2^*)$, $1 \le i \le p$; i.e., when both $D_1$ and $D_2$ have changed into their versions $D_1^*$ and $D_2^*$, both of the above formulas apply in finding the new distance $d^*(D_1^*, D_2^*)$.

4. If neither $D_1$ nor $D_2$ changes, the new distance $d^*$ is the same as the previous distance $d$:

$$d^*(D_1, D_2) = d(D_1, D_2)$$

An unchanged distance indicates that the similarity between the two documents is fully preserved.

In these formulas a set of operations contributes its total cost whenever it appears in a sum or difference.

5. HCMX ALGORITHM

Input: Clustering solution of compressed dynamic XML documents.
Output: Re-evaluated clustering solution.

For each pair of compressed documents [D1, D2] belonging to the clustering solution, with D1 ≠ D2:

1.1. If document D1 changes but document D2 does not change then
       shared_op_cost = 0
       for each operation O belonging to d(D1, D2)
           if operation O belongs to CΔ(D1, D1*) then
               shared_op_cost = shared_op_cost + 2 * cost(O)
       next operation O
       new distance d*(D1*, D2) = d(D1, D2) + cost(CΔ(D1, D1*)) - shared_op_cost

1.2. If document D2 changes but document D1 does not change then
       inverted_op_cost = 0
       for each operation O belonging to d(D1, D2)
           if the inverted operation O^-1 belongs to CΔ(D2, D2*) then
               inverted_op_cost = inverted_op_cost + cost(O) + cost(O^-1)
       next operation O
       new distance d*(D1, D2*) = d(D1, D2) + cost(CΔ(D2, D2*)) - inverted_op_cost

1.3. If both documents D1 and D2 change then
       shared_op_cost = 0
       inverted_op_cost = 0
       for each operation O belonging to d(D1, D2)
           if operation O belongs to CΔ(D1, D1*) then
               shared_op_cost = shared_op_cost + 2 * cost(O)
           otherwise, if the inverted operation O^-1 belongs to CΔ(D2, D2*) then
               inverted_op_cost = inverted_op_cost + cost(O) + cost(O^-1)
       next operation O
       new distance d*(D1*, D2*) = d(D1, D2) + cost(CΔ(D1, D1*)) + cost(CΔ(D2, D2*)) - shared_op_cost - inverted_op_cost

1.4. If neither document D1 nor document D2 changes then
       the new distance d*(D1, D2) is the same as d(D1, D2)
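The four cases of the algorithm above can be sketched in a few lines. This is a minimal Python illustration, not the authors' Java implementation: the `Op` record, the `inverse` helper, the node paths, the unit costs and the treatment of update as having no inverse are all assumptions made for the example. Following case 3 of Section 4, an operation is checked for an inverted counterpart only if it was not already matched as a shared operation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class Op:
    """One edit operation; `node` is a hypothetical path naming its target."""
    kind: str          # "insert", "delete" or "update"
    node: str
    cost: float = 1.0  # assumed unit cost


def inverse(op: Op) -> Optional[Op]:
    """Inverted operation: insert <-> delete (updates get none, by assumption)."""
    swap = {"insert": "delete", "delete": "insert"}
    if op.kind not in swap:
        return None
    return Op(swap[op.kind], op.node, op.cost)


def total_cost(ops: Iterable[Op]) -> float:
    return sum(op.cost for op in ops)


def reevaluate(dist_ops: Sequence[Op],
               delta1: Sequence[Op] = (),
               delta2: Sequence[Op] = ()) -> float:
    """New distance from the stored least-cost operations and the two deltas.

    dist_ops: operations behind d(D1, D2), saved during the preliminary phase
    delta1:   compressed delta D1 -> D1* (empty if D1 did not change)
    delta2:   compressed delta D2 -> D2* (empty if D2 did not change)
    """
    in_delta1, in_delta2 = set(delta1), set(delta2)
    shared_op_cost = 0.0
    inverted_op_cost = 0.0
    for op in dist_ops:
        if op in in_delta1:
            # Shared with the D1 delta: counted in both sets, so deduct twice.
            shared_op_cost += 2 * op.cost
        else:
            inv = inverse(op)
            if inv is not None and inv in in_delta2:
                # Undone by the D2 delta: deduct the operation and its inverse.
                inverted_op_cost += op.cost + inv.cost
    return (total_cost(dist_ops) + total_cost(delta1) + total_cost(delta2)
            - shared_op_cost - inverted_op_cost)


d12 = [Op("insert", "/a/b"), Op("delete", "/a/c"), Op("update", "/a/d")]
print(reevaluate(d12))                                       # 3.0 (step 1.4)
print(reevaluate(d12, delta1=[Op("insert", "/a/b")]))        # 2.0 (step 1.1)
print(reevaluate(d12, delta2=[Op("delete", "/a/b"),
                              Op("insert", "/a/f")]))        # 3.0 (step 1.2)
print(reevaluate(d12, delta1=[Op("insert", "/a/b")],
                 delta2=[Op("insert", "/a/c")]))             # 1.0 (step 1.3)
```

With empty deltas both deductions are zero, so a single function covers all four steps; only the stored operations and the deltas are read, and no pair of documents is compared again.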

Table-1. Experimental Results

No. of      Input documents   Size after          % of changes   Clustering time (milliseconds)
documents   size (KB)         compression (KB)    applied        Full comparison    HCMX
50          500               350                 10             3225               964
100         1000              700                 20             5610               1465
150         1500              1040                50             8256               2104
200         2000              1390                80             11034              2811
250         2500              1740                100            13813              3520

6. EXPERIMENTAL EVALUATION

The proposed HCMX approach is implemented in Java. To assess its performance we used documents of variable size from the XML data repository [9]. As shown in Figure-1, the preliminary clustering solution of the given input XML documents is obtained by recording and using the least distance between all compressed document pairs, together with the set of operations corresponding to each least distance. New document versions are created with different percentages of changes using a version creation program. The main objective of the evaluation is to assess the time required to find a new clustering solution when documents change their distance in the preliminary clustering solution. We compared the clustering time of HCMX with that of full comparison. The experimental results are shown in Table-1. The result charts in Figure-3 to Figure-5 reveal that the proposed HCMX performs better despite the increase in the number of documents, their size and the applied changes.

[Figure-3. Result chart 1.]

[Figure-4. Result chart 2.]

7. CONCLUSION

In this paper we have proposed a hybrid clustering approach, HCMX, to cluster multi-version XML documents when the preliminary clustering solution becomes outdated. Finding an updated clustering solution after document versioning by comparing all documents incurs a large number of redundant operations. The proposed HCMX approach judges how much of a document is affected and re-evaluates the clusters using the effect of the temporal changes recorded in the compressed delta and the distances recorded during the preliminary clustering phase. Experimental results show that the proposed approach is time effective, as the homomorphic compression scheme greatly reduces document size and the operations involved in clustering solution re-evaluation (CSRP) are reduced.
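As a rough reading of Table-1, the reported figures can be turned into speed-up factors and size reductions. The short Python sketch below only restates the numbers from the table; the variable names and the choice of these two ratios as summary measures are assumptions made for illustration.

```python
# Rows of Table-1:
# (documents, input_kb, compressed_kb, pct_changes, full_ms, hcmx_ms)
rows = [
    (50, 500, 350, 10, 3225, 964),
    (100, 1000, 700, 20, 5610, 1465),
    (150, 1500, 1040, 50, 8256, 2104),
    (200, 2000, 1390, 80, 11034, 2811),
    (250, 2500, 1740, 100, 13813, 3520),
]

speedups = []
size_savings = []
for docs, input_kb, compressed_kb, pct, full_ms, hcmx_ms in rows:
    speedup = full_ms / hcmx_ms             # how many times faster HCMX is
    saving = 1 - compressed_kb / input_kb   # fraction of size removed
    speedups.append(speedup)
    size_savings.append(saving)
    print(f"{docs:>3} docs, {pct:>3}% changes: "
          f"speed-up {speedup:.2f}x, size reduced by {saving:.1%}")
```

On these figures HCMX is between roughly 3.3 and 3.9 times faster than full comparison, and compression removes about 30% of the input size in every row.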

[Figure-5. Result chart 3.]

REFERENCES:

[1]. E. Wilde and R. J. Glushko. XML fever. J. Comm. ACM, 2008, (51), pp. 40-46.
[2]. A. Tagarelli and S. Greco. Semantic clustering of XML documents. ACM Transactions on Information Systems, Vol. 28, No. 1, 2009, pp. 1-56.
[3]. T. Tran, S. Kutty, and R. Nayak. Utilizing the structure and content information for XML document clustering. Lecture Notes in Computer Science, Springer Berlin Heidelberg, Volume 5631, 2009, pp. 460-468.
[4]. S. Kutty, T. Tran, R. Nayak, and Y. Li. Clustering XML documents using closed frequent subtrees: A structural similarity approach. Lecture Notes in Computer Science, Springer Berlin / Heidelberg, Volume 4862, 2008, pp. 183-194.
[5]. Dyreson, C. and Grandi, F. Temporal XML. Database Systems, 2009, pp. 3032-3035.
[6]. Sidra F., Mansoor S. Temporal and multiversioned XML documents: A survey. Information Processing and Management, Vol. 50, No. 1, 2014, pp. 113-131.
[7]. Sherif Sakr. XML compression techniques: A survey and comparison. Journal of Computer and System Science, Vol. 75, No. 5, 2009, pp. 302-322.
[8]. Dalamagas, T., Cheng, T., Winkel, K.J. and Sellis, T. Clustering XML documents by Structure. SETN LNAI Springer, 3025, 2004, pp. 12-121.
[9]. XML data repository, online at www.cs.washington.edu/research/projects/xmltk/xmldata
[10]. Vijay Sonawane and D. Rajeshwara Rao. An Optimistic Approach for Clustering Multi-version XML Documents Using Compressed Delta. International Journal of Electrical and Computer Engineering, Vol. 5, Issue 6, 2015, ISSN 2088-8708.
[11]. Vijay Sonawane and D. R. Rao. A Comparative Study: Change Detection and Querying Dynamic XML Documents. International Journal of Electrical and Computer Engineering, Vol. 5, No. 4, 2015, ISSN 2088-8708.
[12]. Baeza-Yates, R. and Ribeiro-Neto, B. Modern information retrieval: The concepts and technology behind search. ACM Press/Addison-Wesley, 2011.
[13]. Gao, M. and Chen, F. Clustering XML Data Streams by Structure based on Sliding Windows and Exponential Histograms. Proceedings of the international conference on advances in databases, knowledge, and data applications, 2013, pp. 224-230.
[14]. Wuwongse, V., Yoshikawa, M., and Amagasa, T. Temporal versioning of XML documents. Proceedings of the Seventh International conference on digital libraries: International collaboration and cross-fertilization, 2004, pp. 419-428.
[15]. Cavalieri, F., Guerrini, G., Mesiti, M. and Oliboni, B. On the reduction of sequences of XML document and schema update operations. Proceedings of the IEEE twenty-seventh international conference on data engineering workshops, 2011, pp. 77-88.
[16]. A. Tagarelli and S. Greco. Semantic clustering of XML documents. ACM Transactions on Information Systems, Vol. 28, No. 1, 2010, pp. 1-56.
[17]. M.X. Gao, W.J. Yao, and G.J. Mao. Exponential histogram of cluster feature for XML stream. Journal of Beijing University of Technology, Vol. 37, 2011, pp. 1242-1248.