Recent Researches in Communications, Signals and Information Technology

Parallel processing model for XML parsing

ADRIANA GEORGIEVA
Fac. Applied Mathematics and Informatics
Technical University of Sofia, TU-Sofia
Sofia, 8 Kliment Ohridski, 1000
BULGARIA
e-mail: adig@tu-sofia.bg

BOZHIDAR GEORGIEV
Fac. Computer Systems and Control
Technical University of Sofia, TU-Sofia
Sofia, 8 Kliment Ohridski, 1000
BULGARIA
e-mail: bgeogiev@tu-sofia.bg

Abstract: - This paper presents some development problems and solutions concerning the parallel implementation of an algebraic method for XML data processing, in close connection with modern concepts of parallel programming. The proposed parallel algorithm first partitions the XML document into chunks and then applies the parallel model to process each chunk of the XML tree. The article presents some theoretical aspects of XML functional parsers and of parallel navigation mechanisms over an XML source. The authors suggest a different point of view on XML parsers through the creation of an advanced algebraic processor (including all necessary software tools, search techniques and programming modules). The possibilities of this linear algebraic model, combined with principles of parallel programming, allow efficient solutions for parsing, search and manipulation over semi-structured data with hierarchical structures. The paper thus combines the building of an algebraic formalism for navigation over the XML hierarchy with concepts of the modern XML parser and their mutual work in parallel. The proposed parallel parsing mechanism is easily accessible to the Web consumer, who is able to control XML file processing, to search for different elements in it, and to delete and add XML content. The presented tests show higher speed and lower consumption of resources in comparison with some existing commercial XML parsers.

Key-Words: - Hierarchical XML tree, XML parser, XML transformations of semi-structured data, algebraic modeling of XML structures, reversal parser (RP), parallelization, module-finite algebra, XPath scripting language, functional programming.
1 Introduction

With the advent of the information age and the ubiquitous use of the Internet, there is an unprecedented demand for effective and efficient techniques for data processing. The exponential growth of the Internet and the Web has flooded people all over the world with quantities of data in different formats on a variety of subjects. The widespread use of XML as the panacea for this problem prompted the development of appropriate searching and browsing methods for XML documents. XML is going to become the standard document format, and with the use of XML query languages, users of XML retrieval systems are able to exploit the structural nature of the data and restrict their search to specific structural elements within an XML document. Very large scientific data sets are increasingly becoming available in XML formats [5]. Unfortunately, most XML parsers still use algorithms that are inherently serial and show little improvement on newer computing hardware [1].

ISBN: 978-1-61804-081-7
The current XML implementation landscape does not adequately meet the performance requirements of large-scale applications. The applications using Web services have rather focused on XML protocol standardization and tool building efforts, and not on addressing the performance bottlenecks that arise when dealing with large volumes of XML data [9]. XML parallel parsing has been studied in depth over the past two decades. XML documents have some structural properties that make them better suited to parallelized parsing than general context-free languages. XML parsers spend a large percentage of their time tokenizing the input in an inherently serial process. In this article, the authors make an effort to parallelize the XML parsing process. Recently, many XML researchers have explored new techniques for parallelizing parsers for very large XML documents [7]. When thinking about a multithreaded solution it is necessary to consider at least the following strategies or some mixtures of them:

1. Creating multiple parsers and running them in parallel on the XML sources.
2. Rewriting the parsing algorithms to be thread-safe, with the main goal of using only one instance of the parser.
3. Splitting the XML source into chunks and assigning the chunks to multiple processing threads.

This paper proposes an approach to parallelizing the XML parsing process in which the XML document is split into fragments (two or more) and the parser works on different fragments in parallel. This model is well suited both to multi-core processors and to multi-threaded programming. The prevalence of large XML documents is another motivation for researchers in their efforts to optimize and parallelize XML parsers. At the same time, multi-core processing is increasingly becoming available on desktop- and laptop-class computing machines. Distributing the input document across multiple threads is the key to improving the performance of the XML parsing process.

The rest of the paper is organized as follows.
Section 2 introduces the process of dividing the whole XML document into chunks and the subsequent parallel handling of the XML document. It also explores the reversal approach for speeding up XML parser mechanisms. Section 3 describes in detail the functional XML parser previously suggested by the same authors [4][6]. Section 4 presents some XML parser architectures and program realizations along with an algebraic search and hierarchy access. Finally, in Section 5, the general issues, conclusions, further research and some open problems are discussed. The last section gives the performance evaluation results and makes a brief comparison with some similar approaches.

2 XML documents processing

Our first step is to divide, by means of chunk partition, the whole or part of the input XML document into several chunks of approximately equal size. The chunk size can be set at run time and, as a rule, each chunk should be big enough to minimize the number of chunks and reduce the post-processing manipulations. Every XML document can be represented by a tree. Here is an example of an XML fragment:

    <bank>new Bank
      <branches>branch 1
        <clients>
          <assets>account A</assets>
          <assets>account B</assets>
        </clients>
      </branches>
      <branches>Branch 2
        <clients>
          <assets>account C</assets>
          <assets>account D</assets>
        </clients>
      </branches>
      <branches>Branch 3
        <clients>
          <assets>account E</assets>
          <assets>account F</assets>
        </clients>
      </branches>
    </bank>

The hierarchical tree corresponding to this XML code is shown in Fig. 1.
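The chunk partition step described at the start of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `partition_chunks` is hypothetical, chunk boundaries are simply moved forward to the next `<` character, and the sketch ignores comments, CDATA sections and any `<` that does not open a tag.

```python
def partition_chunks(xml_text, n_chunks):
    """Split xml_text into at most n_chunks pieces of roughly equal size.

    Every piece after the first starts at a '<' character, so no tag is
    cut in half (assumption: '<' only ever opens a tag).
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    target = max(1, len(xml_text) // n_chunks)  # approximate chunk size
    chunks, start = [], 0
    while len(chunks) < n_chunks - 1:
        # Look for the first tag boundary at or after the target size.
        cut = xml_text.find("<", start + target)
        if cut == -1:  # no further tag boundary: stop splitting
            break
        chunks.append(xml_text[start:cut])
        start = cut
    chunks.append(xml_text[start:])  # the last chunk takes the remainder
    return chunks


# Hypothetical input: a shortened version of the bank fragment.
fragment = (
    "<bank>new Bank"
    "<branches>Branch 2<clients>"
    "<assets>account C</assets><assets>account D</assets>"
    "</clients></branches>"
    "</bank>"
)
for number, chunk in enumerate(partition_chunks(fragment, 3), start=1):
    print(number, repr(chunk))
```

In the scheme of this paper, each such chunk would be handed to its own parser thread, and the per-chunk results would be completed in the post-processing stage.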
Fig. 1 Hierarchical tree presentation of XML input file (levels from top to bottom: bank, branches, clients, different types of accounts)

After the parsing process of a chunk finishes, the information for the chunk is not yet complete. The parallel parser must carry out post-processing for the parsed chunk once all preceding chunks are parsed. After that, the parsed chunk holds the complete infoset information for the corresponding input chunk and can be put into the parsed chunk pool to be processed by the next stage.

Fig. 2 Parallel XML parsing algorithm with functional parser (FP) as post mechanism (diagram blocks: input XML document, partition reader, functional parsers FP, result XML tree)

The nodes of the XML tree are presented in a corresponding parse table with 34 bytes for each node. The structure is as follows: the first 16 bytes are allocated for the name of the node;
1 byte is reserved for the type of node (element, attribute, data etc.). This is followed by 1 byte for the level in the hierarchy; the next bytes 19-22 are allocated for the shift in the row, and the last 12 bytes are reserved for the child nodes of this parent node [8]. Three children can be stored here for each parent node.

As a practical realization of the theoretical research presented so far, the authors suggest a simple parallel reversal parser (RP), based on standard SAX. The reversal parser (RP for short) is a two-stream, two-way XML parser that begins handling the input XML string from both of its ends: it runs the parsing process from left to right and simultaneously from right to left. The results and analyses are shown in Section 4 to support the theoretical grounds of the proposed solutions. A test schema was built with XML input documents of different size and complexity.

3 An accelerating navigation over XML documents

This section proposes an approach, inspired by explored theoretical formalisms [2], which directly addresses XML hierarchical components. This approach, offered by the same authors [3] [8] for extension of data processing possibilities in the XML hierarchy [4], is applied here for accelerating navigation over XML documents in the form of hierarchical trees. A main goal of this analysis is to provide modern linear algebra tools for working on the XML document through direct access to the nodes of the appropriate XML tree. The conceptual model of a hierarchy is presented as an algebraic structure $A = (A_1, A_2, \ldots, A_n)$, a family of modules $A_i$ over the rings $\alpha_i$, where $\alpha_i$ is the dimension of each database domain $D_i$ ($\alpha_1 < \alpha_2 < \ldots < \alpha_n$) and $n$ is the number of hierarchical levels. This conceptual model permits working with natural numbers only, i.e. the code values of the XML database elements. That means a simple physical organization, because a physical address can be calculated for every hierarchical object from a finite sequence of the code values of its attributes.
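To illustrate how an address can be obtained from code values using integer arithmetic alone, the following Python sketch evaluates formulas (1) and (2), which are stated later in this section, for the bank fragment of Section 2. It is a hedged illustration rather than the authors' algebraic processor: the function names are hypothetical, codes are assumed to be 1-based, and every node of level i is assumed to have the same number of children c_i.

```python
def offset_in_level(codes, fanout):
    """Position of an object inside its own level, following formula (2).

    codes  -- 1-based code values (a_1, ..., a_k) along the path to the object
    fanout -- children per node (c_1, c_2, ...) for levels 1, 2, ...
    """
    acc = 0
    for a_i, c_i in zip(codes[:-1], fanout):
        acc = (acc + a_i - 1) * c_i
    return acc + codes[-1]


def nodes_above_level(k, alpha1, fanout):
    """Number of nodes in levels 1..k-1, i.e. the sum of alpha_i in formula (1)."""
    total, size = 0, alpha1
    for c_i in fanout[:k - 1]:
        total += size   # add the size of the current level
        size *= c_i     # size of the next level: alpha_1 * c_1 * ... * c_i
    return total


def address(codes, alpha1, fanout):
    """Address p_k of the object identified by codes (a_1, ..., a_k)."""
    return nodes_above_level(len(codes), alpha1, fanout) + offset_in_level(codes, fanout)


# Bank fragment (assumed shape): 1 bank, 3 branches per bank,
# 1 clients node per branch, 2 assets per clients node.
fanout = [3, 1, 2]
# "account D": bank 1, branch 2, clients 1, asset 2
print(address((1, 2, 1, 2), alpha1=1, fanout=fanout))  # prints 11
```

Under these assumptions levels 1 to 3 hold 1 + 3 + 3 = 7 nodes and "account D" is the fourth asset, which gives address 11; the six assets occupy the consecutive addresses 8 to 13.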
The computation is performed by ordinary algebraic operations on integers. This provides efficient direct access to every element of the XML hierarchical data structure, which is very important for the organization and support of data structures, especially under parallel processing of XML documents.

The set $A = (A_1, A_2, \ldots, A_n)$ is considered as an algebraic model of a hierarchical data structure with $n$ levels, which is partially ordered by inclusion: $A_1 \subset A_2 \subset \ldots \subset A_i \subset \ldots \subset A_n$. So any object from level $n$ with its attributes can be represented by the finite sequence $(a_1, a_2, \ldots, a_n)$ of the set $A$. In this algebraic model the conceptual-to-internal transition is defined as the mapping $\Phi : A \to P$, which assigns to every finite sequence $(a_1, a_2, \ldots, a_n) \in A$, in a one-to-one manner, a fixed integer $p_n \in P$. Here the integer $p_n$ uniquely defines the place of the object $O_n$ from level $n$ in the real structure $M$, i.e. its address in the physical database design $P$. This integer defines in a simple way the position of the object $O_n$ in the XML tree. For the physical data representation it is proved that the mapping $\Phi : A \to P$ is a bijective and linear mapping (linear function) [2] [3].

As a result of this theoretical model, the physical address of the object $O_k$ (an object from level $k$) in the common structure, i.e. the number $p_k$, is uniquely determined in the following way:

$$p_k = \sum_{i=1}^{k-1} \alpha_i + a'_k = \alpha_1 + \alpha_2 + \cdots + \alpha_{k-1} + a'_k = \alpha_1 \Phi(h_1) + \alpha_1 \Phi(h_2) + \cdots + \alpha_1 \Phi(h_{k-1}) + a'_k = \alpha_1 \sum_{i=1}^{k-1} \Phi(h_i) + a'_k, \qquad (1)$$
where:

- $\Phi(h_1) = c_0 = 1$; $\Phi(h_2) = c_0 c_1 = c_1$; $\Phi(h_3) = c_0 c_1 c_2 = c_1 c_2$; ...; $\Phi(h_k) = c_0 c_1 c_2 \cdots c_{k-1} = c_1 c_2 \cdots c_{k-1}$ are the transformed characteristic elements from the tree;
- $c_0, c_1, c_2, \ldots, c_{k-1}$ are the numbers of children (subordinated elements) of any element from level $i$ to level $i+1$; $c_i \geq 1$; ordinarily $c_0 = 1$;

and

$$a'_k = \bigl(\cdots\bigl((a_1 - 1)\, c_1 + (a_2 - 1)\bigr)\, c_2 + \cdots + (a_{k-1} - 1)\bigr)\, c_{k-1} + a_k \qquad (2)$$

Here $a'_k$ is the code value $a_k$ in hierarchical level $k$. The calculations in formula (2) are based on the formal description of the sets of code values of the XML node components. According to the suggested formal algebraic description, each of the objects in a real XML hierarchical data structure can be accepted as an element of the corresponding hierarchical structure [3]. For the XML physical data design, a one-dimensional address array is chosen with the codes of all XML database elements, of the type $E(i_1^{k_1}, i_2^{k_2}, \ldots, i_n^{k_n})$, which presents the XML data structure in increasing order of the corresponding hierarchical levels. Here the expression $i_m^{k_m}$ is the dimension of level $m$ in the hierarchy for each $m = 1, 2, \ldots, n$.

So every object from the real XML hierarchy can be obtained on the conceptual level as an element of the algebraic structure (n-sequence) and, respectively, on the physical level as an address in the corresponding physical structure. Depending on the purposes and the character of the user application, this gives the following representation:

$$O_n \xrightarrow{\;f\;} (a_1, a_2, \ldots, a_n) \xrightarrow{\;\Phi\;} p_n \qquad (3)$$

Finally, the proposed approach differs from many other well-known query and transformation languages (such as XSL, XSLT, XPath) with respect to their definition, expressiveness, and search techniques.

4 Practical researches and tests

As already mentioned, the practical results include the creation of a high-performance parallel reversal XML parser and its application as the main component in parallel parsing strategies. The practical realization includes a Windows application, which works in the program development environment Eclipse 3.2.
The Java language is very suitable for the main goals of this research [10]. This object-oriented language possesses a set of possibilities for fast thread programming in real time. In this way, Java provides great effectiveness in the process of creating contemporary multilevel parsers on new hardware systems with chip multiprocessors. In the research proposed in this article, the SAX parser of Java Apache Xerces is chosen as the comparative parser.

Fig. 3. Traditional SAX parser compared with reversal parser (RP) (axes: time in seconds, 0 to 14, versus file capacity in MB, 0 to 150; series: reversal RP and SAX parser)
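The two-way scanning idea behind the reversal parser (RP) can be illustrated with a small Python sketch. It is not the authors' SAX-based Java implementation: it only extracts tags, the names `scan_forward`, `scan_backward` and `reversal_tokenize` are hypothetical, the meeting point is assumed to be the middle of the input, and the input is assumed to contain no comments, CDATA sections or `<` and `>` characters outside tag delimiters. In standard CPython builds the two threads do not run truly in parallel because of the global interpreter lock, so the sketch shows the structure of the approach rather than its speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_forward(text, stop):
    """Collect, left to right, every tag that starts before index stop."""
    tags, pos = [], 0
    while True:
        i = text.find("<", pos, stop)
        if i == -1:
            break
        j = text.index(">", i)      # a tag may end beyond stop
        tags.append(text[i:j + 1])
        pos = j + 1
    return tags


def scan_backward(text, stop):
    """Collect, right to left, every tag that starts at index stop or later."""
    tags, end = [], len(text)
    while True:
        j = text.rfind(">", stop, end)
        if j == -1:
            break
        i = text.rfind("<", 0, j)
        if i < stop:                # this tag belongs to the forward scanner
            break
        tags.append(text[i:j + 1])
        end = i
    tags.reverse()                  # restore document order
    return tags


def reversal_tokenize(text):
    """Tokenize tags from both ends of text with two worker threads."""
    mid = len(text) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(scan_forward, text, mid)
        right = pool.submit(scan_backward, text, mid)
        return left.result() + right.result()


xml = "<bank><branches><clients><assets>account A</assets></clients></branches></bank>"
print(reversal_tokenize(xml))
```

A tag that straddles the midpoint is assigned to the forward scanner, so each tag is reported exactly once and the concatenated result is in document order.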
The proposed reversal approach introduces two parsing processes working in parallel, which speeds up the handling of huge XML documents. Fig. 3 shows the results of the test examples. The diagram describes the difference in speed between the two types of parsers, the classical SAX parser and the reversal parser, and summarizes several cases of file capacity with the corresponding parsing time. When the parallel parser is used, it accelerates the whole process by approximately 45% compared with the traditional SAX parser. This percentage depends on the input file capacity. From Fig. 3 we can see that the bigger the XML document is, the higher the speedup the parallel parser can achieve, because a huge XML document can be split into more subtasks to be parsed in parallel, which maximizes the utilization of multiprocessors. In the proposed reversal XML parser, simple, typical and predictable error situations are resolved by means of a Java exceptions software module.

5 Conclusion and future work

This article describes some possibilities for building a new parallel XML parsing algorithm with high performance. The concept of a parallel parser is a possible solution for the acceleration of XML document processing. The embedding of this parallel mechanism in the process of XML parsing will satisfy the need for more effective and faster parser processing. This approach raises the rate of parsing and is especially useful in the parsing procedures for huge input XML documents. Our future work includes initial chunk partition of the XML document into more than two chunks, using multiple functional parallel parsers, applying new contemporary algebraic methods, schema validation etc. The further development of this research foresees extending these theoretical ideas with practical examples as soon as possible.
The authors hope that this direction of research will prove important for advancing query language development, and that these new propositions in the traditional theory of parsing, translation and compiling will have an effect on XML DB and Web practice.

References:
[1] Keogh J. and Davidson K., XML DeMYSTiFied, McGraw-Hill, Emeryville, California, USA, 2005.
[2] Georgieva A. and Georgiev B., Conceptual Method for Extension of Data Processing Possibilities in XML Hierarchy, Fourth International Scientific Conference, Kavala, Greece, 2008.
[3] Georgieva A. and Georgiev B., A Navigation over XML Documents through Linear Algebra Tools, The Fourth International Conference on Internet and Web Applications and Services - ICIW 09, Venice/Mestre, Italy, 24-29 May 2009, published by IEEE Computer Society, ISBN: 978-0-7695-3613-2/09.
[4] Georgieva A. (2003), One Algebraic Method of Database Design, Proceedings of the 1-st International Conference on Mathematics for Industry /M 2003/, Thessaloniki-Greece (235-243).
[5] World Wide Web Consortium, http://www.w3.org/tr/2004/. Extensible Markup Language (XML) 1.0. W3C Recommendation, third edition, February 2004.
[6] Darlington J., Henderson P., and Turner D., Functional programming and its applications, Cambridge University Press, ISBN 0 521 24503 6, 1982.
[7] Michael R. Head and Madhusudhan G., Parallel Processing of Large-Scale XML-Based Application Documents on Multi-Core Architectures with PiXiMaL. In IEEE Fourth International Conference on eScience, pages 261-268, Indianapolis, IN, December 2008. doi: 10.1109/eScience.2008.77.
[8] Georgiev B. and Georgieva A., Realization of Algebraic Processor for XML Documents Processing, AIP Conference Proceedings (36-th International Conference AMEE-10), vol. 1293, 2010.
[9] XML on Wall Street, http://lighthousepartners.com/xml
[10] Shishedjiev, B., M. Goranova, Georgieva, XML-based Language for Specific Scientific Data Description, Proceedings of The Fifth International Conference on Internet and Web Applications and Services, ICIW 2010, ISBN: 978-0-7695-4022-1, 9-15 May 2010, Spain.