Parallel processing model for XML parsing


Recent Researches in Communications, Signals and Information Technology

ADRIANA GEORGIEVA
Fac. Applied Mathematics and Informatics
Technical University of Sofia, TU-Sofia
Sofia, 8 Kliment Ohridski, 1000
BULGARIA
e-mail: adig@tu-sofia.bg

BOZHIDAR GEORGIEV
Fac. Computer Systems and Control
Technical University of Sofia, TU-Sofia
Sofia, 8 Kliment Ohridski, 1000
BULGARIA
e-mail: bgeorgiev@tu-sofia.bg

Abstract: - This paper presents some development problems and solutions concerning the parallel implementation of an algebraic method for XML data processing. The work is closely connected with modern concepts of parallel programming. The proposed parallel algorithm first partitions the XML document into chunks and then applies the parallel model to process each chunk of the XML tree. The article shows some theoretical aspects of XML functional parsers and of parallel navigation mechanisms over an XML source. The authors suggest a different point of view on XML parsers with the creation of an advanced algebraic processor (including all necessary software tools, search techniques and programming modules). The possibilities of this linear algebraic model, combined with the principles of parallel programming, allow efficient solutions for parsing, search and manipulation over semi-structured data with hierarchical structures. The paper thus combines the building of an algebraic formalism for navigation over the XML hierarchy with the concepts of a modern XML parser and their joint work in parallel. The proposed parallel parsing mechanism is easily accessible to the Web consumer, who is able to control XML file processing, to search for different elements in the file, and to delete and add XML content. The presented tests show higher speed and lower consumption of resources in comparison with some existing commercial XML parsers.

Key-Words: - Hierarchical XML tree, XML parser, XML transformations of semi-structured data, algebraic modeling of XML structures, reversal parser (RP), parallelization, module-finite algebra, XPath scripting language, functional programming.
1 Introduction
With the advent of the information age and the ubiquitous use of the Internet, there is an unprecedented demand for effective and efficient data processing techniques. The exponential growth of the Internet and the Web has flooded people all over the world with data in different formats on a variety of subjects. The widespread use of XML as the panacea for this problem prompted the development of appropriate searching and browsing methods for XML documents. XML is becoming the standard document format, and with XML query languages, users of XML retrieval systems are able to exploit the structural nature of the data and restrict their search to specific structural elements within an XML document. Very large scientific data sets are increasingly becoming available in XML formats [5]. Unfortunately, most XML parsers still use algorithms that are inherently serial and show little improvement on newer computing hardware [1].

ISBN: 978-1-61804-018-7

The current XML implementation landscape does not adequately meet the performance requirements of large-scale applications. Applications using Web services have focused on XML protocol standardization and tool building rather than on addressing the performance bottlenecks that arise when dealing with large volumes of XML data [9]. XML parallel parsing has been studied in depth over the past two decades. XML documents have some structural properties that make them more suited to parallelized parsing than general context-free languages. XML parsers spend a large percentage of their time tokenizing the input in an inherently serial process. In this article, the authors make an effort to parallelize the XML parsing process. Recently, many XML researchers have explored new techniques for parallelizing parsers for very large XML documents [7].

When thinking about a multithreaded solution, it is necessary to consider at least the following strategies, or some mixture of them:
1. Creating multiple parsers and running them in parallel on the XML sources.
2. Rewriting the parsing algorithm to be thread-safe, with the main goal of safely using only one instance of the parser.
3. Splitting the XML source into chunks and assigning the chunks to multiple processing threads.

This paper proposes an approach to parallelizing the XML parsing process in which the XML document is split into fragments (two or more) and the parser works on different fragments in parallel. This model is well suited both to multi-core processors and to multi-threaded programming. The prevalence of large XML documents is another motivation for researchers in their efforts to optimize and parallelize XML parsers. At the same time, multi-core processing is increasingly becoming available on desktop- and laptop-class machines. Distributing the input document across multiple threads is the key to improving the performance of the XML parsing process.

The rest of the paper is organized as follows.
Section 2 introduces the division of the whole XML document into chunks and the subsequent parallel handling of the XML document. It also explores the reversal approach for speeding up XML parser mechanisms. Section 3 describes in detail the functional XML parser previously suggested by the same authors [4][6]. Section 4 presents some XML parser architectures and program realizations along with an algebraic search and hierarchy access. Finally, in Section 5, the general issues, conclusions, further research and some open problems are discussed. The last section gives the performance evaluation results and makes a brief comparison with some similar approaches.

2 XML documents processing
Our first step is a chunk partition that divides the whole input XML document, or a part of it, into several chunks of approximately equal size. The chunk size can be set at run time; as a rule, each chunk should be big enough to keep the number of chunks small and to reduce the post-processing work. Every XML document can be represented by a tree. An example XML fragment is presented here:

    <bank>New Bank
      <branches>Branch 1
        <clients>
          <assets>Account A</assets>
          <assets>Account B</assets>
        </clients>
      </branches>
      <branches>Branch 2
        <clients>
          <assets>Account C</assets>
          <assets>Account D</assets>
        </clients>
      </branches>
      <branches>Branch 3
        <clients>
          <assets>Account E</assets>
          <assets>Account F</assets>
        </clients>
      </branches>
    </bank>

The hierarchical tree corresponding to this XML code is shown in Fig. 1.
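The chunk partition described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: it cuts the example document at the boundaries of the top-level `<branches>` elements instead of at equal byte offsets, and it assumes that this tag carries no attributes and is never nested inside itself. The names `split_chunks`, `parse_chunk` and `parallel_accounts` are hypothetical.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# The example document from this section, written without indentation.
DOC = (
    "<bank>New Bank"
    "<branches>Branch 1<clients><assets>Account A</assets>"
    "<assets>Account B</assets></clients></branches>"
    "<branches>Branch 2<clients><assets>Account C</assets>"
    "<assets>Account D</assets></clients></branches>"
    "<branches>Branch 3<clients><assets>Account E</assets>"
    "<assets>Account F</assets></clients></branches>"
    "</bank>"
)


def split_chunks(doc, tag):
    """Return one well-formed chunk per <tag> element (naive string scan).

    Assumption: <tag> has no attributes and does not contain another <tag>.
    """
    open_t, close_t = f"<{tag}>", f"</{tag}>"
    chunks, pos = [], 0
    while True:
        start = doc.find(open_t, pos)
        if start == -1:
            return chunks
        end = doc.index(close_t, start) + len(close_t)
        chunks.append(doc[start:end])
        pos = end


def parse_chunk(chunk):
    """Parse a single chunk and collect the text of its <assets> nodes."""
    root = ET.fromstring(chunk)
    return [node.text for node in root.iter("assets")]


def parallel_accounts(doc, workers=3):
    """Parse all chunks in a thread pool and merge the partial results."""
    chunks = split_chunks(doc, "branches")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(parse_chunk, chunks))  # map keeps chunk order
    # Post-processing step: concatenate the per-chunk results.
    return [name for part in parts for name in part]


if __name__ == "__main__":
    print(parallel_accounts(DOC))
```

Because each chunk here is a complete element, the post-processing step reduces to concatenating the partial results in chunk order. In CPython the global interpreter lock limits the speedup that threads can give for this kind of work, so a process pool may be the better choice for real measurements.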

[Fig. 1. Hierarchical tree presentation of the XML input file; levels from top to bottom: bank, branches, clients, different types of accounts]

After the parsing process finishes, the information for a chunk is not yet complete. The parallel parser must carry out post-processing for the parsed chunk once all preceding chunks have been parsed. After that, the parsed chunk holds the complete infoset information for the corresponding input chunk and can be put into the parsed chunk pool to be processed by the next stage.

[Fig. 2. Parallel XML parsing algorithm with functional parser (FP) as post mechanism; blocks: input XML document, partition reader, functional parsers (FP), resulting XML tree]

The nodes of the XML tree are stored in a corresponding parser table with 34 bytes for each node. The structure is as follows: the first 16 bytes are allocated for the name of the node, and 1 byte is reserved for the type of the node (element, attribute, data, etc.). This is followed by 1 byte for the level in the hierarchy; bytes 19-22 are allocated for the offset in the row, and the last 12 bytes are reserved for the child nodes of this parent node [8]. Three children can be stored here for each parent node.

As a practical realization of the theoretical research presented so far, the authors suggest a simple parallel reversal parser (RP) based on standard SAX. The reversal parser is a two-stream, two-way XML parser that begins handling the input XML string from both of its ends: it parses from left to right and simultaneously from right to left. The results and analyses are shown in Section 4 to support the theoretical grounds of the proposed solutions. A test scheme was built with input XML documents of different size and complexity.

3 An accelerating navigation over XML documents
This section proposes an approach, inspired by previously explored theoretical formalisms [2], which directly addresses XML hierarchical components. This approach, offered by the same authors [3][8] for extending the data processing possibilities in an XML hierarchy [4], is applied here to accelerate navigation over XML documents in the form of hierarchical trees. A main goal of this analysis is to provide modern linear algebra tools for working on the XML document through direct access to the nodes of the corresponding XML tree.

The conceptual model of a hierarchy is presented as an algebraic structure $A = (A_1, A_2, \ldots, A_n)$, a family of modules $A_i$ over the rings $\alpha_i$, where $\alpha_i$ is the dimension of each database domain $D_i$ ($\alpha_1 < \alpha_2 < \ldots < \alpha_n$) and $n$ is the number of hierarchical levels. This conceptual model makes it possible to work with natural numbers only, i.e. with the code values of the XML database elements. That means a simple physical organization, because a physical address can be calculated for every hierarchical object from the finite sequence of the code values of its attributes.
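The claim that an address follows from the code values alone can be illustrated with a small mixed-radix computation. The sketch below is an assumption-laden illustration, not the authors' algebraic processor: it assumes a regular tree in which every node of level i has the same number of children, numbers the nodes level by level starting from 1, and uses the hypothetical names `level_offset`, `address`, `fanouts` and `roots`.

```python
from math import prod


def level_offset(codes, fanouts):
    """1-based position of a node inside its own level.

    codes   -- 1-based code values (a_1, ..., a_k) along the path to the node
    fanouts -- children per node for levels 1..k-1 (assumed constant per level)
    """
    pos = 0
    # Horner-style accumulation over all levels except the last one.
    for code, fanout in zip(codes[:-1], fanouts):
        pos = (pos + (code - 1)) * fanout
    return pos + codes[-1]


def address(codes, fanouts, roots=1):
    """Level-order address: nodes in all higher levels plus the level offset."""
    k = len(codes)
    # Level i+1 holds roots * fanouts[0] * ... * fanouts[i-1] nodes.
    nodes_above = sum(roots * prod(fanouts[:i]) for i in range(k - 1))
    return nodes_above + level_offset(codes, fanouts[: k - 1])


if __name__ == "__main__":
    # Bank example: 1 bank, 3 branches each, 1 clients node each, 2 assets each.
    fanouts = (3, 1, 2)
    # "Account D": bank 1 -> branch 2 -> clients 1 -> asset 2.
    print(address((1, 2, 1, 2), fanouts))  # 11
```

For the bank example, the levels above the accounts hold 1 + 3 + 3 = 7 nodes, and "Account D" is the fourth node of the account level, which gives address 11. Only integer additions and multiplications are involved, which is the property the conceptual model relies on.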
The computation is performed by ordinary algebraic operations on integers. This provides efficient direct access to every element of the XML hierarchical data structure, which is very important for the organization and support of data structures, especially when XML documents are processed in parallel.

The set $A = (A_1, A_2, \ldots, A_n)$ is considered as an algebraic model of a hierarchical data structure with $n$ levels, which is partially ordered by inclusion: $A_1 \subset A_2 \subset \ldots \subset A_i \subset \ldots \subset A_n$. Any object from level $n$ with its attributes can therefore be represented by a finite sequence $(a_1, a_2, \ldots, a_n)$ of the set $A$. In this algebraic model the transition from the conceptual to the internal level is defined as the mapping $\Phi: A \to P$, which assigns to every finite sequence $(a_1, a_2, \ldots, a_n) \in A$, in a one-to-one manner, a fixed integer $p_n \in P$. The integer $p_n$ uniquely defines the place of the object $O_n$ from level $n$ in the real structure M, i.e. its address in the physical database design P, and thus the position of the object $O_n$ in the XML tree. For the physical data representation it is proved that the mapping $\Phi: A \to P$ is a bijective and linear mapping (linear function) [2][3]. As a result of this theoretical model, the physical address of the object $O_k$ (an object from level $k$) in the common structure, i.e. the number $p_k$, is uniquely determined in the following way:

$$p_k = \sum_{i=1}^{k-1} \alpha_i + \bar{a}_k = \alpha_1 + \alpha_2 + \ldots + \alpha_{k-1} + \bar{a}_k = \alpha_1 \Phi(h_1) + \alpha_1 \Phi(h_2) + \ldots + \alpha_1 \Phi(h_{k-1}) + \bar{a}_k = \alpha_1 \sum_{i=1}^{k-1} \Phi(h_i) + \bar{a}_k, \qquad (1)$$

where:
- $\Phi(h_1) = c_0 = 1$; $\Phi(h_2) = c_0 c_1 = c_1$; $\Phi(h_3) = c_0 c_1 c_2 = c_1 c_2$; ...; $\Phi(h_k) = c_0 c_1 c_2 \cdots c_{k-1} = c_1 c_2 \cdots c_{k-1}$ are the transformed characteristic elements of the tree;
- $c_0, c_1, c_2, \ldots, c_k$ are the numbers of children (subordinated elements) of any element from level $i$ to level $i+1$; $c_i \ge 1$; ordinarily $c_0 = 1$, and

$$\bar{a}_k = \{\ldots\{(a_1 - 1) c_1 + (a_2 - 1)\} c_2 + \ldots + (a_{k-1} - 1)\} c_{k-1} + a_k. \qquad (2)$$

Here $\bar{a}_k$ is the code value of $a_k$ within hierarchical level $k$. The calculations in formula (2) are based on the formal description of the sets of code values of the XML node components. According to the suggested formal algebraic description, each object in a real XML hierarchical data structure can be regarded as an element of the corresponding hierarchical structure [3]. For the XML physical data design, a one-dimensional address array is chosen that holds the codes of all XML database elements, of the type $E(i_1 k_1, i_2 k_2, \ldots, i_n k_n)$, which presents the XML data structure in increasing order of the corresponding hierarchical levels. Here the expression $k_m i_m$ is the dimension of level $m$ in the hierarchy for each $m = 1, 2, \ldots, n$. So every object from the real XML hierarchy can be obtained on the conceptual level as an element of the algebraic structure (an n-sequence) and, respectively, on the physical level as an address from the corresponding physical structure. Depending on the purposes and the character of the user application, this gives the following representation:

$$O_n \xrightarrow{\;f\;} (a_1, a_2, \ldots, a_n) \xrightarrow{\;\Phi\;} p_n \qquad (3)$$

Finally, the proposed approach differs from many other well-known query and transformation languages (such as XSL, XSLT and XPath) with respect to their definition, expressiveness and search techniques.

4 Practical researches and tests
As already mentioned, the practical results include the creation of a high-performance parallel reversal XML parser and its application as the main component in parallel parsing strategies. The practical realization is a Windows application developed in the Eclipse 3.2 programming environment.
Java is very suitable for the main goals of this research [10]. This object-oriented language offers a set of facilities for fast real-time thread programming. In this way, Java is highly effective for creating contemporary multilevel parsers on new hardware systems with chip multiprocessors. In the research proposed in this article, the Apache Xerces SAX parser for Java is chosen as the comparative parser.

[Fig. 3. Traditional SAX parser compared with reversal parser (RP); x-axis: file capacity (MB); y-axis: time (seconds); series: reversal parser (RP), SAX parser]
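A serial SAX baseline of the kind plotted in Fig. 3 can be timed with a few lines of code. The sketch below is only an illustration of the measurement idea: it uses Python's standard `xml.sax` module as a stand-in for Apache Xerces, generates a synthetic document shaped like the bank example, and uses the hypothetical names `CountingHandler`, `make_document` and `time_sax_parse`. It does not reproduce the authors' test scheme or their figures.

```python
import time
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """SAX handler that only counts element start events."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def make_document(n_branches):
    """Build a synthetic bank document with n_branches <branches> elements."""
    body = "".join(
        f"<branches>Branch {i}<clients><assets>Account {i}</assets>"
        "</clients></branches>"
        for i in range(n_branches)
    )
    return f"<bank>{body}</bank>".encode("utf-8")


def time_sax_parse(data):
    """Return (elapsed seconds, number of elements) for one serial SAX pass."""
    handler = CountingHandler()
    start = time.perf_counter()
    xml.sax.parseString(data, handler)
    return time.perf_counter() - start, handler.elements


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        data = make_document(n)
        elapsed, count = time_sax_parse(data)
        print(f"{len(data) / 1e6:.2f} MB, {count} elements, {elapsed:.3f} s")
```

Running the same loop with a parallel parser in place of `time_sax_parse` would give the second series of such a comparison; absolute times depend on the machine and the parser, so no expected values are given here.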

The proposed reversal approach introduces two parsing processes that work in parallel, which speeds up the handling of huge XML documents. Fig. 3 shows the results of the test examples. The diagram describes the difference in speed between the two types of parsers, the classical SAX parser and the reversal parser, for several file sizes and the corresponding parsing times. When the parallel parser is used, it accelerates the whole process by approximately 45% compared with the traditional SAX parser. This percentage depends on the input file size. Fig. 3 shows that the bigger the XML document is, the higher the speedup the parallel parser can achieve, because a huge XML document can be split into more subtasks that are parsed in parallel, which maximizes the utilization of multiprocessors. In the proposed reversal XML parser, simple, typical and predictable error situations are resolved by means of the Java exceptions software module.

5 Conclusion and future work
This article describes some possibilities for building a new high-performance parallel XML parsing algorithm. The concept of a parallel parser is a possible solution for accelerating XML document processing. Embedding this parallel mechanism in the XML parsing process will satisfy the need for more effective and faster parser processing. This approach raises the rate of parsing and is especially useful in parsing procedures for huge input XML documents. Our future work includes an initial chunk partition of the XML document into more than two chunks, the use of multiple functional parallel parsers, the application of new contemporary algebraic methods, schema validation, etc. The further development of this research aims to extend these theoretical ideas with practical examples as soon as possible.
The authors hope that this direction of research is important for advancing the development of query languages, and that these new propositions in the traditional theory of parsing, translation and compiling will have an effect on XML DB and Web practice.

References:
[1] Keogh J. and Davidson K., XML DeMYSTiFied, McGraw-Hill, Emeryville, California, USA, 2005.
[2] Georgieva A. and Georgiev B., Conceptual Method for Extension of Data Processing Possibilities in XML Hierarchy, Fourth International Scientific Conference, Kavala, Greece, 2008.
[3] Georgieva A. and Georgiev B., A Navigation over XML Documents through Linear Algebra Tools, The Fourth International Conference on Internet and Web Applications and Services - ICIW 09, Venice/Mestre, Italy, 24-29 May 2009, Published by IEEE Computer Society, ISBN: 978-0-7695-3613-2/09.
[4] Georgieva A. (2003), One Algebraic Method of Database Design, Proceedings of the 1-st International Conference on Mathematics for Industry /M 2003/, Thessaloniki-Greece (235-243).
[5] World Wide Web Consortium, http://www.w3.org/tr/2004/. Extensible Markup Language (XML) 1.0. W3C Recommendation, third edition, February 2004.
[6] Darlington J., Henderson P., and Turner D., Functional programming and its applications, Cambridge University Press, ISBN 0 521 24503 6, 1982.
[7] Michael R. Head and Madhusudhan G., Parallel Processing of Large-Scale XML-Based Application Documents on Multi-Core Architectures with PiXiMaL. In IEEE Fourth International Conference on eScience, pages 261-268, Indianapolis, IN, December 2008. doi: 10.1109/eScience.2008.77.
[8] Georgiev B. and Georgieva A., Realization of Algebraic Processor for XML Documents Processing, AIP Conference Proceedings (36-th International Conference AMEE-10), vol. 1293, 2010.
[9] XML on Wall Street, http://lighthousepartners.com/xml
[10] Shishedjiev, B., M. Goranova, Georgieva, XML-based Language for Specific Scientific Data Description, Proceedings of The Fifth International Conference on Internet and Web Applications and Services, ICIW 2010, ISBN: 978-0-7695-4022-1, 9-15 May 2010, Spain.