Evaluation of Partial Path Queries on XML Data


Stefanos Souldatos, Dept of EE & CE, NTUA, Greece, stef@dblab.ntua.gr
Theodore Dalamagas, Dept of EE & CE, NTUA, Greece, dalamag@dblab.ntua.gr
Xiaoying Wu, Dept. of CS, NJIT, USA, xw43@njit.edu
Timos Sellis, Dept of EE & CE, NTUA, Greece, timos@dblab.ntua.gr
Dimitri Theodoratos, Dept. of CS, NJIT, USA, dth@cs.njit.edu

ABSTRACT

XML query languages typically allow the specification of structural patterns of elements. Finding the occurrences of such patterns in an XML tree is the key operation in XML query processing. Many algorithms have been presented for this operation. These algorithms focus mainly on the evaluation of path-pattern or tree-pattern queries. In this paper, we define a partial path-pattern query language, and we address the problem of its efficient evaluation on XML data. In order to process partial path-pattern queries, we introduce a set of sound and complete inference rules to characterize structural relationship derivation. We provide necessary and sufficient conditions for detecting query unsatisfiability and node redundancy. We show how partial path-pattern queries can be equivalently put in a canonical directed acyclic graph form. We developed two stack-based algorithms for the evaluation of partial path-pattern queries, PartialMJ and PartialPathStack. PartialMJ computes answers to the query by merge-joining the results of the root-to-leaf paths of a spanning tree of the query. PartialPathStack exploits a topological order of the nodes of the query graph to match the query pattern as a whole to the XML tree. The experimental evaluation of our algorithms shows that PartialPathStack is independent of intermediate results and largely outperforms PartialMJ.

Categories and Subject Descriptors
H.2.3 [Database Management]: Languages: Query languages; H.2.4 [Database Management]: Systems: Query processing

General Terms
Algorithms, Languages, Experimentation

The author is partially supported by the Alexander S. Onassis Public Benefit Foundation scholarship.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'07, November 6-8, 2007, Lisboa, Portugal. Copyright 2007 ACM 978-1-59593-803-9/07/0011...$5.00.

Keywords
tree-structured data, partial path-pattern query, query evaluation

1. INTRODUCTION

XML query languages typically allow the specification of structural patterns of elements. Finding the occurrences of such patterns in an XML tree is the key operation in XML query processing. Many algorithms have been presented for this operation. These algorithms focus mainly on the evaluation of path-pattern or tree-pattern queries. A restrictive feature of these queries is that they determine a total order for the elements in every path of the query. For instance, the XPath path query //year//author//title retrieves title nodes from a bibliographic XML document. In this query, node year can only be an ancestor of node author, and node author can only be an ancestor of node title.

However, the standard query language for XML, XQuery, and even its core language, XPath, allow for structural patterns that do not form a complete path or tree. Consider, for instance, the following XPath query that involves reverse axes: //title[descendant-or-self::*[ancestor-or-self::year][ancestor-or-self::author]]. This query asks for title nodes in paths that also involve year and author nodes, but no specific order is required for those nodes in the path. In this sense, this is a path-pattern query where the structure of a path is partially specified in the pattern. Olteanu et al. [20, 19] show that XPath queries with reverse axes like the one shown above can be equivalently rewritten as a set of tree-pattern queries.
However, they also show that this transformation may lead to an exponential blowup of the number of tree-pattern queries. Gottlob et al. [11] show that the combined complexity of XPath is P-hard. Further, in practice, there is a need to query XML data when the structure is not fully known to the user, or to query XML data sources with different structures in an integrated way [7, 12, 16, 21]. In order to deal with these problems, query languages are adopted that relax the structure of a path in a tree pattern. An extreme case is that of keyword-based languages for XML [9, 7, 12, 16]. Clearly, these query languages need to be accompanied by efficient evaluation techniques.

Partial path query language. In this paper, we define a query language that allows a partial specification of path patterns. Queries in this language do not require a total order for the nodes in the pattern. The language is general enough to encompass, on the one side, path-pattern queries and, on the other side, queries without structural relationships (nodes lying on the same path without order). This language can express different types of XPath expressions such as, for instance, those mentioned above. It is also the constituent component of partial tree-pattern queries [21, 22].

The problem. We address the problem of efficiently evaluating partial path-pattern queries. A partial path-pattern query can be expressed equivalently by a set of path-pattern queries. There are several algorithms for evaluating path-pattern queries. For instance, Bruno et al. [2] provide an algorithm which is asymptotically optimal for path-pattern queries without repetitions of the same node label in the pattern for the stream data model. However, as mentioned earlier, this equivalent set may contain a number of path patterns that is exponential in the number of query nodes. Clearly, generating an equivalent set of path patterns therefore does not lead to efficient evaluation techniques. For this reason, we focus on techniques that can directly process and match the partial path pattern to the XML data tree.

Contribution. The main contributions of this paper are the following:

Because the structure of a path may not be fully specified in a partial path-pattern query, new structural expressions can be derived from those explicitly specified in the query. These structural expressions are important in query processing. We define a sound and complete set of inference rules to fully characterize structural relationship derivation.

Unlike path-pattern queries, partial path-pattern queries can be unsatisfiable. We provide necessary and sufficient conditions for detecting query unsatisfiability. Detecting unsatisfiable queries prevents accessing the data, which can be very large, at a small overhead.

Partial path-pattern queries can contain redundant nodes, i.e., nodes that can be removed without affecting the meaning of the query.
We provide conditions for efficiently identifying redundant nodes.

We show how partial path-pattern queries can be equivalently put in a canonical form, which is a directed acyclic graph (dag). We exploit the canonical form of queries to design query evaluation algorithms.

We developed a stack-based algorithm PartialMJ for the evaluation of partial path-pattern queries. PartialMJ extracts a spanning tree from the query dag. It uses an extension of algorithm PathStack [2] to compute the results of the root-to-leaf paths of the spanning tree. These results are produced in root-to-leaf order in the XML tree and are merge-joined to compute the answer of the query. PartialMJ may generate intermediate results for the root-to-leaf paths of the spanning tree that cannot contribute to the answer of the query.

To overcome the intermediate result problem, we developed a novel holistic stack-based algorithm PartialPathStack for the evaluation of partial path-pattern queries. PartialPathStack exploits a topological order of the nodes in the query dag, and matches the query dag as a whole to the XML tree. We analyze the complexity of PartialPathStack, and we show that it is independent of intermediate results.

We implemented both algorithms, and we performed an extensive experimental evaluation. The experimental results confirm the dependence of PartialMJ on intermediate results and the superiority of PartialPathStack.

Paper outline. The next section discusses related work. Section 3 presents the XML data model and our language for partial path queries. Section 4 addresses query processing issues. In Section 5, we present our two algorithms. Section 6 shows the experimental results. We conclude and discuss future work in Section 7.

2. RELATED WORK

Previous papers focus on finding matches of binary structural relationships (a.k.a. structural joins). In [25], the authors presented the Multi-Predicate Merge Join algorithm (MPMGJN) for finding such matches. Al-Khalifa et al.
[1] introduced a family of stack-based join algorithms, which are more efficient than MPMGJN, as they do not require multiple traversals of the XML tree. Algorithms for structural join order optimization were introduced in [23]. Structural join techniques can be further improved using various types of indexes [6, 14, 24].

One can exploit the above techniques to evaluate a path-pattern query or a tree-pattern query. The task involves the following phases: decomposing the query into binary structural relationships, then finding their matches, and, finally, stitching together these matches. This is inefficient due to the large number of intermediate results. To deal with this problem, Bruno et al. [2] presented two stack-based join algorithms (PathStack and TwigStack) for the evaluation of path-pattern queries and tree-pattern queries, respectively. PathStack is optimal for path-pattern queries, while TwigStack is optimal for tree-pattern queries without child relationships.

Several researchers have worked on extending TwigStack. For example, in [17], algorithm TwigStackList efficiently evaluates tree-pattern queries in the presence of child relationships. Also, in [4], algorithm Twig2Stack can evaluate generalized tree-pattern queries including optional relationships. Chen et al. [3] proposed algorithms that handle queries over dag-structured data. Evaluation methods for tree-pattern queries with OR predicates are developed in [13]. In [15], the XR-tree index [14] is used to avoid processing input that does not participate in the answer of the query. Finally, [10] introduces algorithm TwigOptimal, which applies the notion of virtual cursors [24] to enhance the traversal of the XML tree during tree-pattern query evaluation.

All the above tree-pattern query evaluation techniques assume that there are access mechanisms, i.e., indexes, that efficiently return a stream of nodes in the XML tree that satisfy a given node predicate. Nodes within streams are usually represented by their positional representation [25] (see Section 3.1).
Other types of streaming, e.g., Tag+Level Streaming and Prefix-Path Streaming, are suggested in [5]. In [18], instead of the region encoding positional representation, the authors used an extended Dewey labelling scheme to facilitate query evaluation.

Partially specified tree-pattern queries were introduced in [21, 22]. In these papers, partial tree-pattern queries are evaluated by generating a set of complete tree-pattern queries based on index graphs (structural summaries of data). Here, we focus on the evaluation of partial path-pattern queries. These queries are dags in their canonical form. To the best of our knowledge, no previous holistic algorithms exist for their evaluation.

3. DATA MODEL AND QUERY LANGUAGE

In this section, we discuss the XML data model and the region encoding positional representation technique, and we introduce the partial path query language.

3.1 XML Data

An XML database is commonly modelled by a tree structure. Tree nodes represent and are labelled by elements, attributes, or values. Tree edges represent element-subelement, element-attribute, and element-value relationships. Without loss of generality, we assume that the root node of an XML tree represents an element labelled by r, and no other node is labelled by r. Such a root node can always be added to a tree if not initially there. Figure 1 shows an XML tree. The triplets next to the nodes encode their position in the tree, and they are explained below.

[Figure 1: XML tree. Nodes and their (begin, end, level) triplets: r (1,19,1), a (2,4,2), b (3,3,3), a (5,18,2), c (6,16,3), a (7,9,4), d (8,8,5), a (10,12,4), e (11,11,5), a (13,15,4), f (14,14,5), g (17,17,3)]

Positional representation. XML query processing algorithms require an efficient technique for representing the position of nodes in an XML tree. A commonly used technique is the so-called region encoding [8, 25, 1, 2, 15], where tree nodes are represented by triplets of the form (begin, end, level). The begin and end values of a node can be determined through a depth-first traversal of the XML tree, by sequentially assigning numbers to the first and the last visit of the node. The level value represents the level of the node in the XML tree. For simplicity, we assume that one XML tree is processed at a time. If the database comprises multiple trees, a fourth field, treeid, can be used to denote the XML tree of the node in the database.

Region encoding simplifies checking structural relationships between two nodes: node n1 is an ancestor of node n2 iff n1.begin < n2.begin and n2.end < n1.end. Node n1 is the parent of node n2 iff n1.begin < n2.begin, n2.end < n1.end, and n1.level = n2.level - 1.

3.2 Partial Path Queries

We now introduce the syntax and semantics of partial path queries.

Syntax. A partial path query specifies a path pattern where the structure may not be fully defined.

Definition 3.1. Let a_i denote a variable ranging over nodes in an XML tree labelled by a, a ≠ r. A structural relationship is an expression of the form r/a_i, a_i/a_j, or a_i/b_j (child relationship), or of the form r//a_i, a_i//a_j, or a_i//b_j (descendant relationship). A partial path query is a non-empty set of structural relationships.

Figure 2 shows four partial path queries.

[Figure 2: Partial path queries q1, q2, q3, and q4]

We can represent a query as a node-labelled graph. The nodes of the graph correspond to the variables of the query. There is a single (resp. double) arrow from node a_i to node b_j iff the structural relationship a_i/b_j (resp. a_i//b_j) belongs to the query. Figure 3 shows the graph representation of the queries of Figure 2. Notice that a query graph can be disconnected, e.g., query q4 in Figure 3(d). In the following, we identify queries with their graph representation.

[Figure 3: Graph representation of queries: (a) q1, (b) q2, (c) q3, (d) q4]

The user can flexibly specify the structure of a path in a query fully, partially, or not at all.

Semantics. The answer of a partial path query on an XML tree is a set of tuples. Each tuple consists of tree nodes that lie on the same path and preserve the child and descendant relationships of the query. More formally:

Definition 3.2. An embedding of a partial path query Q into an XML tree T is a mapping M from the nodes of Q to nodes of T such that: (a) any node a_i in Q is mapped by M to a node of T labelled by a, and node r in Q is mapped by M to the root of T; (b) the nodes of Q are mapped by M to nodes that lie on the same path in T; (c) for every a_i/b_j (resp. a_i//b_j) in Q, M(b_j) is a child (resp. descendant) of M(a_i) in T. We call image of Q under an embedding M, denoted M(Q), a tuple that comprises all the images of the nodes of Q under M.
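The region encoding of Section 3.1 can be illustrated with a short Python sketch. It is not taken from the paper: the nested-tuple tree format and the function names are assumptions made for this example, and the sample tree is the one read off Figure 1. The numbering rule (a leaf gets the same begin and end value, an internal node gets its end value after its last child) is inferred from the triplets shown in that figure.

```python
# Hypothetical tree format: (label, [children]). The tree below mirrors Figure 1.
TREE = ("r", [
    ("a", [("b", [])]),
    ("a", [
        ("c", [("a", [("d", [])]), ("a", [("e", [])]), ("a", [("f", [])])]),
        ("g", []),
    ]),
])


def region_encode(tree):
    """Return (label, (begin, end, level)) pairs in document order."""
    encoded = []
    counter = 0

    def visit(node, level):
        nonlocal counter
        label, children = node
        counter += 1                 # first visit of the node
        begin = counter
        index = len(encoded)
        encoded.append(None)         # reserve the slot to keep document order
        for child in children:
            visit(child, level + 1)
        if children:
            counter += 1             # last visit of an internal node
        encoded[index] = (label, (begin, counter, level))

    visit(tree, 1)
    return encoded


def is_ancestor(n1, n2):
    """n1 is an ancestor of n2 iff n1.begin < n2.begin and n2.end < n1.end."""
    return n1[0] < n2[0] and n2[1] < n1[1]


def is_parent(n1, n2):
    """n1 is the parent of n2 iff it is an ancestor exactly one level above."""
    return is_ancestor(n1, n2) and n1[2] == n2[2] - 1


ENCODED = region_encode(TREE)
```

With this encoding, both structural tests reduce to a constant number of integer comparisons, which is what the stack-based algorithms rely on.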
Definition 3.3. The answer of Q on T is the set of the images of Q under all possible embeddings of Q to T.

Consider, for instance, query q3 in Figure 3(c). The answer of q3 on the XML tree of Figure 1 consists of two tuples, whose nodes have the positions { (5,18,2), (6,16,3), (11,11,5) } and { (6,16,3), (10,12,4), (11,11,5) }.

Notice that a query may include two distinct nodes a_i and a_j, e.g., c1 and c2 in query q4 in Figure 3(d). The images of two such nodes under an embedding may coincide unless this is prevented by the structural relationships of the query.
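Definitions 3.2 and 3.3 can be read as a brute-force procedure: try every label-compatible assignment of query nodes to tree nodes and keep those that lie on one path and satisfy all relationships. The Python sketch below does exactly that. It only illustrates the semantics and is not one of the paper's algorithms. The function names and the query representation are assumptions, the node list is read off Figure 1, and the example query (a1//e1 and c1//e1) is hypothetical.

```python
from itertools import product

# (label, (begin, end, level)) pairs as read from Figure 1.
NODES = [
    ("r", (1, 19, 1)), ("a", (2, 4, 2)), ("b", (3, 3, 3)),
    ("a", (5, 18, 2)), ("c", (6, 16, 3)), ("a", (7, 9, 4)),
    ("d", (8, 8, 5)), ("a", (10, 12, 4)), ("e", (11, 11, 5)),
    ("a", (13, 15, 4)), ("f", (14, 14, 5)), ("g", (17, 17, 3)),
]


def is_ancestor(n1, n2):
    return n1[0] < n2[0] and n2[1] < n1[1]


def is_parent(n1, n2):
    return is_ancestor(n1, n2) and n1[2] == n2[2] - 1


def on_same_path(images):
    """Condition (b): every two images coincide or are ancestor and descendant."""
    return all(a == b or is_ancestor(a, b) or is_ancestor(b, a)
               for a in images for b in images)


def answer(labels, child, desc, nodes=NODES):
    """labels maps query nodes to labels; child and desc are sets of node pairs."""
    variables = sorted(labels)
    candidates = [[t for lab, t in nodes if lab == labels[v]] for v in variables]
    results = []
    for combo in product(*candidates):          # condition (a): labels match
        m = dict(zip(variables, combo))
        if (on_same_path(combo)
                and all(is_parent(m[x], m[y]) for x, y in child)
                and all(is_ancestor(m[x], m[y]) for x, y in desc)):
            results.append(m)
    return results


# Hypothetical query: a1 and c1 are both ancestors of e1, in no fixed order.
QUERY_LABELS = {"a1": "a", "c1": "c", "e1": "e"}
QUERY_DESC = {("a1", "e1"), ("c1", "e1")}
RESULT = answer(QUERY_LABELS, child=set(), desc=QUERY_DESC)
```

For this hypothetical query the sketch finds two embeddings, one with a1 above c1 and one with a1 below c1, which shows how a single partial path query covers several node orders. The enumeration is exponential in the number of query nodes, which is the cost the stack-based algorithms of Section 5 are designed to avoid.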

Query q1 is the only partial path query in Figure 3 which is also a mere path query, since the structural relationships in the query induce a total order for the query nodes. Query q2 is syntactically similar to a tree-pattern query (twig). However, the semantics is different: when query q2 is a partial path query, the images of the query nodes should all lie on the same path in the XML tree.

A partial path query may contain more than one source node (i.e., a node without incoming edges). Since, by assumption, every XML tree is rooted at a node labelled by r, we can add a node r (if not already there) and double arrows from it to every source node of a query without altering its meaning. This way, every query can be represented as a rooted directed graph. Figure 4 shows the four queries of Figure 3 after this transformation.

[Figure 4: Queries with root node: (a) q1, (b) q2, (c) q3, (d) q4]

4. QUERY PROCESSING

As the structure of the path is partially specified in partial path queries, new structural relationships may be inferred from those explicitly specified in a query. Further, unlike path queries, partial path queries may be unsatisfiable and have redundant nodes. Derived structural relationships are necessary in detecting unsatisfiable queries and redundant nodes. In this section, we address these issues and we show how a query can be processed and put in a canonical form, which is convenient for evaluation.

4.1 Structural Relationship Inference

Consider query q4 in Figure 4. Since the same node is a parent of c2 and an ancestor of a3, we can infer that c2 is an ancestor of a3 as well. Indeed, since c2 is a child of that node, a3 cannot be placed between that node and c2. Next we formalize the inference of structural relationships.

Figure 5: Inference rules
(IR1) ⊢ r//a_i
(IR2) x/y ⊢ x//y
(IR3) x//y, y//z ⊢ x//z
(IR4) x/a_i, x//b_j ⊢ a_i//b_j
(IR5) a_i/x, b_j//x ⊢ b_j//a_i
(IR6) x/y, y/w, x//z, z//w ⊢ x/z
(IR7) x/y, x//z, w/z, w//y ⊢ x/z
(IR8) x/y, y/w, x/z ⊢ z/w
(IR9) x//y, y//w, x/z ⊢ z//w
(IR10) x/y, x/z, w/z ⊢ w/y
(IR11) x//y, x/z, w//z ⊢ w//y
(IR12) x/y, y/w, z/w ⊢ x/z
(IR13) x//y, y//w, z/w ⊢ x//z

Definition 4.1.
A structural relationship p is derived from a query Q iff for every embedding M of Q to any XML tree, M satisfies p. The closure of Q is the set that comprises all the structural relationships that can be derived from Q.

In order to characterize the derivation of structural relationships and compute closures of queries, we introduce a set of inference rules shown in Figure 5. Let a_i and b_j be query nodes, and x, y, z, and w be variables ranging over query nodes. Recall that r denotes the root node of a query. We use the symbol ⊢ to denote that the relationships that precede it infer the relationship that follows it. The absence of expressions preceding ⊢ denotes an axiom.

The next theorem states that the inference rules correctly and completely characterize the derivation of structural relationships. Let Q be a query, and p be a structural relationship not in Q. A set of inference rules is sound if whenever p can be produced from Q using the inference rules, p can also be derived from Q. It is complete if whenever p can be derived from Q, p appears in Q or can be produced from Q using the inference rules.

Theorem 4.1. The set of inference rules of Figure 5 is sound and complete.

Based on the closure of a query, we define the full form of a query.

Definition 4.2. A query is in full form if it is equal to its closure.

Clearly, the number of structural relationships in the closure of a query is, in the worst case, quadratic in the number of its nodes. In practice, only a small percentage of these relationships appears in the closure of the query. Since a query is usually much smaller than the data, the cost of computing its closure is insignificant.

4.2 Query Satisfiability

Detecting an unsatisfiable query saves execution time at a small overhead. It prevents accessing the data to get an empty answer.

Definition 4.3. A partial path query is called satisfiable iff it has a non-empty answer on some XML tree. Otherwise, it is called unsatisfiable.

In contrast to path queries, partial path queries can be unsatisfiable. Consider, for instance, the query q5 of Figure 6(a).
Clearly, this query is unsatisfiable since no XML tree path can satisfy all four structural relationships in it. The following theorem provides necessary and sufficient conditions for query satisfiability.

Theorem 4.2. A partial path query is unsatisfiable iff its full form comprises a trivial cycle, i.e., two structural relationships of the form a//b and b//a.

Consider the queries q5 and q6 of Figure 6. These queries are unsatisfiable. One can see that the full form of both queries comprises trivial cycles. For instance, both comprise a trivial cycle made of two opposite descendant relationships between the same pair of nodes. Checking query satisfiability amounts to checking the full form of the query for trivial cycles. This is, in the worst case, quadratic in the number of the query nodes. Given that the size of a query is not expected to be comparable to the size of the XML database, the cost of checking query satisfiability is insignificant.
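To make the closure computation and the trivial-cycle test concrete, here is a minimal Python sketch. It is deliberately partial: it applies only rules IR2 to IR5 until a fixpoint is reached, so it computes a subset of the closure and can miss unsatisfiable queries that need the remaining rules. The node naming convention (a label followed by digits, such as "c2"), the function names, and the example queries are assumptions of this sketch, not the paper's.

```python
from itertools import product


def label(node):
    """Assumed naming convention: a node is its label followed by digits, e.g. 'c2'."""
    return node.rstrip("0123456789")


def partial_closure(child, desc):
    """Apply IR2-IR5 to a fixpoint; returns the child set and the grown descendant set."""
    child, desc = set(child), set(desc)
    while True:
        new = set(child)                                        # IR2: x/y gives x//y
        new |= {(x, z) for (x, y), (y2, z) in product(desc, desc)
                if y == y2}                                     # IR3: x//y, y//z gives x//z
        new |= {(a, b) for (x, a), (x2, b) in product(child, desc)
                if x == x2 and label(a) != label(b)}            # IR4: x/a, x//b gives a//b
        new |= {(b, a) for (a, x), (b, x2) in product(child, desc)
                if x == x2 and label(a) != label(b)}            # IR5: a/x, b//x gives b//a
        if new <= desc:
            return child, desc
        desc |= new


def has_trivial_cycle(desc):
    """Theorem 4.2 test on the derived relationships: both a//b and b//a are present."""
    return any((b, a) in desc for (a, b) in desc)


# Hypothetical query: b1 is the parent of c2 and an ancestor of a3.
_, DERIVED = partial_closure(child={("b1", "c2")}, desc={("b1", "a3")})

# Hypothetical unsatisfiable query: b1 has two children with different labels.
_, CLASH = partial_closure(child={("b1", "c2"), ("b1", "d1")}, desc=set())
```

In the first example the sketch derives c2//a3, matching the reasoning at the start of Section 4.1. In the second, IR4 produces both c2//d1 and d1//c2, so the trivial-cycle test reports the query as unsatisfiable: two differently labelled children of one node cannot lie on a single path.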

[Figure 6: Two unsatisfiable queries: (a) q5, (b) q6]

4.3 Redundant Nodes in Queries

Some nodes in a query can be removed without affecting the meaning of the query. We call these nodes redundant:

Definition 4.4. A node in a partial path query is redundant iff in any tuple of any answer of the query it has the same value as another (not necessarily the same) node of the query.

Redundant nodes can be detected based on the following theorem:

Theorem 4.3. A node w in a partial path query is redundant iff the full form of the query comprises one of the following sets of structural relationships:
(a) x/w and x/y, where x and y are query nodes and w and y have the same label.
(b) w/x and y/x, where x and y are query nodes and w and y have the same label.
(c) x/y_1, y_1/y_2, ..., y_k/z, x//w, w//z, k ≥ 1, where x, y_1, ..., y_k, z, and w are query nodes and the label of w is the same as the label of one of y_1, ..., y_k.

Figures 7(a), 7(b), and 7(c) graphically display the three conditions of Theorem 4.3.

[Figure 7: Query patterns with redundant node w, panels (a), (b), and (c)]

Clearly, identifying redundant nodes in a query can be performed efficiently.

4.4 Canonical Form of Queries

For query evaluation purposes, it is convenient to introduce a normal form for queries called canonical form.

Definition 4.5. A partial path query Q is in canonical form iff its set P of structural relationships contains exactly all the structural relationships of the closure of P except those that can be inferred by P using inference rules IR2 and IR3.

Since a satisfiable query does not comprise cycles in its full form, it has a unique canonical form. This canonical form can be represented as a rooted directed acyclic graph. Queries q1, q2, and q3 of Figure 3 are already in canonical form. Figure 8(a) repeats query q4 of Figure 3 and Figure 8(b) shows its canonical form.
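One possible reading of Definition 4.5 in code: starting from a query that is already in full form, drop every descendant relationship that IR2 or IR3 would reproduce from the rest. The Python sketch below follows that reading. The function name and the pair-based representation are assumptions, and the input is assumed to be the full form of a satisfiable query, which the sketch does not check.

```python
def canonical_form(child, desc):
    """Drop the descendant relationships implied by IR2 or IR3.

    child and desc are sets of (ancestor, descendant) pairs taken from the full form.
    """
    child, desc = set(child), set(desc)
    nodes = {n for edge in child | desc for n in edge}
    kept = {
        (x, y) for (x, y) in desc
        if (x, y) not in child                                   # implied by IR2
        and not any((x, z) in desc and (z, y) in desc
                    for z in nodes)                              # implied by IR3
    }
    return child, kept


# Hypothetical full form: r//a, a/b, plus the implied a//b and r//b.
CHILD, DESC = canonical_form(
    child={("a", "b")},
    desc={("r", "a"), ("a", "b"), ("r", "b")},
)
```

Child relationships are never removed, since IR2 and IR3 only produce descendant relationships. For an acyclic full form this amounts to a transitive reduction of the descendant edges, which is why the result does not depend on the order of removal.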
[Figure 8: (a) query q4 and (b) its canonical form]

Computing a canonical form for a query can be done efficiently by removing from its full form, in any order, edges that can be inferred from other edges using IR2 or IR3, until no more edges can be removed.

In the following, we assume that queries are satisfiable, in canonical form, and without redundant nodes. A notable feature of this representation is that there is a topological ordering of the nodes of a query that satisfies its structural relationships (both child and descendant). We exploit this feature in the next section in designing the PartialPathStack algorithm.

5. PARTIAL PATH QUERY EVALUATION ALGORITHMS

In this section, we present two stack-based algorithms for the evaluation of partial path queries: PartialMJ and PartialPathStack.

5.1 Preliminaries

Let q be a partial path query in canonical form and n be a node in q. Function nodes(q) returns all nodes of q. Function isroot(n) returns true if n does not have incoming edges in q, and false otherwise. Function issink(n) returns true if n does not have outgoing edges in q, and false otherwise. Function parents(n) returns all nodes in q with outgoing edges to n.

Each query node n labelled by l is associated with a stream T_n of all nodes (in positional representation) labelled by l in the XML tree. To sequentially access the nodes in T_n, we maintain a cursor C_n, initially pointing to the first node in T_n. For simplicity, C_n may alternatively refer to the node pointed to by cursor C_n in T_n. Operation advance(C_n) moves C_n to the next node in T_n. Function eos(C_n) returns true if C_n has reached the end of T_n. C_n.begin denotes the begin field in the positional representation of node C_n (see Section 3.1).

A stack S_n is associated with each query node n. In the case of algorithm PartialMJ, each entry of S_n is a pair of a node from stream T_n and a pointer to an entry in the stack of a parent of n in the query.
In the case of algorithm PartialPathStack, each entry of S_n is a pair of a node from stream T_n and a set of pointers to entries in the stacks of all the parents of n in the query. Function empty(S_n) returns true if stack S_n is empty, and false otherwise. Operation push(S_n, entry) pushes entry on top of stack S_n. Operation pop(S_n) pops the top entry from stack S_n. Functions bottom(S_n) and top(S_n) return the position of the bottom and top entry in stack S_n, respectively.

At every point during the execution of the algorithms (a) each node in a stack entry is a descendant in the XML tree of all nodes in the entries below it, and (b) all nodes in a stack lie on the same root-to-leaf path in the XML tree.

5.2 Algorithm PartialMJ

Given a partial path query, algorithm PartialMJ extracts a spanning tree of the query graph. Then, it finds matches for all root-to-leaf paths of the spanning tree in the XML tree by using an extension of the path matching algorithm PathStack [2]. The results for each path of the spanning tree are tuples produced in a sorted root-to-leaf order in the XML tree. These tuples are merge-joined by guaranteeing that (a) they lie on the same path in the XML tree, and (b) they satisfy the structural relationships that appear in the query graph and not in the spanning tree.

Figure 9(b) shows the graph of a query q and Figure 11(a) shows a spanning tree q_s of q. Edge c//d of q is missing from q_s. Any two results from the two root-to-leaf paths of q_s that are on the same path of the XML tree can be merged to produce a result for q if they satisfy the identity conditions on r and a and the structural condition c//d (see Figure 11(b)).
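The first step of PartialMJ, splitting the query dag into a spanning tree q_s and the set E of leftover edges, can be sketched in Python as follows. The text does not say how the spanning tree is chosen, so the breadth-first choice here is an assumption, as are the function name and the example query. Edge types (child or descendant) are ignored because they do not affect the split.

```python
from collections import deque


def spanning_tree(root, edges):
    """Return (parent map of a spanning tree, set of query edges left out of it)."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)

    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in children.get(u, []):
            if v not in parent:          # the first edge reaching v becomes its tree edge
                parent[v] = u
                queue.append(v)

    tree_edges = {(p, v) for v, p in parent.items() if p is not None}
    return parent, set(edges) - tree_edges


# Hypothetical query dag in which node d has two parents, a and c.
QUERY_EDGES = [("r", "a"), ("a", "b"), ("b", "c"), ("a", "d"), ("c", "d"), ("d", "e")]
PARENT, LEFTOVER = spanning_tree("r", QUERY_EDGES)
```

In this hypothetical query the spanning tree has two root-to-leaf paths, ending at c and at e, and the only leftover edge is (c, d). That edge is then checked as a join condition when the path results are merged.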
[Figure 9: Example of a tree path and a query. (a) XML tree; (b) query q.]

q: partial path query
q_s: a spanning tree of q
E: the set of edges in q which do not appear in q_s

Algorithm PartialMJ
01 while (¬end)
02     n = getNextQueryNode
03     cleanStacks(C_n)
04     if (isRoot(n) or ∀m ∈ parents(n): ¬empty(S_m))
05         moveToStack(n)
06         if (isLeaf(n))
07             showResultsWithBlocking(S_n, top(S_n))
08     advance(C_n)
09 joinPathResults

Function end
    return ∀n ∈ nodes(q): isSink(n) ⇒ eos(C_n)

Function getNextQueryNode
    return n ∈ nodes(q) such that C_n.begin is minimal

Procedure cleanStacks(C_n)
01 for m in nodes(q)
02     pop all entries in S_m whose nodes are not ancestors of C_n in the XML tree

Procedure moveToStack(n)
01 pt = pointer to top of S_m, where m is the parent of n in q_s
02 push(S_n, (C_n, pt))

Procedure joinPathResults
01 order the root-to-leaf paths of q_s in descending order of the level of their lowest branching node
02 merge-join the results of the root-to-leaf paths of q_s that are on the same path of the XML tree and satisfy the structural relationships in E

Figure 10: Algorithm PartialMJ

Algorithm PartialMJ is shown in Figure 10. In this algorithm, each entry of a stack S_n is a pair of (a) a node from stream T_n and (b) a pointer pt to the entry of its lowest ancestor in the XML tree appearing in S_m, where m is the parent of n in the spanning tree of the query. Function isLeaf(n) returns true if n is a leaf node in q_s, and false otherwise. In lines 01-08, the algorithm scans the streams and finds matches for the root-to-leaf paths in the spanning tree of the query. Line 02 determines the next query node n to be processed. Line 03 pops out of the stacks all nodes that do not lie on the same root-to-leaf path in the XML tree as the stream node C_n currently processed. Stream node C_n is pushed on stack S_n only if the stacks of the parents of node n in the query are not empty (lines 04-05). This way, we avoid stacking and processing stream nodes which do not contribute results to the answer.
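Procedure cleanStacks can be sketched as follows. This is a minimal sketch that assumes the positional representation is a (begin, end, level) region encoding in which a is an ancestor of d exactly when a.begin < d.begin and d.end < a.end; the details of Section 3.1 are not reproduced here, and the names `Node`, `is_ancestor` and `clean_stacks` are hypothetical. Because every stack entry is a descendant of the entries below it, popping from the top until an ancestor of the current stream node is found removes exactly the entries that are not its ancestors.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Assumed region encoding of an XML node."""
    begin: int
    end: int
    level: int


def is_ancestor(a: Node, d: Node) -> bool:
    """True if a is a proper ancestor of d under the assumed encoding."""
    return a.begin < d.begin and d.end < a.end


def clean_stacks(stacks: dict, current: Node) -> None:
    """Pop, from every stack, the entries that are not ancestors of current.

    Each stack is a Python list whose last element is the top. Entries lower
    in a stack are ancestors of those above, so we can stop at the first
    ancestor of current that we meet.
    """
    for stack in stacks.values():
        while stack and not is_ancestor(stack[-1], current):
            stack.pop()


# Hypothetical state: two query nodes "a" and "b" with their stacks.
stacks = {
    "a": [Node(1, 10, 1), Node(2, 5, 2)],
    "b": [Node(3, 4, 3)],
}
clean_stacks(stacks, Node(6, 9, 2))
```

In this example only the entry (1, 10, 1) survives, since the other two regions end before the current node begins and therefore lie on a different root-to-leaf path.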
When we push a node C_n on stack S_n, we also add a pointer to the top entry in stack S_m, where m is the parent of n in the spanning tree of the query. Line 06 checks if node n is a leaf in the spanning tree. If this is the case, line 07 calls procedure showResultsWithBlocking to produce the results for the path of the spanning tree ending at node n. These results must be sorted in root-to-leaf order in the XML tree so that they can be easily merge-joined to compute results for the query. For this reason, procedure showResultsWithBlocking uses a blocking technique to produce results for a path, similar to procedure showSolutionsWithBlocking [2]. The results for all paths of the spanning tree are merge-joined in line 09. This join involves checking that (a) the results are on the same path in the XML tree, (b) matchings for the common nodes of the paths in the query are identical, and (c) structural relationships in the query that do not appear in the spanning tree are satisfied. All these conditions can be checked in a straightforward way using the positional representation of the nodes in the XML tree.

Figure 11(c) shows the state of the stacks after the evaluation of query q of Figure 9(b) on the single-path XML tree of Figure 9(a). Figure 11(a) shows the spanning tree q_s of q used in the evaluation of q. Since the structural relationship c//d of q does not appear in q_s, there are no pointers from stack S_d to stack S_c. Each of the two root-to-leaf paths of q_s has two results. Of the four possible pairs of results of the two paths, only two can be merge-joined; they are shown in Figure 11(d).

[Figure 11: PartialMJ example. (a) Spanning tree q_s; (b) join conditions on the spanning tree q_s; (c) stacks; (d) results.]

Child relationships. The algorithm presented in Figure 10 is designed for the evaluation of queries that do not include child relationships. In the presence of child relationships, two changes need to be made. First, whenever a node C_b from stream T_b is processed, and a/b is a child relationship in the query, C_b is pushed on stack S_b only if its parent node in the XML tree appears in (the top position of) stack S_a. Second, in the computation of the results of a root-to-leaf path of the spanning tree of the query (using procedure showResults), a node in stack S_b appears only in results for this path that also include from S_a its parent in the XML tree.

Analysis of PartialMJ. Algorithm PartialMJ fills the stacks in a single pass over the input streams. Further, it uses procedure showResults to produce results for every root-to-leaf path in the spanning tree of the query. This procedure is shown to be asymptotically optimal for the evaluation of path queries [2]. However, there may be combinations of results from the root-to-leaf paths in the spanning tree that cannot be merged to form a result of the query. We call these combinations intermediate results. Because of the intermediate results, the algorithm is not asymptotically optimal. Clearly, if the query is a path-pattern query, algorithm PartialMJ is asymptotically optimal.

5.3 Algorithm PartialPathStack

To overcome the problem of intermediate results of PartialMJ, we developed a novel holistic stack-based algorithm for the evaluation of partial path queries. In contrast to PartialMJ, PartialPathStack does not decompose a query into paths, but tries to match the query graph to an XML tree as a whole.
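To make the cost of intermediate results concrete, the following sketch joins the results of two root-to-leaf paths of a spanning tree and counts how many candidate pairs are examined versus how many survive. Everything in it is an assumption made for illustration: the paths (r, a, d) and (r, c), the extra edge c//d, the node data, and the names `Node`, `on_same_path` and `join_paths` are hypothetical, nodes are assumed to carry a (begin, end) region encoding, and a nested loop stands in for the merge-join to keep the example short.

```python
from itertools import product
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    begin: int
    end: int


def is_ancestor(a: Node, d: Node) -> bool:
    return a.begin < d.begin and d.end < a.end


def on_same_path(nodes) -> bool:
    """Nodes lie on one root-to-leaf path iff, sorted by begin, they form a chain."""
    chain = sorted(set(nodes), key=lambda x: x.begin)
    return all(is_ancestor(x, y) for x, y in zip(chain, chain[1:]))


def join_paths(left, right, shared, extra_edges):
    """Combine path results; return (query results, number of pairs examined)."""
    merged, candidates = [], 0
    for lt, rt in product(left, right):
        candidates += 1
        if any(lt[q] != rt[q] for q in shared):  # identity conditions
            continue
        combined = {**lt, **rt}
        if not on_same_path(combined.values()):  # same path of the XML tree
            continue
        if all(is_ancestor(combined[p], combined[c]) for p, c in extra_edges):
            merged.append(combined)  # edges of E are satisfied
    return merged, candidates


# Hypothetical single-path XML tree: r1 / a1 / c1 / d1 / c2 / d2.
r1, a1 = Node("r1", 1, 12), Node("a1", 2, 11)
c1, d1 = Node("c1", 3, 10), Node("d1", 4, 9)
c2, d2 = Node("c2", 5, 8), Node("d2", 6, 7)

left = [{"r": r1, "a": a1, "d": d1}, {"r": r1, "a": a1, "d": d2}]
right = [{"r": r1, "c": c1}, {"r": r1, "c": c2}]
merged, candidates = join_paths(left, right, shared=["r"], extra_edges=[("c", "d")])
```

Here four pairs are examined and one of them (d1 with c2) is discarded because c2 is not an ancestor of d1. On larger inputs the number of discarded pairs can dominate the running time, which is the overhead a holistic algorithm avoids.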
The key feature of algorithm PartialPathStack is that it employs a topological order of the query nodes, i.e., a linear ordering of nodes which respects the partial order induced by the structural relationships of the query. Algorithm PartialPathStack is shown in Figure 12. It manages streams and stacks in the same way as PartialMJ. The only difference is that, in the case of PartialPathStack, each entry in a stack S_n is a pair of a node from stream T_n and a set of pointers to entries in the stacks of all the parents of n in the query.

Whenever a stream node C_n of a query sink node n is pushed on a stack, the algorithm checks whether results can be generated. Output is produced in reverse topological order, so that the stack of a query node is processed after the stacks of its child nodes in the query have been processed. To avoid redundantly reproducing results, the algorithm outputs at this point only results that include the stream node C_n. Procedure outputResults combines nodes from all stacks to produce the query results. No other node from the stack S_n of node n is used at this point to produce new results for the query (lines 05-06 of procedure outputResults). In contrast, all nodes from the other sink node stacks can be used to form results (lines 07-09 of procedure outputResults).
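A topological order of the query nodes can be obtained with any standard method. The sketch below uses Kahn's algorithm, which is one possible choice and not necessarily the one used by the authors; the function name `topological_order` and the example query are hypothetical.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return the nodes so that every edge (parent, child) goes left to right."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Start from the nodes without incoming edges (the query roots).
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:  # all parents of c are already placed
                ready.append(c)

    if len(order) != len(nodes):
        raise ValueError("the query graph contains a cycle")
    return order


# Hypothetical query dag in which node d has two parents.
nodes = ["r", "a", "c", "d", "e"]
edges = [("r", "a"), ("r", "c"), ("a", "d"), ("c", "d"), ("d", "e")]
order = topological_order(nodes, edges)
```

Numbering the nodes 1..N by their position in `order` gives the indices used by the algorithm; iterating from N down to 1 then visits every query node after all of its children.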
In procedure outputResults, all nodes from a non-sink node stack can be used if they (or nodes higher in the stack) are pointed to by nodes in the stacks of all the children of that query node (lines 10-15 of procedure outputResults).

q: partial path query with N nodes

Algorithm PartialPathStack
01 extract a topological order 1..N of the query nodes
02 while (¬end)
03     n = getNextQueryNode
04     cleanStacks(C_n)
05     if (isRoot(n) or ∀m ∈ parents(n): ¬empty(S_m))
06         moveToStack(n)
07         if (isSink(n) and ∀m ∈ nodes(q): isSink(m) ⇒ ¬empty(S_m))
08             if (n == N)
09                 outputResults(n, N, top(S_N))
10             else
11                 for i = bottom(S_N) to top(S_N)
12                     outputResults(n, N, i)
13     advance(C_n)

Procedure moveToStack(n)
01 pts = pointers to the tops of the stacks of all parents(n) in q
02 push(S_n, (C_n, pts))

Procedure outputResults(n, m, stackPos)
01 solution[m] = stackPos
02 if (m = 1) // node m is the root of the query
03     output (S_1[solution[1]], ..., S_N[solution[N]])
04 else
05     if (m-1 == n)
06         outputResults(n, n, top(S_n))
07     else if (isSink(m-1))
08         for i = bottom(S_{m-1}) to top(S_{m-1})
09             outputResults(n, m-1, i)
10     else
11         for c in children(m-1)
12             ptFrom[c] = S_c[solution[c]].pts[m-1]
13         maxPos = min {ptFrom[c] : c ∈ children(m-1)}
14         for i = bottom(S_{m-1}) to maxPos
15             outputResults(n, m-1, i)

Figure 12: Algorithm PartialPathStack

Figure 13(b) shows the state of the stacks after the evaluation of query q of Figure 9(b) on the single-path XML tree of Figure 9(a). When a node is pushed on stack S_d, new results that include it can be produced; they are generated following the topological order shown in Figure 13(a). The results of the query are shown in Figure 13(c). To process queries with child relationships, we need to modify the stacking of the nodes and the output of solutions as we did with PartialMJ.

[Figure 13: PartialPathStack example. (a) Topological order; (b) stacks; (c) results.]

Analysis of PartialPathStack. Here, we state the correctness and completeness of PartialPathStack and discuss its complexity. Proofs are omitted due to lack of space. They will be included in the full version of the paper.

Theorem 5.1. Given a partial path query q and an XML tree T, algorithm PartialPathStack correctly returns all the results of q on T.

Let input denote the sum of the sizes of the input streams, output denote the size of the results of q on T, indegree denote the maximum number of incoming edges to a query node, outdegree denote the maximum number of outgoing edges from a query node, and maxpath denote the maximum length of a root-to-leaf path in T.

Theorem 5.2. Algorithm PartialPathStack has worst-case I/O and CPU time complexities O(indegree * input + outdegree * output). The worst-case space complexity of PartialPathStack is O(indegree * min(input, maxpath)).

Based on the previous theorem, PartialPathStack is asymptotically optimal if the indegree and outdegree of the query are bounded by a constant. Clearly, for the case of queries whose graph is a tree, only the outdegree needs to be bounded by a constant. In any case, PartialPathStack does not generate any intermediate results.

6. EXPERIMENTAL EVALUATION

We ran a comprehensive set of experiments to measure the performance of PartialMJ and PartialPathStack. In this section, we report on their experimental evaluation.

Setup.
We evaluated the performance of the algorithms on both benchmark and synthetic data. For benchmark data, we used the Treebank XML document (http://www.cis.upenn.edu/~treebank). Treebank's XML tree consists of around 2.5 million nodes with 250 distinct element tags, and its maximum depth is 36. It also has deep recursive data. The synthetic data are random XML trees. We generated such trees using IBM's AlphaWorks XML generator (www.alphaworks.ibm.com/tech/xmlgenerator). In all the experiments, the parameter MaxRepeats (which determines the maximum number of times a node appears as a child of its parent node) was set to 4, and the parameter numlevels (which determines the maximum number of tree levels) was set to 14. The number of distinct element tags used in all trees was fixed to 11. For each measurement on synthetic data, 10 different XML trees with the same number of nodes were used. Each displayed value in the plots is the average over these 10 measurements.

Figure 14 shows the types of queries used in our experiments. Queries Q_1 to Q_4 include only descendant relationships, while queries Q_5 to Q_8 include child relationships as well. The labels of the query nodes are appropriately modified so that the queries always produce results on the different XML trees used in the experiments. Our query set comprises a full spectrum of partial path queries, from path-pattern queries to non-tree graph queries. We implemented all algorithms in C++, and ran our experiments on a dedicated Linux PC (AMD Sempron 2600+) with 2GB of RAM.

[Figure 14: Types of partial path queries used in the experiments. (a) Q_1 (Q_5); (b) Q_2 (Q_6); (c) Q_3 (Q_7); (d) Q_4 (Q_8).]

Execution time on fixed datasets. We measured the execution time of PartialMJ and PartialPathStack for evaluating all queries in Figure 14 on both Treebank and synthetic data. For queries Q_1 and Q_5, which are path-pattern queries, we also measured the execution time of algorithm PathStack [2]. The synthetic XML trees used in this experiment consist of 2.5 million nodes. Figures 15(a) and 15(b) present the evaluation results. Figure 15(c) shows the number of results obtained per query. PartialPathStack is more efficient than PartialMJ. For queries Q_1 and Q_5, PartialPathStack performs as fast as PathStack. This is expected, since PartialPathStack reduces to PathStack in the case of path-pattern queries.

Execution time varying the input size. We measured the execution time of PartialMJ and PartialPathStack for evaluating queries Q_2, Q_3 and Q_7 of Figure 14 over synthetic XML trees of various sizes. Figures 16, 17 and 18 present the results obtained for XML trees whose node stream sizes vary from 1 to 3 million nodes. In every case, PartialPathStack is more efficient than PartialMJ.

In the experimental evaluation of query Q_2, an increase in the input size results in an increase in the output size (Figure 16(b)). As the input and output sizes grow, the execution time of both PartialMJ and PartialPathStack increases (Figure 16(a)). This confirms the complexity results, which show that the execution time depends on the input and output size. However, the increase in the execution time of PartialMJ is slightly sharper than that of PartialPathStack. The reason is that PartialMJ is also affected by the increase in the number of intermediate results shown in Figure 16(c). In contrast, PartialPathStack is independent of the size of the intermediate results.

In the experimental evaluation of query Q_3, the output size (Figure 17(b)) is comparable to the output size of query Q_2 (Figure 16(b)). The execution time of PartialPathStack for the evaluation of Q_3 (Figure 17(a)) is comparable to the execution time for the evaluation of Q_2 (Figure 16(a)). This again confirms the worst-case complexity results. The number of intermediate results in Q_3 (Figure 17(c)) is larger than the number of intermediate results in Q_2 (Figure 16(c)) for all input sizes used in the experiments. This increase is reflected in the execution time of PartialMJ, which increases more sharply than that of PartialPathStack.

Query Q_7, used in the experiment shown in Figure 18, is more restrictive than query Q_3 since it involves two child relationships not present in Q_3. The number of intermediate results (Figure 18(c)) and the output size (Figure 18(b)) from the evaluation of Q_7 are smaller than those of Q_3, and the same holds for the execution time of both algorithms (Figure 18(a)). The reduction is more pronounced for PartialMJ due to the strong decrease in the number of intermediate results.
In all cases, PartialPathStack largely outperforms PartialMJ.

[Figure 15: PartialMJ vs PartialPathStack for fixed data sets. (a) Execution time (Treebank); (b) execution time (synthetic data). Both panels plot time (sec) per query Q1 to Q8 for PathStack, PartialMJ and PartialPathStack. (c) Number of results per query:]

    Query   Treebank   Synth. data
    Q_1        86360        251112
    Q_2       103313        145591
    Q_3            8        149006
    Q_4      2883598       1178937
    Q_5        13714        130242
    Q_6        12658         73065
    Q_7            2         66179
    Q_8        44233        443691

[Figure 16: PartialMJ vs PartialPathStack for Q_2, varying the size of the XML tree. (a) Execution time (sec) vs input size (number of nodes); (b) number of results vs input size; (c) number of intermediate results vs input size.]

[Figure 17: PartialMJ vs PartialPathStack for Q_3, varying the size of the XML tree. (a) Execution time (sec) vs input size (number of nodes); (b) number of results vs input size; (c) number of intermediate results vs input size.]

[Figure 18: PartialMJ vs PartialPathStack for Q_7, varying the size of the XML tree. (a) Execution time (sec) vs input size (number of nodes); (b) number of results vs input size; (c) number of intermediate results vs input size.]

7. CONCLUSION

We defined a partial path-pattern query language which represents a class of XPath expressions and is useful for querying multiple XML data sources with unknown or different structures. We studied the problem of efficiently evaluating partial path-pattern queries. In order to process partial path-pattern queries, we introduced a set of sound and complete inference rules to characterize structural relationship derivation, we provided necessary and sufficient conditions for detecting query unsatisfiability and node redundancy, and we showed that partial path-pattern queries can be equivalently put in a canonical dag form. We developed two stack-based algorithms for the evaluation of partial path-pattern queries, PartialMJ and PartialPathStack. PartialMJ evaluates a query dag by decomposing it into paths, while PartialPathStack is holistic. An analysis and experimental evaluation showed that PartialPathStack is independent of intermediate results and largely outperforms PartialMJ.

We plan to extend our work by studying partial tree-pattern queries and developing techniques for their evaluation.

8. REFERENCES

[1] S. Al-Khalifa, H. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In Proc. of ICDE, 2002.
[2] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proc. of ACM SIGMOD, 2002.
[3] L. Chen, A. Gupta, and M. E. Kurul. Stack-based algorithms for pattern matching on dags. In Proc. of VLDB, 2005.
[4] S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal, and K. S. Candan. Twig2Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. of VLDB, 2006.
[5] T. Chen, J. Lu, and T. W. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. of SIGMOD, 2005.
[6] S.-Y. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In Proc. of VLDB, 2002.
[7] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSearch: A semantic search engine for XML. In Proc. of VLDB, 2003.
[8] M. P. Consens and T. Milo. Algebras for querying text regions. In Proc. of ACM PODS, 1995.
[9] D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into XML query processing. Computer Networks, 2000.
[10] M. Fontoura, V. Josifovski, E. Shekita, and B. Yang. Optimizing cursor movement in holistic twig joins. In Proc. of CIKM, 2005.
[11] G. Gottlob, C. Koch, and R. Pichler. The complexity of XPath query evaluation. In Proc. of PODS, 2003.
[12] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In Proc. of ICDE, 2003.
[13] H. Jiang, H. Lu, and W. Wang. Efficient processing of XML twig queries with or-predicates. In Proc. of ACM SIGMOD, 2004.
[14] H. Jiang, H. Lu, W. Wang, and B. C. Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. of ICDE, 2003.
[15] H. Jiang, W. Wang, H. Lu, and J. X. Yu. Holistic twig joins on indexed XML documents. In Proc. of VLDB, 2003.
[16] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In Proc. of VLDB, 2004.
[17] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: A look-ahead approach. In Proc. of CIKM, 2004.
[18] J. Lu, T. W. Ling, C.-Y. Chan, and T. Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. of VLDB, 2005.
[19] D. Olteanu. Forward node-selecting queries over trees. ACM Trans. Database Syst., 2007.
[20] D. Olteanu, H. Meuss, T. Furche, and F. Bry. XPath: Looking forward. In Proc. of the XMLDM, MDDE, and YRWS Workshops, 2002.
[21] D. Theodoratos, T. Dalamagas, A. Koufopoulos, and N. Gehani. Semantic querying of tree-structured data sources using partially specified tree patterns. In Proc. of ACM CIKM, 2005.
[22] D. Theodoratos, S. Souldatos, T. Dalamagas, P. Placek, and T. Sellis. Heuristic containment check of partial tree-pattern queries in the presence of index graphs. In Proc. of ACM CIKM, 2006.
[23] Y. Wu, J. M. Patel, and H. V. Jagadish. Structural join order selection for XML query optimization. In Proc. of ICDE, 2003.
[24] B. Yang, M. Fontoura, E. Shekita, S. Rajagopalan, and K. Beyer. Virtual cursors for XML joins. In Proc. of ICDE, 2004.
[25] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In Proc. of ACM SIGMOD, 2001.