Efficient and Scalable Sequence-Based XML Filtering


Mariam Salloum
University of California, Riverside, CA, USA
msalloum@cs.ucr.edu

Vassilis J. Tsotras
University of California, Riverside, CA, USA
tsotras@cs.ucr.edu

Copyright is held by the author/owner. Twelfth International Workshop on the Web and Databases (WebDB 2009), June 28, 2009, Providence, Rhode Island, USA.

ABSTRACT
The ubiquitous adoption of XML as the standard of data exchange over the web has led to increased interest in building efficient and scalable XML publish-subscribe (pub-sub) systems. The central function of an XML-based pub-sub system is to perform XML filtering efficiently, i.e., to identify those XPath expressions that have a match in a streaming XML document. In this paper, we propose a new sequence-based approach, which transforms both XML documents and XPath twig expressions into Node Encoded Tree Sequences (NETS). In terms of this encoding, we provide a necessary and sufficient condition for an XPath twig to represent a match in a given XML document. The proposed filtering procedure is based on a new subsequence matching algorithm devised for NETS, which identifies the set of matched queries free of false positives with a single scan of the XML document. Extensive experimental results show that the NETS method outperforms previous XML filtering approaches.

1. INTRODUCTION
With the increased adoption of XML as the de facto standard for publishing and exchanging information over the web, XML-based pub-sub systems have emerged. In a typical XML-based pub-sub system, users (subscribers) express their interests (profiles) using XML query languages (such as XPath []), while publishers distribute their messages encoded as XML documents. Each published message is matched against user queries so that messages are only delivered to interested users.

Several approaches have been proposed to solve the XML filtering problem; the two main categories are (i) FSM-based and (ii) sequence-based approaches.

Several finite state automata approaches have been proposed. An early work, XFilter [], considered simple path profiles and proposed building an FSM for each distinct profile. The FSM states are then traversed while XML tag events are generated by the parsing of the streaming document. YFilter [][] was a successor of XFilter and proposed building an NFA representation that combines all user profiles into a single machine. This approach yields better results since it exploits the commonality among path expressions. In order to implement twig filtering, FSM-based approaches typically break twig queries into their simple linear paths. This approach, however, requires an expensive post-processing phase to join the results.

In the sequence-based approaches [][], the XML document and profile twigs are transformed into sequences. FiST [] was the first to propose a sequence-based XML filtering system using Prüfer sequence encoding. This approach was shown to be more efficient than automata-based approaches since whole twig profiles are processed at once. Nevertheless, subsequence matching could create false positives and thus a post-processing phase is required to filter them. XFIS [] is another Prüfer-like sequence encoding of XML documents and profile twigs; however, it does not account for tag recursion in the XML data.

In this paper, we propose a new sequence-based approach to solve the XML filtering problem. The key contributions of this paper can be summarized as follows:

We present a new, simple and effective sequence representation for XML documents and query twig patterns, called NETS (Node Encoded Tree Sequence). Using NETS, we present a necessary and sufficient condition for a twig profile to match a given XML document tree.

We present a filtering system for ordered twig pattern matching that employs concurrent subsequence matching of query profiles. Our filtering approach is unique since it does not require a post-refinement phase. The approach guarantees that returned matches are free of false positives (and false negatives) with a single scan of the XML document.
Experimental results show that the proposed approach outperforms previous XML filtering approaches. Performance improvements are achieved through holistic processing of twig patterns and on-the-fly detection of false matches.

We proceed with the description of the NETS encoding in Section 2. In Section 3 we provide the details of our filtering system. Section 4 presents experimental results, while conclusions appear in Section 5.

2. NODE ENCODED TREE SEQUENCE
An XML document is modeled as a rooted ordered labeled tree where each node corresponds to an element tag, attribute, or value, and edges represent structural relationships between nodes. Several sequence representations for labeled trees have been proposed and utilized for XML filtering or query matching [][][]. We present a simple yet efficient sequence representation for XML trees called Node Encoded Tree Sequence (NETS). For each node in a given tree, the NETS of the tree contains two symbols referred to as the start-symbol and the end-symbol. For example, given a tree T with a node labeled x, NETS(T) will contain the start-symbol Sx and the end-symbol Ex for the node labeled x. The Node Encoded Tree Sequence of T, NETS(T), is defined recursively as follows. An example of an XML tree T and its corresponding NETS is shown in Figure 1(a).

Definition 1 (Node Encoded Tree Sequence):
1. If tree T consists of a single node labeled by r, then NETS(T) is Sr Er.
2. Let r be the root of tree T and assume r has m children labeled from left to right as r1, r2, ..., rm. Let sequences S1, S2, ..., Sm be the NETS of the subtrees whose roots are r1, r2, ..., rm, respectively. Then NETS(T) is Sr S1 S2 ... Sm Er.

[Figure 1: NETS Representation of XML Tree and Query Twigs. (a) XML tree T; (b) query twig Q1: /a/b[c]/g; (c) query twig Q2.]
NETS(T) = Sa Sb Sc Ec Sd Ed Eb Sb Sd Ed Sg Eg Sk Ek Eb Ea
NETS(Q1) = Sa Sb Sc Ec Sg Eg Eb Ea
NETS(Q2) = Sa Sb Sc Ec Eb Sb Sg Eg Eb Ea

An XPath query expression can also be modeled as a labeled tree (referred to as a query twig) where each node represents an element or value, and edges denote parent-child ("/") or ancestor-descendant ("//") relationships (see Figure 1(b)(c)). The NETS of a query twig is generated in accordance with Definition 1. Note that when generating the NETS of a twig query, no distinction is made between parent-child and ancestor-descendant edges in the tree. In Section 3.1, we associate additional attributes to the NETS sequence nodes to encode /, //, and *.

We note that NETS is essentially equivalent to SAX tokens; thus it has several properties. For a given label x, the NETS must have an equal number of start-symbols (Sx) and end-symbols (Ex). For every two nodes in the tree with labels x and y, the corresponding segments Sx...Ex and Sy...Ey in the sequence are either disjoint or nested. The start and end symbols of the same node are called corresponding symbols. If a node is labeled by x, the preorder of the corresponding symbols Sx and Ex is defined as preorder(Sx) = preorder(Ex) = preorder(x). Hence, any two start and end symbols in the NETS are said to be corresponding symbols if they have the same preorder number.

The objective of the filtering algorithm is to determine whether a given query twig has a match in an XML tree.
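The recursive definition of NETS and the notion of corresponding symbols can be illustrated with a minimal Python sketch. The `Node` class, the function names, and the small example tree are hypothetical and chosen only for illustration; numbering preorder and level from 0 is likewise an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A node of a rooted, ordered, labeled tree (hypothetical helper type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def nets(root: Node) -> List[str]:
    """NETS(T): S<label>, then the NETS of each child left to right, then E<label>."""
    seq = ["S" + root.label]
    for child in root.children:
        seq.extend(nets(child))
    seq.append("E" + root.label)
    return seq


def annotated_nets(root: Node) -> List[Tuple[str, int, int]]:
    """NETS(T) where every symbol carries (symbol, preorder, level).

    Corresponding start and end symbols share the same preorder number.
    Preorder and level both start at 0 here, which is an assumption.
    """
    out: List[Tuple[str, int, int]] = []
    counter = [0]

    def visit(node: Node, level: int) -> None:
        pre = counter[0]
        counter[0] += 1
        out.append(("S" + node.label, pre, level))
        for child in node.children:
            visit(child, level + 1)
        out.append(("E" + node.label, pre, level))

    visit(root, 0)
    return out


# Hypothetical example tree: a has children b and g; b has a child c.
tree = Node("a", [Node("b", [Node("c")]), Node("g")])
print(" ".join(nets(tree)))   # Sa Sb Sc Ec Eb Sg Eg Ea
print(annotated_nets(tree))
```

Because every start-symbol is emitted before the subtree and the matching end-symbol after it, any two segments in the output are either disjoint or nested, which is the property used in the proofs.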
For a query to be a match, the query twig must be a subgraph (Definition 2) of the XML tree and satisfy the level-consistent property (Definition 3), which we formally define as follows:

Definition 2 (Subgraph): Let T=(V,E) be a labeled tree, where V is the set of all nodes in T and E is the set of all edges in T. For every node n in V, let label(n) denote the label of n and let preorder(n) denote the number associated with the node based on the tree preorder traversal. A labeled tree Q=(V',E') is a subgraph of T if the following two conditions hold:
(1) There is a one-to-one mapping f() from V' into V such that for every node n in V', label(n) = label(f(n)), and for every edge (n1,n2) in E' there is a path from f(n1) to f(n2) in T.
(2) For every two nodes n1 and n2 in Q, if preorder(n1) < preorder(n2) then preorder(f(n1)) < preorder(f(n2)).
While (1) guarantees that ancestor-descendant relationships in Q are 'found' in T, (2) guarantees that the relative order of nodes in Q corresponds to the one in T.

Definition 3 (Level-Consistent): Let Q be a query and subgraph of an XML tree T, and let level(m) denote the level of a node m in T. A query Q satisfies the level-consistent property iff the following condition holds: For every edge (n1,n2) in Q, if the edge is of type //, then level(f(n2)) - level(f(n1)) ≥ 1. Otherwise, if (n1,n2) is of type /, then level(f(n2)) - level(f(n1)) = 1, where n1 and n2 denote nodes in Q and f is the one-to-one mapping described in Definition 2. Note that the root is assigned a level of zero, the root's children are assigned a level of one, etc.

[Figure 2: Q1 is a False Match and Q2 is a Match. The figure aligns NETS(Q1) = Sa Sb Sc Ec Sg Eg Eb Ea and NETS(Q2) = Sa Sb Sc Ec Eb Sb Sg Eg Eb Ea against NETS(T) = Sa Sb Sc Ec Sd Ed Eb Sb Sd Ed Sg Eg Sk Ek Eb Ea.]

The filtering algorithm involves subsequence matching between the NETS sequences of the twig profile and the document, to determine if there is a match. The following theorem states the relation between the query twig and XML tree and their sequence representation.
Theorem 1: Given two rooted labeled trees T and Q, if Q has a match in T then NETS(Q) is a subsequence of NETS(T).

Proof: Since Q has a match in T, Q is a subgraph and consequently there is a one-to-one mapping f from the nodes in V' into the nodes of V such that label(n) = label(f(n)) for every node n in V', and f satisfies the other properties of Definition 2. Thus, each start-symbol and end-symbol in NETS(Q) must appear in NETS(T), but we need to prove that the occurrences of these symbols appear in the same order in both NETS(Q) and NETS(T). Two cases are considered for every two nodes n1 and n2 in Q, whose corresponding start and end symbols are (Sx,Ex) and (Sy,Ey), respectively.

Case 1: There is a path from n1 to n2 in Q. Thus, the segment Sx...Ex in NETS(Q) must contain the segment Sy...Ey such that the order of the symbols in NETS(Q) is Sx..Sy..Ey..Ex. Since f(n1) and f(n2) have the same symbols as n1 and n2 and there is a path from f(n1) to f(n2) in T, the order of the start and end labels of f(n1) and f(n2) in NETS(T) is also Sx..Sy..Ey..Ex.

Case 2: No path exists between n1 and n2. Since there is no path between nodes n1 and n2, the segments Sx...Ex and Sy...Ey in NETS(Q) must be disjoint. Thus, the order of the symbols in NETS(Q) is either Sx..Ex..Sy..Ey or Sy..Ey..Sx..Ex, depending on the preorder numbers of nodes n1 and n2. Since Q is a subgraph of T, preorder(n1) < preorder(n2) iff preorder(f(n1)) < preorder(f(n2)); consequently, the symbols in NETS(T) must appear in the same order.

Given trees Q and T, if we enumerate all possible subsequences of NETS(T) that match NETS(Q), then we are guaranteed to report all matches with no false dismissals. However, we note that the result may contain false positives. Consider the XML tree T and query twigs Q1 and Q2 shown in Figure 1. Figure 2 shows the NETS of T, Q1 and Q2. NETS(Q1) and NETS(Q2) are both subsequences of NETS(T). However, we note that Q1 is a false match because it is not a subgraph of tree T.
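The gap between Theorem 1 and a true match can be seen with a plain subsequence test. The sketch below is only an illustration: `is_subsequence` and the three sequences are hypothetical, built from a small tree in which a has two b children, the first holding c and the second holding g.

```python
from typing import Iterable, Sequence


def is_subsequence(query: Sequence[str], doc: Iterable[str]) -> bool:
    """True if the symbols of `query` occur in `doc` in the same relative order."""
    it = iter(doc)
    # `sym in it` advances the iterator until sym is found (or it is exhausted),
    # so every query symbol must be found after the previous one.
    return all(sym in it for sym in query)


# Hypothetical document tree: a( b(c), b(g) )
doc = ["Sa", "Sb", "Sc", "Ec", "Eb", "Sb", "Sg", "Eg", "Eb", "Ea"]

# Twig a( b(c, g) ): no single b in the document has both c and g as children.
q_false = ["Sa", "Sb", "Sc", "Ec", "Sg", "Eg", "Eb", "Ea"]

# Twig a( b(c), b(g) ): a genuine match.
q_true = ["Sa", "Sb", "Sc", "Ec", "Eb", "Sb", "Sg", "Eg", "Eb", "Ea"]

# Twig a( g, b ): wrong sibling order, so not even a subsequence.
q_none = ["Sa", "Sg", "Eg", "Sb", "Eb", "Ea"]

print(is_subsequence(q_false, doc))  # True, although the twig does not match
print(is_subsequence(q_true, doc))   # True
print(is_subsequence(q_none, doc))   # False
```

The first query passes the test only because its Sb is taken from the first b node and its Eb from the second one, which is the kind of false positive that plain subsequence matching cannot rule out.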
As we proceed, we shall prove a necessary and sufficient condition for a query to have a match in a given XML document.

Definition 4 (Tree Subsequence): Let T be a NETS labeled tree. A subsequence S of NETS(T) is called a tree subsequence if the following two conditions hold:

1. If Sx (Ex) is in S, then the corresponding symbol Ex (Sx) is also in S, i.e., if Sx (Ex) is in S then the Ex (Sx) with the same preorder number is also in S.
2. S = Sr ... Er, where Sr and Er are corresponding symbols.

Reconsider the XML tree and query twigs shown in Figure 1. The XML tree nodes are numbered with a preorder traversal of the tree so that each node has a unique number (see Figure 2). When NETS(Q1) is matched with NETS(T), we note the matched subsequence is not a tree subsequence and hence Q1 is not reported as a match. The matched subsequence for Q2 is a tree subsequence and hence Q2 is reported as a match.

Definition 5 (Level of Start and End Symbols): Given two rooted labeled trees Q and T, let x be a node in T. Then Level(Sx) = Level(Ex) = Level(x), where Sx and Ex are the corresponding start and end labels of x.

Theorem 2: Given two rooted labeled trees Q and T, Q has a match in T iff there is a tree subsequence S of NETS(T) such that:
1. S = NETS(Q).
2. For every edge (n1,n2) in Q the following property holds: if (n1,n2) is of type /, then level(Sy) - level(Sx) = 1; if (n1,n2) is of type //, then level(Sy) - level(Sx) ≥ 1, where n1 and n2 are labeled tree nodes in Q and Sx and Sy are the corresponding start symbols in S.

Proof: Necessary Condition - Assume that query Q has a match in XML document T. By Theorem 1, NETS(Q) is a subsequence of NETS(T). Let S be a subsequence of NETS(T) that corresponds to NETS(Q). Since there is a one-to-one mapping f from the nodes of Q into the nodes of T that satisfies the properties stated in the definition of subgraph, S satisfies the properties of a tree subsequence. Also, since query Q has a match in T, Q satisfies the level-consistent property. It is easy to show that the second condition of Theorem 2 follows from the level-consistent property.

Sufficient Condition - Let Q be a query such that there is a tree subsequence S of NETS(T) that satisfies the two conditions of Theorem 2. To prove that Q has a match in T, we first prove that Q is a subgraph of T, i.e.
we define a 1-1 mapping f from the nodes of Q into the nodes of T, which satisfies the properties stated in the subgraph definition. Let n be a node of Q with label x. The mapping f(n) is defined as follows: Since S = NETS(Q), the start and end symbols of node n must appear in S. As we progress, we will show that these two symbols in S are the start and end symbols of the same node in T (i.e., they are not symbols of different nodes in T). This node of T will be denoted as f(n). Two cases need to be considered:

Case 1: There is only one node of Q with label x. Thus, tree subsequence S contains only one Sx and one Ex, and they must be the start and end symbols of the same node in T, because they have the same preorder number.

Case 2: There is more than one node in Q with label x. Without loss of generality, we assume that there are only two nodes (n and m) of Q with label x and preorder(n) < preorder(m). Thus, the symbols of n and m in NETS(Q) must appear in one of the following orders:
Sx(n) Ex(n) Sx(m) Ex(m)   or   Sx(n) Sx(m) Ex(m) Ex(n)
Since S = NETS(Q), these symbols also appear in S. Since S is a tree subsequence, there are only two Sx and two Ex in S and they must appear in the same order as above. These symbols correspond to two nodes in T, denoted by Tm and Tn. We will prove by contradiction that the Sx and Ex of node n in Q correspond to either the Sx and Ex of Tm or the Sx and Ex of Tn. Suppose that the Sx and Ex of n correspond to the Sx of Tn and the Ex of Tm, respectively, as shown below:
Sx(Tn) Ex(Tm) Sx(Tm) Ex(Tn)   or   Sx(Tn) Sx(Tm) Ex(Tn) Ex(Tm)
The first case can be excluded because the sequence is not a valid NETS, since the Ex of Tm precedes its corresponding Sx in the NETS. The second case can also be excluded because the property of NETS requires that the segments Sx...Ex of nodes Tm and Tn must either be disjoint or nested. In this case, the two segments overlap and thus it is not a possible NETS. Thus, the Sx and Ex of n in Q correspond to the Sx and Ex of the same node f(n) in T. It is easy to verify that f(n) satisfies the two properties stated in the subgraph definition. Thus, Q is a subgraph of T.
The second condition of Theorem 2 can be used to verify that Q satisfies the level consistency property; as a result, Q is a match of T.

3. FILTERING SYSTEM
In this section, we first describe how to encode wildcards, / and // for the query twig NETS. We proceed to describe the core data structures utilized and the filtering algorithm.

3.1 NETS Query Encoding and Data Structures
The NETS representation of query profiles is generated by encoding a start and end symbol for each node in the tree as presented in Definition 1. For each start-symbol in the NETS, an additional attribute referred to as relationship is encoded to specify the relationship between a node and its parent node. For example, if the relation is parent-child (/), then the attribute shall be =1, specifying that the two nodes must be one level apart. If the relation is ancestor-descendant (//), then the attribute shall be ≥1, specifying that the two nodes must be at least one level apart.

Wildcards (*) are handled differently depending on the occurrence of the * in the query. If the wildcard operator appears as a branch node in the twig (see Q1 in Figure 3), then the * is encoded as a regular node. Otherwise, if it is a non-branch node (see Q2 in Figure 3), then the next non-wildcard node is encoded in the NETS, and the occurrence of the wildcard is reflected by the relationship attribute. The encoded NETS of a query twig is referred to as the query sequence. To support value-based predicates, the query sequence can be augmented to include an additional attribute that specifies the predicate value and operator {=, <, >, ≤, ≥}.

Example 1: Consider the query twigs shown in Figure 3. Q1 contains a branch wildcard node; thus the * is encoded as a regular node in NETS(Q1). Q2 contains a non-branch wildcard node; thus, the * is not encoded as part of NETS(Q2) and the next non-wildcard node is associated with relation =3.
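To make the relationship attribute and the two conditions of Theorem 2 concrete, the following Python sketch checks them by brute-force backtracking. It is not the filtering algorithm of this paper, only a small reference checker under stated assumptions: document symbols are tuples (symbol, preorder, level) numbered from 0, query symbols are pairs (symbol, relation) with relation "/" or "//" on start-symbols, wildcards and the root's own relationship are not handled, and the names `has_match` and `level_ok` as well as the example tree are hypothetical.

```python
from typing import List, Optional, Tuple

DocSym = Tuple[str, int, int]       # (symbol, preorder, level)
QSym = Tuple[str, Optional[str]]    # (symbol, "/" or "//" for start-symbols, else None)


def level_ok(rel: Optional[str], parent_level: int, level: int) -> bool:
    """Relationship attribute: '/' means exactly one level apart, '//' at least one."""
    diff = level - parent_level
    return diff == 1 if rel == "/" else diff >= 1


def has_match(doc: List[DocSym], query: List[QSym]) -> bool:
    """Search for a tree subsequence of doc equal to the query sequence."""
    # Position of the end-symbol of every document node, keyed by preorder.
    end_pos = {pre: i for i, (sym, pre, _) in enumerate(doc) if sym[0] == "E"}

    def search(qi: int, start: int, stack: List[Tuple[int, int]]) -> bool:
        if qi == len(query):
            return True
        sym, rel = query[qi]
        if sym[0] == "S":
            for i in range(start, len(doc)):
                dsym, pre, lvl = doc[i]
                if dsym != sym:
                    continue
                # stack[-1] is the document node matched to the query parent.
                if stack and not level_ok(rel, stack[-1][1], lvl):
                    continue
                if search(qi + 1, i + 1, stack + [(pre, lvl)]):
                    return True
            return False
        # End-symbol: it must be the corresponding symbol (same preorder number)
        # of the start-symbol matched for this query node.
        pre, _ = stack[-1]
        i = end_pos[pre]
        return i >= start and search(qi + 1, i + 1, stack[:-1])

    return search(0, 0, [])


# Hypothetical document tree a( b(c), b(g) ) with preorder and level annotations.
doc = [("Sa", 0, 0), ("Sb", 1, 1), ("Sc", 2, 2), ("Ec", 2, 2), ("Eb", 1, 1),
       ("Sb", 3, 1), ("Sg", 4, 2), ("Eg", 4, 2), ("Eb", 3, 1), ("Ea", 0, 0)]

q1 = [("Sa", None), ("Sb", "/"), ("Sc", "/"), ("Ec", None),
      ("Sg", "/"), ("Eg", None), ("Eb", None), ("Ea", None)]       # a/b[c]/g
q2 = [("Sa", None), ("Sb", "/"), ("Sc", "/"), ("Ec", None), ("Eb", None),
      ("Sb", "/"), ("Sg", "/"), ("Eg", None), ("Eb", None), ("Ea", None)]
q3_child = [("Sa", None), ("Sg", "/"), ("Eg", None), ("Ea", None)]   # a/g
q3_desc = [("Sa", None), ("Sg", "//"), ("Eg", None), ("Ea", None)]   # a//g

print(has_match(doc, q1), has_match(doc, q2))              # False True
print(has_match(doc, q3_child), has_match(doc, q3_desc))   # False True
```

The checker rejects q1 because the only way to embed it as a subsequence pairs an Sb and an Eb with different preorder numbers, and it rejects q3_child because g lies two levels below a. The exhaustive search is exponential in the worst case, so it is useful only for validating small examples.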
[Figure 3: NETS Encoding of Query Twigs. The figure shows the twigs of Q1 and Q2 together with their query sequences and the relationship attribute attached to each start-symbol.]

The filtering algorithm utilizes several structures for subsequence matching, as illustrated by the example in Figure 4. A runtime global stack is maintained by the filtering algorithm where each entry in the stack is a tuple that contains a start-symbol of the XML sequence and its preorder and level. The tuples are pushed to the stack as start-symbols are generated. At the start of the filtering algorithm, concurrent subsequence

searches are initiated over the query sequences. For non-recursive XML documents, only one subsequence search is needed for each query; however, for recursive XML, more than one subsequence search over the same query sequence may be initiated. Each subsequence search maintains a query position and a query stack. The query position is an integer that denotes the current position in the query sequence. The query stack is a stack where each entry is an ordered pair. The first component of the pair is the index of the matched XML start-symbol in the global stack. The second component is the position of the matched start-symbol in the query sequence. The ordered pairs are pushed onto and popped from the query stack as start and end symbols in the XML sequence are matched to the query sequence.

The filtering algorithm also utilizes a dynamic hashtable, called sequenceindex, to facilitate concurrent processing of query twigs. The sequenceindex uses the start and end symbol assigned to each XML tag as a key into the hashtable. For each key (start or end symbol) it maintains a list of queries to be matched, specified by their queryids, the query's unique identification.

3.2 Filtering Algorithm
At the start of filtering, the first node of each query sequence is inserted into the sequenceindex. The streaming XML document is parsed by the SAX parser; the ProcessStartSymbol(.) function is called when an open tag is generated and the ProcessEndSymbol(.) function is called when an end tag is generated. Note that the SAX methods have been slightly altered to maintain level and preorder information for each XML tag (see Algorithm 1).
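The structures described above and their state at the start of filtering can be sketched in Python as follows. This is only an illustration: the `SearchState` class, its field names, the zero-based position, and the two example query sequences are assumptions of the sketch rather than the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SearchState:
    """State of one subsequence search (hypothetical names)."""
    position: int = 0  # current position in the query sequence (0-based here)
    # Pairs (index of matched start-symbol in the global stack, query position).
    stack: List[Tuple[int, int]] = field(default_factory=list)


# Illustrative query sequences keyed by query id.
query_sequences: Dict[str, List[str]] = {
    "Q1": ["Sa", "Sb", "Sc", "Ec", "Sg", "Eg", "Eb", "Ea"],
    "Q2": ["Sa", "Sb", "Sc", "Ec", "Eb", "Sb", "Sg", "Eg", "Eb", "Ea"],
}

# Global stack of (start-symbol, preorder, level) tuples for the open XML tags.
global_stack: List[Tuple[str, int, int]] = []

# Hash table from a start or end symbol to the queries waiting for that symbol.
sequence_index: Dict[str, List[str]] = defaultdict(list)

# At the start of a document, every query waits for the first symbol of its sequence.
states: Dict[str, SearchState] = {}
for qid, seq in query_sequences.items():
    states[qid] = SearchState()
    sequence_index[seq[0]].append(qid)

print(dict(sequence_index))  # {'Sa': ['Q1', 'Q2']}
```

Keying the table by symbol means that an incoming SAX event touches only the queries that are currently waiting for that symbol, rather than every registered query.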
Algorithm 1: XML SAX Parser Filtering Algorithm
    int level = -1
    int preorder = -1
    Stack globalstack

    /* at the start of each new document, initialize sequenceindex and the query position */
    procedure startdocument()
        foreach query twig q do
            /* let nextsymbol denote the initial symbol in the sequence */
            sequenceindex[nextsymbol].insert(q's queryid)
            set q's query position to the start of its query sequence
        endfor

    /* generate start-symbol, preorder and level of XML tag */
    procedure startelement(tag)
        startsymbol = S + tag
        level += 1
        preorder += 1
        globalstack.push(startsymbol, preorder, level)
        ProcessStartSymbol(startsymbol, preorder, level)

    /* generate end-symbol, preorder and level of XML tag */
    procedure endElement(tag)
        endsymbol = E + tag
        /* read the node's preorder from the top of the global stack; it is kept
           separate from the running preorder counter so that later start tags
           are still numbered correctly */
        nodepreorder = preorder of the top entry of globalstack
        ProcessEndSymbol(endsymbol, nodepreorder, level)
        pop globalstack
        level -= 1

The ProcessStartSymbol(.) function is described by Algorithm 2a. This function receives a start-symbol and its preorder number and level as input. The filtering algorithm probes the sequenceindex for a list of queries that match the startsymbol. For each query in the currentlist, the algorithm verifies that the level-consistent property holds for the current query node. The top entry of the query stack is retrieved and the first component, referred to as indexparanc, is used to retrieve the xmlparancnode from the globalstack. The difference between the current startsymbol's level and the xmlparancnode's level is calculated to determine whether the query's relationship attribute is satisfied. If the property is satisfied, then the following steps are performed. First, the index of the matched startsymbol (its index in the globalstack) and the query position are inserted into the query stack. Second, the query position is incremented by one. Lastly, the next symbol in the query sequence is added to the sequenceindex.

Algorithm 2a: Filtering Algorithm - Start Symbol Handling
    procedure ProcessStartSymbol(startsymbol, preorder, level)
        /* probe sequenceindex for matching queries */
        currentlist = sequenceindex[startsymbol]
        foreach q in currentlist do
            /* let indexparanc denote the first component of the top element in q's query stack */
            xmlparancnode = globalstack.get(indexparanc)
            /* verify level-consistent property holds */
            if the difference between level and xmlparancnode.level satisfies the query's relationship attribute then
                /* let nextsymbol denote the next symbol in query sequence */
                sequenceindex[nextsymbol].insert(q's queryid)
                /* let index denote the index of the parameter startsymbol in the globalstack */
                push (index, query position) onto q's query stack
                query position += 1    /* advance query sequence position */
        endfor

Algorithm 2b: Filtering Algorithm - End Symbol Handling
    procedure ProcessEndSymbol(endsymbol, preorder, level)
        /* probe sequenceindex for matching queries */
        currentlist = sequenceindex[endsymbol]
        foreach q in currentlist do
            /* let indexparanc denote the first component of the top element of q's query stack */
            xmlparancnode = globalstack.get(indexparanc)
            /* verify preorder matches */
            if xmlparancnode.preorder = preorder then
                /* remove end and start symbol from sequenceindex */
                /* let startsymbol denote the symbol obtained by replacing E with S in the parameter endsymbol */
                sequenceindex[endsymbol].remove(q's queryid)
                sequenceindex[startsymbol].remove(q's queryid)
                pop the top element off q's query stack
                query position += 1
                /* let nextsymbol denote the next symbol in query sequence */
                if nextsymbol is null then    /* end of query sequence */
                    report query as a match
                else
                    sequenceindex[nextsymbol].insert(q's queryid)
        endfor
        /* handle query backtracking */
        currentlist = sequenceindex[startsymbol]
        foreach q in currentlist do
            /* let indexparanc denote the first component of the top element of q's query stack */
            xmlparancnode = globalstack.get(indexparanc)
            if xmlparancnode.preorder = preorder then
                /* must backtrack query sequence position */
                pop the top element off q's query stack
                delete last inserted queryid in sequenceindex
                update query position
        endfor
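The level and preorder bookkeeping of Algorithm 1 can be sketched with Python's standard `xml.sax` module. The handler below only records the (symbol, preorder, level) events that would be passed to ProcessStartSymbol(.) and ProcessEndSymbol(.); the class name `NetsHandler`, the `events` list, and the sample document are hypothetical, and the sketch pops the global stack before emitting the end event because it performs no matching.

```python
import xml.sax
from typing import List, Tuple

Event = Tuple[str, int, int]  # (symbol, preorder, level)


class NetsHandler(xml.sax.ContentHandler):
    """Turns SAX callbacks into NETS symbols annotated with preorder and level."""

    def __init__(self) -> None:
        super().__init__()
        self.level = -1      # the root element ends up at level 0
        self.preorder = -1   # the root element ends up with preorder 0
        self.global_stack: List[Event] = []
        self.events: List[Event] = []

    def startElement(self, name, attrs):
        self.level += 1
        self.preorder += 1
        entry = ("S" + name, self.preorder, self.level)
        self.global_stack.append(entry)
        self.events.append(entry)  # stands in for ProcessStartSymbol(...)

    def endElement(self, name):
        # The node's own preorder comes from the stack; the running counter
        # self.preorder must not be overwritten here.
        _, node_preorder, node_level = self.global_stack.pop()
        self.events.append(("E" + name, node_preorder, node_level))  # ProcessEndSymbol(...)
        self.level -= 1


handler = NetsHandler()
xml.sax.parseString(b"<a><b><c/></b><g/></a>", handler)
for event in handler.events:
    print(event)
```

Keeping the popped preorder in a local variable matters: if it were assigned back to the running counter, the g element in this document would receive preorder 2 again instead of 3.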

[Figure 4: Filtering Algorithm Example. Starting from the initial state, panels (a) through (j) show the sequenceindex, the query stacks of Q1 and Q2, and the global stack after each call: (a) ProcessStartSymbol(Sa), (b) ProcessStartSymbol(Sb), (c) ProcessStartSymbol(Sc), (d) ProcessEndSymbol(Ec), (e) ProcessEndSymbol(Eb), (f) ProcessStartSymbol(Sb), (g) ProcessStartSymbol(Sg), (h) ProcessEndSymbol(Eg), (i) ProcessEndSymbol(Eb), (j) ProcessEndSymbol(Ea). At the end of the filtering algorithm the end of NETS(Q2) is reached; thus, Q2 is reported as a match. The legend distinguishes deletions due to backtracking from deletions due to a match.]

The ProcessEndSymbol(.) function is described by Algorithm 2b. For each query in the currentlist, the filtering algorithm first verifies that the current endsymbol's preorder number equals that of the xmlparancnode. If there is a match, then the queryid is deleted from the start-symbol's and end-symbol's lists in the sequenceindex. If the end of the query sequence is reached, then the query is reported as a match. Otherwise, the next symbol is added to the sequenceindex.

At times, backtracking to a previous query sequence position is required due to a false match. The sequenceindex is probed for the start-symbol and a list of queryids is retrieved. For each query in currentlist, the top entry of the query stack is retrieved and the first component, referred to as indexparanc, is used to retrieve the xmlparancnode from the globalstack. The xmlparancnode's preorder is compared to the current endsymbol's preorder.
If there is a preorder match, then backtracking is performed by the following steps. First, the top entry is popped off the query stack. Second, the last inserted queryid is deleted from the sequenceindex, and lastly the query position is updated to indicate the new position in the query sequence. Below, we illustrate the execution of our filtering algorithm with the XML tree T and twig patterns Q1 and Q2 shown in Figure 1.

Example 2: In Figure 4(a), ProcessStartSymbol(.) is invoked for Sa and the sequenceindex contains Q1 and Q2 for key Sa. The level-consistent property is automatically satisfied since the query position points to the first symbol in the query sequence. Thus, the subsequence search for both Q1 and Q2 is advanced. First, the next symbol in the query sequence (Sb for both Q1 and Q2) is retrieved and the queryids are inserted into the sequenceindex for that key. Second, the pair formed by the index of the startsymbol in the global stack and the current query position is pushed onto the query stacks of Q1 and Q2. Lastly, the query position of Q1 and Q2 is incremented by 1. The algorithm proceeds to process the start-symbols Sb and Sc in Figures 4(b) and 4(c).

In Figure 4(d), ProcessEndSymbol(.) is invoked for Ec, and the sequenceindex contains Q1 and Q2 for key Ec. The first component of the top ordered pair in the query stacks of Q1 and Q2 is used to retrieve the xmlparancnode in the global stack; thus, the entry for Sc is retrieved. The preorder of the current endsymbol and that of the xmlparancnode match; thus, the search for both Q1 and Q2 is advanced. The queryids Q1 and Q2 are deleted from the lists of Sc and Ec in the sequenceindex. The next symbol in the sequence of Q1 and Q2 is Sg and Eb, respectively. The queryids are inserted into the sequenceindex for their corresponding keys. For both Q1 and Q2, the top entry is popped off the query stack and the query position is incremented by 1.

In Figure 4(e), ProcessEndSymbol(.) is invoked with label Eb. Q2 is retrieved and the preorder check is verified; thus Q2's search is advanced. First, the top element is popped off Q2's query stack. Second, Q2's query position is incremented by 1. Lastly, Q2 is deleted from the sequenceindex lists for keys Sb and Eb.
Note that the other query, which is waiting on Sg, is backtracked as well. After the first query is processed, the sequenceIndex is probed for the corresponding start-symbol S, and the other query is retrieved. The preorder check returns a match, indicating that this query should be backtracked. Thus, the top entry of its stack is popped, its queryId is deleted from the sequenceIndex list for key Sg, and its position in the query sequence is backtracked. In Figures (f) to (i), the algorithm proceeds as described by the algorithm listing. In Figure (j), the last query sequence symbol of one query is matched, and that query is reported as a match. The other query is not reported as a match because the end of its query sequence is not reached. Please note that (S,,), (E,,), (Sk,,) and (Ek,,) are not shown in the figure since the state of the structures does not change.

The example shows the execution of the basic filtering algorithm for a non-recursive XML document. If recursion occurs in the XML documents, multiple subsequence searches may be initiated over each query sequence. The data structures and the algorithm steps were slightly modified in the final implementation of the algorithm to process multiple searches for each query sequence. For each query search initiated, a stack and a position are maintained as before. The entries inserted into the sequenceIndex are extended to include the search ID; thus, for each key the sequenceIndex contains a list of ordered pairs, each holding a queryId and a searchId. ProcessStartSymbol(.) was slightly modified to initiate a new search when a recursive start-symbol is encountered. The filtering algorithm first verifies that the level-consistent property holds. If the level-consistent property is satisfied, then a new search is initiated and its corresponding stack and position are updated to denote the current position of the search in the query sequence. The other operations of the filtering algorithm are unchanged, with the exception that multiple searches must be initiated as recursive elements are encountered.

EXPERIMENTAL RESULTS

In our experiments, we compared the performance of our system to that of YFilter[] and FiST[]. We utilized the YFilter Java implementation provided by the authors. We implemented FiST and NETS in Java as well. All experiments were performed on a Quad Core .GHz processor with GB of memory running Red Hat Linux. We used the synthetic Sigmod [] dataset for our experiments. We also generated twig patterns using the XPath generator available in the YFilter package. The element names were chosen from a uniform distribution and the max depth of a twig pattern was fixed at. The number of branches in the twig patterns was fixed to, and the probability of * and // was fixed to percent.

[Figure: Filtering Time Experimental Results. (a) Wall clock time versus number of queries for YFilter, NETS, and FiST. (b) Wall clock time versus size of XML document (KB) for YFilter, NETS, and FiST.]

The figure compares the performance of our filtering system with that of YFilter and FiST. In Figure (a), the document size was fixed to KB and the number of twig queries was varied from, to, in steps of K.
For K queries, all three methods perform comparably well; however, as the number of twig patterns was increased, the filtering time for FiST and YFilter increased dramatically. In Figure (b), the number of queries was fixed to, and the document size was varied from KB to KB. For the KB case, all three methods yielded comparable results. The performance improvement of NETS became noticeable as the document size was increased to KB.

The performance improvement achieved by the NETS approach is due to two reasons. First, our filtering approach employs holistic filtering of the query sequences; thus, a match or a dismissal of a particular query twig can be made earlier in the filtering process. Second, our filtering approach filters false matches on-the-fly, thus avoiding an expensive post-processing phase. We conclude that our approach provides an efficient filtering algorithm that is scalable in terms of the number of user profiles and the size of the XML document.

CONCLUSION

We presented an XML-based filtering system that uses a new sequence encoding (NETS) for both the streaming documents and query profiles. Using NETS, we provided a necessary and sufficient condition for a query profile to have a match in a given document. An important property of our system is that false positives and false dismissals are eliminated on-the-fly with a single scan of the streaming document. Experimental evaluation showed that NETS filtering provided performance improvements in comparison to state-of-the-art filtering systems. We plan to explore the NETS encoding for the full-fledged case that reports the positions of all query matches in a streaming XML document. Moreover, we will explore the benefits of this encoding for traditional structural XML query processing.

ACKNOWLEDGMENTS

This research was partially supported by NSF grant.

REFERENCES

[] Altinel, M., and Franklin, M., Efficient Filtering of XML Documents for Selective Dissemination of Information. In VLDB Journal, pages -, September.
[] Antonellis, P., and Makris, C., XFIS: An XML Filtering System based on String Representation and Matching. International Journal on Web Engineering and Technology (IJWET). v., n., pages,.
[] Berglund, A., Boag, S., Chamberlin, D., Fernandez, M., Kay, M., Robie, J., Simeon, J., XML Path Language (XPath).. In W3C Proposed Recommendation, http://www.w.org/tr/xpath,.
[] Diao, Y., Rizvi, S., and Franklin, M.J., Towards an Internet-Scale XML Dissemination Service. In Proc. of VLDB,.
[] Diao, Y., Altinel, M., Franklin, M.J., Zhang, H., and Fischer, P., Path Sharing and Predicate Evaluation for High-Performance XML Filtering. ACM Trans. Database Syst., v., n., pages,.
[] Kwon, J., Rao, P., Moon, B., and Lee, S., Value-based Predicate Filtering of XML Documents. Data and Knowledge Engineering, v., n., pages -,.
[] Kwon, J., Rao, P., Moon, B., and Lee, S., FiST: Scalable XML Document Filtering by Sequencing Twig Patterns. In VLDB : Proceedings of the st international conference on Very large data bases, pages. VLDB Endowment,.
[] Rao, P., and Moon, B., Sequencing XML Data and Query Twigs for Fast Pattern Matching. ACM Transactions on Database Systems (TODS), v. n., pages -,.
[] University of Washington XML Repository,, http://www.cs.washington.edu/research/xmldatasets/
[] Wang, H., and Meng, X., On the Sequencing of Tree Structures for XML Indexing. In Proceedings of ICDE,.
[] Wang, H., Park, S., Fan, W., and Yu, P., ViST: A dynamic index method for querying XML data by tree structures. In Proceedings of the SIGMOD Conference, pages,.