XSelMark: A Micro-Benchmark for Selectivity Estimation Approaches of XML Queries

Sherif Sakr
National ICT Australia (NICTA), Sydney, Australia
sherif.sakr@nicta.com.au

Abstract. Estimating the sizes of query results and intermediate results is a crucial part of any effective query optimization process. For several reasons, the selectivity estimation problem in the XML domain is more complicated than in the relational domain. Several research efforts have proposed selectivity estimation approaches in the XML domain. The lack of a suitable benchmark has been one of the main obstacles to a real assessment and comparison of these approaches. This paper is a first step towards a comprehensive assessment of the available selectivity estimation approaches for XML queries, along with their strengths and weaknesses. We propose XSelMark, a selectivity estimation benchmark for XML queries. It consists of a set of 25 queries organized into seven groups and covers the main aspects of selectivity estimation of XML queries. These queries have been designed with respect to an XML document instance of XMark, a popular benchmark for XML data management. In addition, we suggest some criteria for assessing the capability and quality of selectivity estimation approaches for XML queries. Finally, we use the proposed benchmark to assess the capabilities of the state-of-the-art selectivity estimation approaches.

1 Introduction

Modern query processors rely heavily on sophisticated optimizer components to choose well among optimization decisions such as access paths, join orders and materialization strategies. Estimating the sizes of query results and intermediate results is a crucial part of any effective query optimization process. In fact, the selectivity estimation problem in the XML domain is more complicated than in the relational domain.
There are several reasons for this: 1) the absence of a strict notion of schema in XML data; 2) the dualism between structural and value-based querying; 3) the high expressiveness of the XML query languages [8]; 4) the non-uniform distribution of tags and data; and 5) the correlations and dependencies between the occurrences of elements.

In the recent past, several research efforts have proposed different selectivity estimation approaches in the XML domain [9, 19, 20, 24]. However, these approaches have never been comprehensively assessed, evaluated and compared. One of the main reasons for this situation is the lack of a suitable benchmark for conducting such assessments and comparisons. As a result, there is no clear view of the state of the art in this domain, which in turn makes it difficult to decide what further steps should be taken. Although the XML research community has proposed several benchmarks [4, 5, 10, 16, 17, 21, 23] which are very useful for their intended targets and perspectives, none of them is suitable for assessing and evaluating the different selectivity estimation approaches for XML queries. The author of this paper faced this problem during his work in [20].

In general, XML benchmarks can be classified into two main categories: 1) Application (macro) benchmarks [4, 5, 17, 21, 23], which are used to evaluate the overall performance of an XML management system. Benchmarks of this kind are therefore not very useful for a detailed assessment of the specific aspects of an implementation that need improvement. 2) Micro-benchmarks [10, 16], which are designed to assess the performance of specific features of a system. In [16], Michiels et al. have motivated the crucial need for different micro-benchmarks in order to understand the different aspects of implementing efficient query processors in the XML domain. Therefore, the goal of this paper is to develop an XML micro-benchmark, XSelMark, which is mainly focussed on exercising the selectivity estimation aspects of XML queries. The proposed benchmark is a first step towards an overview of the state of the art of the available approaches in the domain of selectivity estimation of XML queries, along with their strengths and weaknesses. It aims to be a guide for researchers and implementors in benchmarking and improving their research efforts in this domain.
XSelMark consists of 25 queries organized into seven groups, where each group addresses the challenges posed by a different aspect of XML query result size estimation.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of the related benchmarks in the XML domain. Section 3 describes the main aspects of the selectivity estimation problem in the XML domain. Section 4 presents the set of queries of the XSelMark benchmark. Section 5 presents an overview and an assessment of the features supported by the state-of-the-art selectivity estimation approaches for XML queries, before we conclude in Section 6.

2 Related Work

Several benchmarks for the evaluation of XML data management systems have been proposed by the XML research community [4, 5, 10, 16, 17, 21, 23]. Most of these benchmarks are application oriented [4, 5, 17, 21, 23], while a few others are designed as micro-benchmarks [10, 16]. In this section we give a brief overview of the state of the art of XML benchmarks.

XMach-1 [4] is a scalable multi-user benchmark. It is based on a web application and considers text documents and catalog data. It defines only a small number of XML queries that cover multiple functions and update operations for which system performance is determined. The benchmark consists of 8 queries and 3 update operations. The goal of the benchmark is to test how many queries per second the query engine can execute and to stress XML systems under a multi-user workload.

XOO7 [5] is considered to be the XML counterpart of the OO7 benchmark [6], which is geared towards object repositories. Besides mapping the database and original queries of OO7 into XML, XOO7 is enriched with document and navigational queries that are specific to XML databases. The goal of XOO7 is to evaluate the performance of XML management systems.

XBench [23] is a comprehensive XML database benchmark that covers a large number of XML database applications. These applications are characterized by whether they are data-centric or text-centric and whether they consist of a single document or multiple documents. The XBench workload covers the functionality of XQuery as captured in the Use Cases.

XMark [21] is a single-user benchmark. The database model is based on an internet auction site and consists of one big, regularly structured XML document with text and non-text data. It provides a concise and comprehensive set of queries which allows users and developers to assess the performance characteristics of different XML engines.

The TPOX benchmark [17] is an application-level XML database benchmark based on a financial application scenario. It is used to evaluate the performance of XML database systems. It is mainly focussed on exercising all aspects of XML database management systems, such as storage, indexing, logging, transaction processing and concurrency control. The workload of TPOX consists of insert, update and delete operations as well as query operations.

XPathMark [10] is an XPath 1.0 micro-benchmark for XMark data.
It presents a set of XPath queries that covers the major aspects of the XPath language, including different axes, node tests, Boolean operators, references, and functions. The targets of XPathMark are to assess the functional completeness, correctness, efficiency and data scalability of XPath implementations.

MemBeR [16] is another micro-benchmark. Its main focus is to benchmark XQuery engines with respect to the efficiency of their implementation of four important XQuery constructs: XPath navigation, XPath predicates, XQuery FLWORs and XQuery node construction. These four constructs form the foundation of the language, and thus their efficient implementation greatly impacts the overall query engine performance.

3 Main Aspects of Selectivity Estimation in the XML Domain

When looking for an efficient, capable and accurate selectivity estimation approach for XML queries, several issues need to be addressed. From the experience of our work in [20], the major issues of this problem include:

It should support structural and data value queries. In principle, all XML query languages can involve structural conditions in addition to value-based conditions. Therefore, any complete selectivity estimation system for XML queries has to maintain statistical summary information about both the structure and the data values of the underlying XML documents. A recommended way of doing this is to apply the XMill approach [14]: separate the structural part of the XML document from the data part, and then group the related data values according to their path and data type into homogeneous sets. A suitable summary structure for each set can then be easily selected. For example, the most common approaches for summarizing numerical data values are histograms and wavelets, while several tree synopses could be used to summarize the structural part.

It must be practical. In general, one of the main uses of selectivity estimation approaches is to accelerate the query evaluation process. Thus, while theoretical guarantees are important for any proposed approach, practical considerations are much more important. The performance characteristics of the selectivity estimation process are a crucial aspect of any approach. The selectivity estimation of any query or sub-query must be much faster than its actual evaluation. In other words, the cost saved in query evaluation by using the selectivity information must be higher than the cost of performing the selectivity estimation. In addition, the summary structure(s) required for the selectivity estimation process must be efficient in terms of memory and space consumption.

It should be strongly capable. The standard XML query languages, XPath and XQuery, are very rich. They provide a rich set of functions and features.
These features include structure and content-based search, path expressions, element construction, join, sort, duplicate elimination and aggregation operations. Thus, a good selectivity estimation approach should be able to provide accurate estimates for a wide range of these features. In addition, it should maintain special summary information about the underlying source XML documents. For example, a universal assumption of uniform distribution of the element structure and the data values may lead to many estimation errors because of the irregular nature of many XML documents.

It should be composable. The XML query languages, especially XQuery, are compositional in nature, as sub-expressions are combined with each other to form the final query. Hence, a good selectivity estimation approach should be able to estimate the selectivity of the final expression as well as of each sub-expression. This feature is crucial for any cost-based query optimizer, since it enables the selection of a cheap execution plan according to the selectivity information supplied for each sub-expression.

It must be accurate. On the one hand, providing the query optimizer with accurate estimates can effectively accelerate the evaluation of any query. On the other hand, providing the query optimizer with incorrect selectivity information will lead it to incorrect decisions and consequently to inefficient execution plans.

It should be independent. It is recommended that the selectivity estimation process be independent of the actual evaluation process, so that it can be used with different query engines that apply different evaluation mechanisms.

4 XSelMark Benchmark Queries

XMark [21] is a well-known benchmark for XML data management. The XMark database models an internet auction web site. XMark comes with an XML generator that produces XML documents according to a numeric scaling factor proportional to the document size. We base the queries of our proposed benchmark on the structure of the XMark document auction.xml, which is described in detail in [21]. The set of queries of our proposed benchmark, XSelMark, represents a mix of XML queries which covers a wide set of the major selectivity estimation aspects in the domain of XML queries. They are designed to allow a realistic assessment of the advantages and shortcomings of the proposed XML selectivity estimation approaches and to identify their respective impact. The queries are expressed using two standard XML query languages, XPath and XQuery. They are concise, easy to read and understand, and available at the web page of the benchmark [1].

4.1 Group 1: Path Expressions

The path expression is a fundamental building block for querying XML data. This group of queries investigates the ability of the selectivity estimation approaches to deal with structural XML queries.

Q1) Path expression with non-recursive axes: Find the names of all persons.
    /site/people/person/name/text()
where non-recursive axes are child, parent, attribute, following-sibling and preceding-sibling.

Q2) Path expression with recursive axes: Find all description nodes that are descendants of item nodes.
    /site//item//description
where recursive axes are descendant, descendant-or-self, ancestor and ancestor-or-self.

Q3) Path expression with wild cards: Return the item subtrees of all regions.
    /site/regions/*//item/*
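
Judging an estimator on the structural queries of this group requires the true result size of each query as a reference point. The following is a minimal sketch and not part of the benchmark itself: it counts the results of Q1 to Q3 with Python's standard xml.etree.ElementTree module. The toy document is hand-written in the shape of auction.xml (it is not XMark generator output), the helper name true_cardinality is hypothetical, and the paths are written relative to the site root because ElementTree supports only a subset of XPath (Q1 is counted without its trailing text() step).

```python
import xml.etree.ElementTree as ET

# Hand-written toy document shaped like XMark's auction.xml
# (an assumption for illustration, not generator output).
DOC = """
<site>
  <regions>
    <africa>
      <item id="item0"><name>drum</name><description><text>gold drum</text></description></item>
    </africa>
    <asia>
      <item id="item1"><name>vase</name><description><text>old</text></description></item>
      <item id="item2"><name>fan</name><description><text>paper</text></description></item>
    </asia>
  </regions>
  <people>
    <person id="person0"><name>Ann</name></person>
    <person id="person1"><name>Bob</name></person>
  </people>
</site>
"""

# Paths are relative to the <site> root element.
QUERIES = {
    "Q1": "./people/person/name",   # text() step dropped (not supported)
    "Q2": ".//item//description",
    "Q3": "./regions/*//item/*",
}


def true_cardinality(root, path):
    """Number of nodes the path selects, used as the ground truth."""
    return len(root.findall(path))


root = ET.fromstring(DOC)
cardinalities = {q: true_cardinality(root, p) for q, p in QUERIES.items()}
print(cardinalities)  # {'Q1': 2, 'Q2': 3, 'Q3': 6}
```

An estimate produced by any of the approaches under assessment can then be compared against these exact counts.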

Q4) Path expression with order-based axes: Return the description nodes that follow the elements named closed_auction.
    /site//closed_auction/following::description
where order-based axes are following, following-sibling, preceding and preceding-sibling. Supporting this type of query requires the selectivity estimation approach to capture specific statistical information about the order of the elements in the XML documents.

Q5) Branching XPath expressions: Return the names of all persons who have age information in their profiles.
    /site//person[profile/age]/name

4.2 Group 2: Twig Expressions

Q6) Simple twig expression: Return the names and descriptions of all items.
    for $b in //item
    return ($b/name, $b/description)

Q7) Twig expression with element construction: Return the restructured results of the names and descriptions of all items.
    for $b in //item
    return <Result>
             <name>{$b/name}</name>
             <price>{$b/price}</price>
           </Result>

4.3 Group 3: Predicates

The estimation of predicate selectivity is a well-known problem in database theory and practice. The most common solutions rely on histograms to capture the distribution of data values, and fall back on a uniform distribution when nothing is known about the data involved in the predicate. In the context of XML, predicate selectivity estimation poses new challenges: 1) Predicates can be structural as well as value-based. 2) Positional predicates are a special form of predicate over the order information of the elements in the XML document. 3) XML elements are usually distributed in a non-uniform way. Hence, assuming a simple uniform distribution of the element structure may lead to many estimation errors, especially when the sequence of nodes being operated on is constructed by merging nodes from different groups of data elements.

Q8) Positional predicates: Return the third bidder of each open auction.
    /site/open_auctions/open_auction/bidder[3]
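
To illustrate why Q8 needs order or fan-out information, note that each open_auction contributes exactly one result node if and only if it has at least three bidder children. The sketch below is a hedged illustration under that observation: it records a fan-out distribution (how many parents have exactly k children with a given tag) and derives the result size of a positional predicate from it. The toy document and the function names fanout_distribution and estimate_positional are hypothetical, and a real synopsis would typically store a compressed (bucketed) version of this distribution, which would make the result approximate.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written toy data; element names follow the XMark auction schema.
DOC = """
<site><open_auctions>
  <open_auction id="a0"><bidder/><bidder/><bidder/><bidder/></open_auction>
  <open_auction id="a1"><bidder/></open_auction>
  <open_auction id="a2"><bidder/><bidder/><bidder/></open_auction>
  <open_auction id="a3"/>
</open_auctions></site>
"""


def fanout_distribution(root, parent_path, child_tag):
    """Map k -> number of parents that have exactly k children named child_tag."""
    dist = Counter()
    for parent in root.findall(parent_path):
        dist[len(parent.findall(child_tag))] += 1
    return dist


def estimate_positional(dist, position):
    """Each parent with at least `position` such children yields one node."""
    return sum(count for fanout, count in dist.items() if fanout >= position)


root = ET.fromstring(DOC)
dist = fanout_distribution(root, "./open_auctions/open_auction", "bidder")
estimate = estimate_positional(dist, 3)
exact = len(root.findall("./open_auctions/open_auction/bidder[3]"))
print(estimate, exact)  # 2 2
```

Because the distribution here is stored exactly, the estimate coincides with the true result size; the error of a practical approach comes from how coarsely it summarizes this distribution.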

Q9) Equality predicates: Return the closed auctions with price equal to 40.
    /site//closed_auction[price = 40]

Q10) Range predicates: Return the closed auctions with price less than 40.
    /site//closed_auction[price < 40]
where range predicates use any of the operators (<, <=, =, !=, >, >=).

Q11) Conjunctive/disjunctive predicates: Return the closed auctions with price greater than 40 and less than 100.
    /site//closed_auction[price > 40 and price < 100]
where the predicates can be combined with any of the conjunctive/disjunctive operators (AND, OR).

Q12) Predicates with merged nodes from different paths: Return the African and Asian items with id value greater than 100.
    for $b in (/site/africa/item, /site/asia/item)
    where data($b/@id) > 100
    return $b
An accurate estimate for such a query should consider the different distributions of the data values of the nodes resulting from each path expression, as well as the share of each path in constructing the nodes of the sequence being operated on.

Q13) Predicates with merged nodes from different paths and hybrid nature: Return the price nodes and quantity nodes with value greater than 100.
    for $b in (/site//price, /site//quantity)
    where data($b) > 1 and data($b) > 100
    return $b
This query is more challenging than the previous one because the nodes of the sequence being operated on represent completely different data items (price, quantity), which may have totally different distributions of their data values.

Q14) String predicates: Return all persons with id value greater than person200.
    /site/people/person[@id > "person200"]

4.4 Group 4: Value-Based Joins (Theta Joins)

This group of queries assesses the ability of the selectivity estimation approaches to deal effectively and accurately with value-based join operations between the data values of XML nodes.
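
Both a range predicate such as the one in Q10 and the value-based joins of this group can be estimated from value histograms. The following is a minimal sketch under stated assumptions, not a method prescribed by XSelMark: it uses equi-width histograms, assumes values are uniformly distributed inside each bucket, assumes the two join operands are independent, and requires both histograms to share the same bucket boundaries. The function names and the sample values are illustrative only.

```python
def build_histogram(values, lo, hi, buckets):
    """Equi-width histogram over [lo, hi): a list of per-bucket counts."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts


def estimate_less_than(counts, lo, hi, c):
    """Estimated number of values < c, assuming uniformity inside buckets."""
    width = (hi - lo) / len(counts)
    est = 0.0
    for i, n in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        if c >= b_hi:            # bucket lies entirely below c
            est += n
        elif c > b_lo:           # c cuts through the bucket
            est += n * (c - b_lo) / width
    return est


def estimate_greater_join(counts_x, counts_y):
    """Estimated number of pairs (x, y) with x > y.

    Pairs from a higher x-bucket and a lower y-bucket always qualify;
    for pairs in the same bucket, half are assumed to qualify.
    """
    est = 0.0
    y_below = 0  # y values in buckets strictly below the current one
    for nx, ny in zip(counts_x, counts_y):
        est += nx * y_below + 0.5 * nx * ny
        y_below += ny
    return est


# Illustrative sample values (not taken from an XMark document).
prices = [10, 20, 30, 45, 60, 90]
increases = [5, 30, 80]
h_price = build_histogram(prices, 0, 100, 4)
h_increase = build_histogram(increases, 0, 100, 4)

range_estimate = estimate_less_than(h_price, 0, 100, 40)
join_estimate = estimate_greater_join(h_increase, h_price)
join_exact = sum(1 for x in increases for y in prices if x > y)
print(h_price, range_estimate, join_estimate, join_exact)
```

The gap between join_estimate and join_exact on such a small sample shows how the within-bucket assumptions translate directly into estimation error, which is what the queries of this group are meant to expose.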

Q15) Value-based join where the values of each operand are constructed by a path expression: Return all pairs of increase value and price value where the increase value is greater than the price value.

    for $x in /site//increase, $y in /site//price
    where data($x) > data($y)
    return <pair>{$x,$y}</pair>

Q16) Value-based join where the values of one operand are constructed by a path expression and the values of the other operand are constructed by a path expression combined with an arithmetic expression: Return all pairs of increase value and price value where the increase value is greater than the price value multiplied by 2.
    for $x in /site//increase, $y in /site//price
    where data($x) > data($y) * 2
    return <pair>{$x,$y}</pair>

Q17) Equi-joins of data values: Return all pairs of increase value and price value where the increase value is equal to the price value.
    for $x in /site//increase, $y in /site//price
    where data($x) = data($y)
    return <pair>{$x,$y}</pair>

4.5 Group 5: Arithmetic and Comparison Operations over Data Value Statistics

This group of queries assesses the ability of the selectivity estimation approaches not only to capture summary information about the data values of the XML elements, but also to apply arithmetic and comparison operations over this summary information in a consistent and accurate way that does not hurt the quality of the selectivity estimation results.

Q18) Arithmetic over data value statistics 1: Return all pairs of increase value and price value where the sum of the increase value and the price value is greater than 100.
    for $x in /site//increase, $y in /site//price
    where data($x) + data($y) > 100
    return <pair>{$x,$y}</pair>

Q19) Arithmetic over data value statistics 2: Return all pairs of increase value and price value where the sum of the increase value and the price value is equal to 100.
    for $x in /site//increase, $y in /site//price
    where data($x) + data($y) = 100
    return <pair>{$x,$y}</pair>

Q20) Arithmetic and comparison operations over data value statistics 3: Return all triples of increase value, price value and income where the sum of the increase value and the income value is greater than the sum of the price value and the income value.

    for $x in /site//increase, $y in /site//price, $z in /site//@income
    where data($x) + data($z) > data($y) + data($z)
    return <pair>{$x,$y,$z}</pair>

4.6 Group 6: Nested Expressions

XQuery, like other XML query languages such as SQL/XML [7], allows free nesting: nested queries can be used for many purposes, such as reshaping elements or computing aggregate values. Since the result of a nested query may be the input to navigational or filtering operations in the outer query, predicting the size of nested query results requires building on-the-fly statistics about these intermediate results.

Q21) Let - Aggregates: Return the names of persons and the number of items that they bought.
    for $p in /site/people/person
    let $a := for $t in /site//closed_auction
              where $t/buyer/@person = $p/@id
              return $t
    return <item>
             <person>{$p/name/text()}</person>
             <count>{count($a)}</count>
           </item>

Q22) Predicates with values constructed by an aggregate function: Return the open auctions whose sum of bidder increases is greater than 1000.
    for $b in /site/open_auctions/open_auction
    where sum(data($b/bidder/increase)) > 1000
    return <increase>{$b}</increase>

4.7 Group 7: Data Dependent Estimations

This group of queries requires capturing additional specific forms of summary information about the data values of the underlying XML documents.

Q23) Full text search: Return the names of all items whose description contains the word gold.
    /site//item[contains(description, "gold")]

Q24) Distinct operator: Return the distinct price values.
    for $b in distinct-values(//price/text())

Q25) Existential document order: Return the open auctions where a certain person issued a bid before another person.

    for $b in /site/open_auctions/open_auction
    where some $pr1 in $b/bidder/personref[@person = "person20"],
               $pr2 in $b/bidder/personref[@person = "person51"]
          satisfies $pr1 << $pr2
    return <history>{$b}</history>

5 XML Selectivity Estimation Approaches: State of the Art

In this section we give an overview of the state of the art of selectivity estimation approaches in the XML domain. We then use the set of XSelMark queries to assess the capabilities and features supported by each approach.

The work of Aboulnaga et al. [2] is considered to be the first to deal with the selectivity estimation of simple path expressions. They presented two different techniques for capturing the structure of XML documents and for providing accurate selectivity estimates for simple path expressions. The first technique is a summarizing tree structure called a path tree. It is a tree containing each distinct rooted path in the database. The second technique is a statistical structure called a Markov table. This table, implemented as an ordinary hash table, contains every distinct path of length up to m together with its selectivity. The presented techniques only work for simple path expressions, that is, expressions without predicates, inline conditions, recursive axes or order-based axes.

In [15], the authors present XPathLearner, a selectivity estimation system for XPath expressions which employs the same summarization and estimation techniques presented in [2] with two main modifications. The first modification is that it gathers and refines the required statistical information in an on-line manner from query feedback. The second modification is that it supports the handling of predicates by storing statistical information for each distinct tag-value pair in the source XML document.

The work of Zhang et al. [24] focuses mainly on the handling of XPath expressions which involve only structural conditions.
The main idea behind the paper is to provide an efficient treatment of recursive XML documents and an accurate estimation of recursive queries. The authors define a structure, called XSEED, for summarizing the source XML documents into a compact graph. Relying on this statistical graph structure, the authors propose an algorithm for the selectivity estimation of structural XPath expressions.

In [11], Freire et al. have presented an XML Schema-based statistics collection technique called StatiX. StatiX leverages the information available in the XML Schema to capture both structural and value statistics about the source XML documents. These structural and value statistics are collected in the form of histograms. The StatiX system is employed in LegoDB [3], a cost-based XML-to-relational storage mapping engine which tries to generate efficient relational configurations for XML documents.

In [22], Wang et al. have proposed a special histogram structure, named the Bloom Histogram, for the selectivity estimation of XPath queries in a dynamic context. The Bloom Histogram keeps a count of the statistics for paths in XML data. A Bloom Histogram H is constructed by sorting the frequency values of the distinct paths in the XML data and then grouping the paths with similar frequency values into buckets. Although the Bloom Histogram is designed to deal with data updates and its estimation error is theoretically bounded by its size, it is very limited, as it deals only with simple forms of path expressions.

In [13], Li et al. have described a framework for estimating the selectivity of XPath expressions with a main focus on the order-based axes (following, preceding, following-sibling, preceding-sibling). They used two histogram structures, called the p-histogram and the o-histogram, to aggregate the path and order information of XML data. A p-histogram is built for each distinct element tag to summarize the pathid-frequency information. In this histogram, each bucket contains a set of path ids and their average frequency value. The o-histogram summarizes the path-order information of each distinct element tag name to capture the sibling-order information based on the path ids.

In [9], Fisher et al. have proposed the SLT XML tree synopsis. The main idea of this synopsis is to remove the repeated patterns in the XML tree and to replace the multiple occurrences of equal subtrees by pointers to a single occurrence of the subtree. They described an algorithm for representing the resulting DAG structures using a special form of grammar called an SLT grammar (straight line tree grammar). A tree automaton is designed to run over the generated lossy SLT grammars to estimate the selectivity of queries containing all XPath axes, including the order-sensitive ones.
The SLT synopsis can deal only with structural XPath queries, with no support for any form of predicate queries or XQuery expressions.

In [19], Polyzotis et al. have proposed the XCluster synopses as a clustering-based framework that can capture the key correlations between and across structure and values of different types. XCluster is considered to be a generalized form of the XSketch tree synopses, which is earlier work by the same authors presented in [18]. It employs the well-known histogram techniques for numeric and string values, and introduces the class of end-biased term histograms for summarizing the distribution of unique terms within textual XML content. This approach can support twig queries with predicates on numeric content, string content, and textual content. However, the authors did not mention how XCluster can be extended to deal with more complicated query situations such as value-based join operations and nested expressions.

The work of [20] has described the design and implementation of a relational algebra-based framework for estimating the selectivity of XQuery expressions. In this approach, XML queries are translated into relational algebraic plans [12]. Summary information about the structure and the data values of the underlying XML documents is kept separately. Then, by exploiting the relational algebraic infrastructure, the special properties of the generated algebraic plans, the summary information and a set of inference rules, the relational estimation approach is able to provide accurate selectivity estimations in the context of the XML and XQuery domains. The framework enjoys the flexibility of integrating any XPath or predicate selectivity estimation technique, which enables it to support the selectivity estimation of a large subset of the powerful XML query language XQuery and to provide estimates not only for the whole XQuery expression but also for each sub-expression, as well as for each iteration in the context of FLWOR expressions.

Features Assessment

One of the main goals of the XSelMark benchmark is to provide a framework for assessing the completeness of the selectivity estimation approaches of XML queries. We used the set of XSelMark benchmark queries for an initial assessment of the features supported by the state-of-the-art approaches. Table 1 lists the set of queries supported by each approach, where the symbol X indicates that the approach supports the associated query and the symbol - indicates that it does not. The assessment has shown some interesting preliminary results:

1) Most of the selectivity estimation approaches [11, 13, 15, 24, 22] are limited to supporting only small subsets of the XML query languages. They are only able to deal with structural XPath queries.
2) The two synopses of [13, 9] are the only two synopses which are able to support the selectivity estimation of order-sensitive XPath axes.
3) The approaches of [19, 20] cover a wider range of the XML query features. The synopsis of [19] is the only one which is able to deal with the estimation of full-text search queries, while [20] is uniquely able to deal with many of the features of the XQuery language, such as join operations and different types of predicates.
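The coverage comparison behind these observations can be made concrete by counting, for each approach, how many of the 25 benchmark queries it is marked as supporting. The sketch below transcribes the X / - marks reported in Table 1. The string encoding of the rows and the function name `coverage` are assumptions chosen for this illustration, not part of the benchmark.

```python
# Illustrative tally of the support marks reported in Table 1.
APPROACHES = [
    "XPathLearner [15]", "XSEED [24]", "StatiX [11]",
    "Path-Order Histogram [13]", "Bloom Histogram [22]",
    "SLT Grammar [9]", "XCluster [19]", "Relational Alg. Est. [20]",
]

# One string per query Q1..Q25; one character per approach, in the order above.
SUPPORT = [
    "XXXXXXXX",  # Q1
    "XXXXXXXX",  # Q2
    "XXXXXXXX",  # Q3
    "---X-X-X",  # Q4
    "XXXX-XX-",  # Q5
    "------XX",  # Q6
    "------XX",  # Q7
    "-----X-X",  # Q8
    "X-X---XX",  # Q9
    "X-X---XX",  # Q10
    "X-----XX",  # Q11
    "-------X",  # Q12
    "-------X",  # Q13
    "------XX",  # Q14
    "-------X",  # Q15
    "-------X",  # Q16
    "-------X",  # Q17
    "-------X",  # Q18
    "-------X",  # Q19
    "-------X",  # Q20
    "-------X",  # Q21
    "--------",  # Q22
    "------X-",  # Q23
    "--------",  # Q24
    "--------",  # Q25
]


def coverage(approaches, support):
    """Count, per approach, the number of queries marked 'X'."""
    counts = {name: 0 for name in approaches}
    for row in support:
        for name, mark in zip(approaches, row):
            if mark == "X":
                counts[name] += 1
    return counts


counts = coverage(APPROACHES, SUPPORT)
for name, supported in counts.items():
    print(f"{name}: {supported}/{len(SUPPORT)} queries")
```

With the marks as transcribed, the relational approach [20] is marked for 20 of the 25 queries and XCluster [19] for 11, while every other approach is marked for 7 or fewer.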
| Query | XPathLearner [15] | XSEED [24] | StatiX [11] | Path-Order Histogram [13] | Bloom Histogram [22] | SLT Grammar [9] | XCluster [19] | Relational Alg. Est. [20] |
|-------|-------------------|------------|-------------|---------------------------|----------------------|-----------------|---------------|---------------------------|
| Q1    | X | X | X | X | X | X | X | X |
| Q2    | X | X | X | X | X | X | X | X |
| Q3    | X | X | X | X | X | X | X | X |
| Q4    | - | - | - | X | - | X | - | X |
| Q5    | X | X | X | X | - | X | X | - |
| Q6    | - | - | - | - | - | - | X | X |
| Q7    | - | - | - | - | - | - | X | X |
| Q8    | - | - | - | - | - | X | - | X |
| Q9    | X | - | X | - | - | - | X | X |
| Q10   | X | - | X | - | - | - | X | X |
| Q11   | X | - | - | - | - | - | X | X |
| Q12   | - | - | - | - | - | - | - | X |
| Q13   | - | - | - | - | - | - | - | X |
| Q14   | - | - | - | - | - | - | X | X |
| Q15   | - | - | - | - | - | - | - | X |
| Q16   | - | - | - | - | - | - | - | X |
| Q17   | - | - | - | - | - | - | - | X |
| Q18   | - | - | - | - | - | - | - | X |
| Q19   | - | - | - | - | - | - | - | X |
| Q20   | - | - | - | - | - | - | - | X |
| Q21   | - | - | - | - | - | - | - | X |
| Q22   | - | - | - | - | - | - | - | - |
| Q23   | - | - | - | - | - | - | X | - |
| Q24   | - | - | - | - | - | - | - | - |
| Q25   | - | - | - | - | - | - | - | - |

Table 1. An assessment of the capabilities of the state-of-the-art selectivity estimation approaches using the XSelMark benchmark.

6 Conclusion

Several research efforts have been invested in designing Macro-Benchmarks to assess the overall performance of XML data management systems. There is currently a big demand for Micro-Benchmarks which assess specific aspects of the XPath, XQuery and XML management system domains. Several research efforts have proposed different selectivity estimation approaches in the XML domain. Due to the lack of a suitable benchmark, it has been difficult to assess, evaluate and compare these approaches in order to get a clear view of the state of the art. This paper is a first step towards a comprehensive assessment of the available selectivity estimation approaches of XML queries. We proposed XSelMark as a Micro-Benchmark to assess the state of the art of the selectivity estimation approaches of XML queries. An initial assessment of the features and capabilities of the current approaches has shown that most of them are limited to supporting the estimation of structural XPath queries. Hence, several avenues for further research and development are still widely open in this domain to provide accurate, capable and complete frameworks aligned with the rich querying capabilities of the standard XML query languages.

We believe that XSelMark is useful for both researchers and developers. It identifies the major aspects of selectivity estimation of XML queries, helps researchers to discover the strengths and weaknesses of the current approaches, and provides researchers and developers with a clearer view for developing more enhanced mechanisms of selectivity estimation of XML queries. In addition, we believe that the selectivity estimation problem is an important research field which has many useful applications other than being a crucial piece of an effective query optimization process, such as: 1) allowing the query engines to provide the users with early feedback about the expected outcome of their queries and the associated computational effort; 2) providing the query engines with hints on the possible avenues to optimize the resource allocation of the execution process; 3) playing an effective role in efficient approximate query answering techniques.
As future work, we plan to use XSelMark to perform a more detailed assessment of the selectivity estimation approaches of XML queries in terms of their accuracy, performance and memory requirements.

References

1. XSelMark: A Micro-Benchmark of Selectivity Estimation of XML Queries. http://xselmark.sourceforge.net/.
2. A. Aboulnaga, A. Alameldeen, and J. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In VLDB, 2001.
3. P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML Schema to Relations: A Cost-Based Approach to XML Storage. In ICDE, 2002.
4. T. Böhme and E. Rahm. XMach-1: A Benchmark for XML Data Management. In BTW, 2001.
5. S. Bressan, M. Lee, Y. Li, Z. Lacroix, and U. Nambiar. The XOO7 Benchmark. In VLDB 2002 Workshops, London, UK, 2003.
6. M. Carey, D. DeWitt, and J. Naughton. The OO7 Benchmark. SIGMOD Record (ACM Special Interest Group on Management of Data), 22, 1993.
7. A. Eisenberg and J. Melton. Advancements in SQL/XML. SIGMOD Record, 33(3):79-86, 2004.
8. M. Fernández, A. Malhotra, J. Marsh, M. Nagy, and N. Walsh. XQuery 1.0 and XPath 2.0 Data Model (XDM). World Wide Web Consortium Proposed Recommendation, November 2006. http://www.w3.org/tr/xpath-datamodel.
9. D. Fisher and S. Maneth. Structural Selectivity Estimation for XML Documents. In ICDE, 2007.
10. M. Franceschet. XPathMark: An XPath Benchmark for the XMark Generated Data. Database and XML Technologies, 2005.
11. J. Freire, J. Haritsa, M. Ramanath, P. Roy, and J. Siméon. StatiX: Making XML Count. In SIGMOD, 2002.
12. T. Grust, S. Sakr, and J. Teubner. XQuery on SQL Hosts. In VLDB, 2004.
13. H. Li, M. Lee, W. Hsu, and G. Cong. An Estimation System for XPath Expressions. In ICDE, 2006.
14. H. Liefke and D. Suciu. XMill: An Efficient Compressor for XML Data. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD, 2000.
15. L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Parr. XPathLearner: An On-line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. In VLDB, 2002.
16. P. Michiels, I. Manolescu, and C. Miachon. Toward Microbenchmarking XQuery. Information Systems, 33(2), 2008.
17. M. Nicola, I. Kogan, and B. Schiefer. An XML Transaction Processing Benchmark. In SIGMOD, 2007.
18. N. Polyzotis and M. Garofalakis. Structure and Value Synopses for XML Data Graphs. In VLDB, 2002.
19. N. Polyzotis and M. Garofalakis. XCluster Synopses for Structured XML Content. In ICDE, 2006.
20. S. Sakr. Cardinality-Aware and Purely Relational Implementation of an XQuery Processor. PhD thesis, University of Konstanz, 2007.
21. A. Schmidt, F. Waas, M. Kersten, M. Carey, I. Manolescu, and R. Busse. XMark: A Benchmark for XML Data Management. In VLDB, 2002.
22. W. Wang, H. Jiang, H. Lu, and J. Xu Yu. Bloom Histogram: Path Selectivity Estimation for XML Data with Updates. In VLDB, 2004.
23. B. Yao, T. Özsu, and J. Keenleyside. XBench - A Family of Benchmarks for XML DBMSs. In VLDB Workshop, 2003.
24. N. Zhang, T. Özsu, A. Aboulnaga, and I. Ilyas. XSEED: Accurate and Fast Cardinality Estimation for XPath Queries. In ICDE, 2006.