ARQo: The Architecture for an ARQ Static Query Optimizer


Markus Stocker, Andy Seaborne
Digital Media Systems Laboratory
HP Laboratories Bristol
HPL-2007-92
June 26, 2007*

Keywords: semantic web, SPARQL, query optimization

Abstract. In this paper we describe the architecture of ARQo, a first approach for SPARQL static query optimization in ARQ. Specifically, we focus on static optimization of BasicGraphPatterns (BGP) for in-memory models. Static query optimization is understood as a query rewriting process in which the set of triple patterns defined for a BGP is rewritten according to a specific order. We propose a rewriting process that orders joined triple patterns by increasing estimated execution cost. Specifically, the estimated execution cost is a function of multiple parameters such as the estimated selectivity of joined triple patterns and the availability of indexes or pre-calculated result sets.

* Internal Accession Date Only. Approved for External Publication.
Copyright 2007 Hewlett-Packard Development Company, L.P.

1 Introduction

Query optimization is an ongoing research area for Relational Database Management Systems (RDBMS). Generally, database systems that provide a declarative language to describe what a query should return require advanced optimization techniques to enable query evaluation within user-friendly time frames. Modern database systems implement a number of optimization techniques on different abstraction levels. For instance, the evaluation of Query Execution Plans (QEP) is a fundamental optimization, and sophisticated techniques greatly affect query execution performance.

In this paper we present and discuss ARQo, a static query optimizer for ARQ. Specifically, we focus on static optimization of SPARQL [7] BasicGraphPatterns (BGP) for in-memory models. While SPARQL query engines such as ARQ and Sesame generally implement index structures for in-memory models to allow fast retrieval of triples, they still lack advanced static (i.e. query rewriting) optimization techniques to optimize the query at the QEP level.

ARQo is a first proposal to address the problem of SPARQL static query optimization in ARQ. The goal is to find the QEP which minimizes the execution costs.
Here, execution costs are a function of multiple parameters which may be arbitrarily combined, depending on whether or not the required information is available within the query execution context. For instance, index structures or pre-calculated result sets are examples of such parameters.

ARQo implements an extensible architecture for QEP cost estimation. Essentially, the framework enables the implementation of cost estimation techniques and implements a generic strategy to search for the most promising QEP, i.e. the QEP which is expected to evaluate the query fastest. Therefore, we formalize the concepts of BasicGraphPattern (BGP), QEP plan space and QEP to meet the needs of the architecture. Moreover, we extend the framework with cost estimation techniques which may be used within different ARQ deployments depending on the specific settings. Specifically, we intend to develop (1) a generic technique based on variable counting, which is especially convenient if specialized information about the underlying ontology is unavailable. Further, we intend to implement (2) a more sophisticated technique based on the concept of joined triple pattern selectivity introduced by Bernstein et al. [1]. For the estimation of joined triple pattern selectivity, we intend to implement the Selectivity Estimation Index (SEI) and the Query Pattern Index (QPI), both of which the authors describe in [1].

The remainder of this paper is structured as follows. In Section 2, we briefly review some related work. In Section 3, we formalize the concepts required for the ARQo architecture, especially the BGP, the QEP and the plan space. In Section 4, we describe the proposed architecture in more detail, focusing on the selected approach to identify the most promising QEP. Finally, in Section 5, we present and discuss the QEP cost estimation extensions we intend to provide.

2 Related Work

Static query optimization is an ongoing research field, especially for relational database management systems. Years of research have turned modern database systems into powerful and finely tuned infrastructures which allow interactive querying for most applications.
In the literature (IBM System R [8], INGRES [9], POSTGRES), the problem of finding the best QEP is mainly solved using two different approaches: query decomposition or exhaustive search. Query decomposition is a heuristic greedy algorithm that proceeds in a stepwise fashion. The major drawback of this approach is that potentially good plans are ignored by the algorithm. An exhaustive search of the plan space identifies every join combination, i.e. all plans for processing two-way joins are considered. The drawback of this approach is the time required for planning, which is longer than for the query decomposition approach since the algorithm has to scan a potentially huge plan space. The benefit of an exhaustive search is that good plans are not overlooked.

IBM System R [8] is probably the most prominent and the first example of a relational database system which considers static query optimization techniques.

The static optimizer of System R implements an exhaustive search of the plan space where the cost of a plan is calculated based on the required disk accesses and, thus, on the selectivity of conditions [6].

The Sesame open source RDF framework uses some general query optimization techniques based on query rewriting according to the specificity of triple patterns. Essentially, the specificity is a function of the number and type of variables defined for the subject, the predicate and the object of a triple pattern. The query planner selects the triple patterns in order of decreasing specificity.

Harth et al. propose in [4] an optimized index structure for RDF [2] which is extended with statistical information about the data set in the form of occurrence counts of access patterns. This allows efficient lookup of the result set size of an access pattern and, thus, static query optimization according to the selectivity of access patterns. In our approach, we provide selectivity information about joined patterns, which is fundamental since the selectivity of two patterns may be low when considered independently but the joined pattern may be highly selective.

3 ARQo Formalization

In this section we formalize the concepts required for ARQo and the goal of SPARQL static query optimization of BasicGraphPatterns (BGP) for in-memory models. To clarify the idea, we provide some figures based on the BGP of Listing 1.1 (i.e. the WHERE clause of LUBM Query 2, from the Lehigh University Benchmark (LUBM) [3]).

Listing 1.1. Example BasicGraphPattern (LUBM Query 2)

    1  ?X rdf:type ub:GraduateStudent .
    2  ?Y rdf:type ub:University .
    3  ?Z rdf:type ub:Department .
    4  ?X ub:memberOf ?Z .
    5  ?Z ub:subOrganizationOf ?Y .
    6  ?X ub:undergraduateDegreeFrom ?Y .

Given a BGP, B, we define B to be a graph G which is described by the set $\mathcal{G}$ of undirected connected graphs. The elements $g \in \mathcal{G}$ are the components of G. For each pair $g_i, g_j \in \mathcal{G}$ with $i \neq j$, $g_i$ and $g_j$ are disconnected.
For SPARQL, an undirected connected graph $g \in \mathcal{G}$ is an ordered pair $g := (N, E)$, where N is a set whose elements are triple patterns (i.e. the nodes of g) and E is a set of unordered pairs of distinct triple patterns which are joined by a common variable (i.e. the edges of g).

In Figure 1, we display the undirected connected graph $g_1 \in \mathcal{G}$ for the BGP of Listing 1.1. Since the triple patterns of the BGP are all joined together, $\mathcal{G}$ contains only one element, i.e. the connected graph $g_1$. Note that the numbering used for the nodes of $g_1$ corresponds to the numbering of the triple patterns of the BGP in Listing 1.1. For any two nodes $i, j \in g_1$, there is a path between i and j if the triple patterns corresponding to i and j are joined together by one or more variables.

[Fig. 1. Undirected connected graph $g_1 \in \mathcal{G}$]

In SPARQL, the execution order of pairwise disconnected graphs $g \in \mathcal{G}$ is arbitrary, since the overall result set corresponds to the Cartesian product of the result sets for every $g \in \mathcal{G}$. Thus, the problem of statically optimizing B reduces to the sub-problems of optimizing each $g \in \mathcal{G}$.

Definition 1. The size of $g \in \mathcal{G}$ is the number of nodes of g.

Definition 2. A QEP for $g \in \mathcal{G}$ is an ordered set of the nodes of g.

Definition 3. For $g \in \mathcal{G}$ we define the set P as the QEP space of g. The size of P is the total number of QEPs $p \in P$, i.e. the number of ordered permutations of the nodes of g.

Given the size $n = |N|$ of $g \in \mathcal{G}$, the size of P is $n!$ (for a uniprocessor machine). Thus, even for simple SPARQL queries with a few joined triple patterns, the expanded QEP space P is huge. Therefore, we need an efficient way to identify the most promising QEP.

A QEP for B is described by an unordered set, Q, whose elements $p \in P$ are QEPs. The size of Q equals the size of $\mathcal{G}$. Thus, the QEP for B is described by the QEP of each component of G. Therefore, we state the optimization problem as the search for the most promising QEP p within the plan space P for each undirected connected graph $g \in \mathcal{G}$, thereby building the unordered set Q of QEPs for the components of G, where G is the graph which reflects the BGP B.

Note that a QEP $p \in P$ can be represented as a Directed Acyclic Graph (DAG). We define D as the set of DAGs d for a $g \in \mathcal{G}$. For any $d \in D$, the unordered set E of joined triple patterns (i.e. edges) of $g \in \mathcal{G}$ is transformed into a set A of directed edges (i.e. arcs).

In Figure 2, we show the DAG corresponding to the QEP which executes the BGP of Listing 1.1 top-down. For any two nodes $i, j \in d_1$, there is a directed path between i and j if the triple patterns corresponding to i and j are joined together by one or more variables and i is executed first.

[Fig. 2. DAG $d_1 \in D$]

Hence, there is a clear relationship between the set P of QEPs and the set D of DAGs. More precisely, we can state the following function,

    $f : \mathrm{QEP} \rightarrow \mathrm{DAG}$    (1)

which is neither injective nor surjective, and hence not bijective. A QEP maps to exactly one DAG, but a DAG is an abstraction for one or more QEPs. Following Equation 1, a QEP may be described as a materialization of an undirected connected graph $g \in \mathcal{G}$ in the form of a directed acyclic graph $d \in D$.

4 ARQo Architecture

In this section we concisely describe the architecture of ARQo. Essentially, ARQo is a plug-in module for ARQ with three main components: (1) the optimization core, (2) a heuristic broker and (3) an extensible pool of heuristic techniques.

4.1 The Optimization Core

The optimization core implements the main technique used to statically optimize SPARQL Basic Graph Patterns (BGP). As described in Section 3, we abstract a BGP as a graph G, described by a set $\mathcal{G}$ of undirected connected graphs, i.e. the components of G. We then optimize each $g \in \mathcal{G}$, i.e. we search for the most promising materialization of g in the form of a DAG. This task is solved in three main steps.

Given a $g \in \mathcal{G}$, we (1) provide g with estimated execution costs for the nodes and edges. These costs are estimated by heuristics which are implemented in ARQo and are extensible for user-specific purposes.

Then, we (2) identify the Minimum Spanning Tree (MST) of g by means of an algorithm which returns an ordered set T of joined triple patterns (i.e. edges). For instance, the Kruskal MST algorithm [5] identifies the edges of the MST in order of increasing weight. The elements $t \in T$, i.e. edges of g, are ordered according to the total cost of each element t. The total cost of t is calculated as a function of the costs of the edge and the related nodes. Thus, we define the partial order relation $(T, \leq)$ with the properties of reflexivity, antisymmetry and transitivity. Moreover, we define an order relation $(N, \leq)$ for the elements $t \in T$, where N is the set of nodes for a specific $t \in T$. Thus, we get a set T with edges t which are ordered by increasing estimated total cost, where each $t \in T$ is a set N of the nodes of t. The set N is a pair of nodes ordered by increasing estimated cost.

Finally, we (3) extract the nodes from T into a set Q by selecting the nodes according to the two order relations $(T, \leq)$ and $(N, \leq)$. The set Q is a list of distinct nodes, i.e. triple patterns, and reflects the most promising QEP. Note that the elements $q \in Q$, i.e. the nodes of g, represent a specific QEP and, thus, define a DAG. We may describe Q as a materialization of g in the form of the DAG specified by the elements $q \in Q$. In fact, the order relation of Q, which is specified by $(T, \leq)$ and $(N, \leq)$, allows us to construct the corresponding DAG by selecting the head and tail nodes $q \in Q$ according to the order relation of Q and the join relationship between node pairings.

[Fig. 3. MST for the undirected connected graph $g_1 \in \mathcal{G}$ of Listing 1.1]

In Figure 3, we show the undirected connected graph $g_1 \in \mathcal{G}$ for the BGP of Listing 1.1. For each node and edge, we provide the selectivities of the corresponding (joined) triple pattern(s). Further, we highlight the MST.
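As a companion to Figure 3, the join graph it is drawn over can be derived mechanically from Listing 1.1. The following is a minimal sketch, not ARQo code: the function names (`variables`, `join_edges`, `components`) are hypothetical, and triple patterns are represented as plain string tuples. It builds the edge set E (pairs of patterns sharing a variable), finds the components of G, and reports the size of the QEP space for each component.

```python
from itertools import combinations
from math import factorial

# Triple patterns of Listing 1.1, keyed by their number in the listing.
BGP = {
    1: ("?X", "rdf:type", "ub:GraduateStudent"),
    2: ("?Y", "rdf:type", "ub:University"),
    3: ("?Z", "rdf:type", "ub:Department"),
    4: ("?X", "ub:memberOf", "?Z"),
    5: ("?Z", "ub:subOrganizationOf", "?Y"),
    6: ("?X", "ub:undergraduateDegreeFrom", "?Y"),
}


def variables(pattern):
    """Return the variables (terms starting with '?') of a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def join_edges(bgp):
    """Edges E: unordered pairs of distinct patterns sharing a variable."""
    return [
        (i, j)
        for i, j in combinations(sorted(bgp), 2)
        if variables(bgp[i]) & variables(bgp[j])
    ]


def components(bgp, edges):
    """Connected components of the join graph, found by depth-first search."""
    adjacency = {node: set() for node in bgp}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, result = set(), []
    for start in sorted(bgp):
        if start in seen:
            continue
        stack, component = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adjacency[node] - seen)
        result.append(sorted(component))
    return result


EDGES = join_edges(BGP)
COMPONENTS = components(BGP, EDGES)
# One QEP space per component; its size is n! for a component with n nodes.
PLAN_SPACE_SIZES = [factorial(len(c)) for c in COMPONENTS]

print(EDGES)
print(COMPONENTS)
print(PLAN_SPACE_SIZES)
```

For Listing 1.1 the sketch finds nine edges and a single component of six nodes, so the QEP space holds 6! = 720 candidate plans, which is why an exhaustive enumeration is avoided.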
Based on the MST highlighted in Figure 3, we may build the ordered set T of edges with the elements

    T = {(5, 6), (5, 3), (5, 2), (1, 4), (6, 4)}

Finally, the ordered set Q of distinct nodes which reflects the most promising QEP contains the elements

    Q = {5, 6, 3, 2, 1, 4}
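Steps (2) and (3) of the optimization core can be sketched as follows. This is an illustration under stated assumptions, not the ARQo implementation: the node and edge costs are hypothetical integers chosen so that the result reproduces the sets T and Q above (the selectivities of Figure 3 are not used), and the total cost of an edge is assumed to be the sum of the edge cost and the costs of its two nodes, since the text only says it is "a function" of them. Kruskal's algorithm is implemented with a small union-find structure.

```python
# Hypothetical cost units per triple pattern (node) of Listing 1.1.
NODE_COST = {1: 5, 2: 4, 3: 3, 4: 6, 5: 1, 6: 2}

# Hypothetical cost units per joined triple pattern (edge).
EDGE_COST = {
    (1, 4): 1, (1, 6): 20, (2, 5): 3, (2, 6): 22, (3, 4): 24,
    (3, 5): 2, (4, 5): 30, (4, 6): 5, (5, 6): 1,
}


def total_cost(edge):
    """Assumed total cost: edge cost plus the costs of both related nodes."""
    u, v = edge
    return EDGE_COST[edge] + NODE_COST[u] + NODE_COST[v]


def find(parent, node):
    """Union-find lookup with path halving."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def ordered_mst_edges():
    """Kruskal: visit edges by increasing total cost, keep those joining two trees.

    Each kept edge is stored as a pair ordered by increasing node cost,
    which realizes the two order relations (T, <=) and (N, <=).
    """
    parent = {node: node for node in NODE_COST}
    tree = []
    for edge in sorted(EDGE_COST, key=total_cost):
        root_u, root_v = find(parent, edge[0]), find(parent, edge[1])
        if root_u != root_v:
            parent[root_u] = root_v
            cheaper, dearer = sorted(edge, key=NODE_COST.__getitem__)
            tree.append((cheaper, dearer))
    return tree


def extract_plan(tree):
    """Step (3): read the distinct nodes off T in order to obtain Q."""
    plan = []
    for pair in tree:
        for node in pair:
            if node not in plan:
                plan.append(node)
    return plan


T = ordered_mst_edges()
Q = extract_plan(T)
print(T)  # [(5, 6), (5, 3), (5, 2), (1, 4), (6, 4)]
print(Q)  # [5, 6, 3, 2, 1, 4]
```

With other cost estimates the same procedure yields a different tree and therefore a different plan; only the ordering logic is fixed.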

[Fig. 4. DAG of the most promising QEP]

Listing 1.2. Optimized BasicGraphPattern (LUBM Query 2)

    ?Z ub:subOrganizationOf ?Y .
    ?X ub:undergraduateDegreeFrom ?Y .
    ?Z rdf:type ub:Department .
    ?Y rdf:type ub:University .
    ?X rdf:type ub:GraduateStudent .
    ?X ub:memberOf ?Z .

In Figure 4 we display the DAG which corresponds to the most promising QEP and reflects the selected sequence of distinct nodes (i.e. triple patterns) of the set Q, whereas in Listing 1.2 we list the reordered, optimized BGP.

4.2 The Heuristic Broker

The heuristic broker is a component which selects the best heuristic cost estimation for joined triple patterns according to specific settings. Such settings may be the availability of specialized statistical information about the underlying ontology, cached result sets for joined triple patterns or specialized access paths for triple pattern evaluation. The main goal of the broker is to provide the optimization core with the best available cost estimation heuristic according to the settings, or with the heuristic specifically selected by the user.

4.3 The Extensible Pool of Heuristic Techniques

The extensible pool of heuristic techniques implements a number of different heuristics to estimate the execution cost of (joined) triple patterns. Such heuristics are required by the optimization core to select the most promising QEP. In fact, they provide the estimated execution costs for the nodes and edges of each $g \in \mathcal{G}$. We implement the third component as a typical framework which can be extended with specific, user-defined techniques. In Section 5, we describe in more detail the techniques we intend to provide out of the box.
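How the broker and the pool might fit together can be sketched as below. Everything here is hypothetical: the class names (`Heuristic`, `VariableCounting`, `PrecomputedCosts`, `HeuristicBroker`), the string-based settings and the subject/predicate/object weights are assumptions made for illustration, and none of it is ARQ or ARQo API. The fallback heuristic follows the variable counting idea of Section 5.1: edges cost a constant 1.0, and a variable subject is weighted higher than a variable predicate or object.

```python
from abc import ABC, abstractmethod


class Heuristic(ABC):
    """Pool member: estimates costs for nodes (patterns) and edges (joins)."""

    name = "abstract"
    requires = frozenset()  # settings that must be available to use it

    @abstractmethod
    def node_cost(self, pattern):
        ...

    @abstractmethod
    def edge_cost(self, left, right):
        ...


class VariableCounting(Heuristic):
    """Needs no statistics; cost grows with the number of variables."""

    name = "variable-counting"
    WEIGHTS = (2.0, 1.0, 1.0)  # hypothetical weights: subject, predicate, object

    def node_cost(self, pattern):
        return sum(w for w, term in zip(self.WEIGHTS, pattern) if term.startswith("?"))

    def edge_cost(self, left, right):
        return 1.0  # constant edge cost, as described in Section 5.1


class PrecomputedCosts(Heuristic):
    """Stand-in for a statistics-based technique: looks costs up in tables."""

    name = "precomputed"
    requires = frozenset({"statistics"})

    def __init__(self, node_costs, edge_costs):
        self._node_costs = node_costs
        self._edge_costs = edge_costs

    def node_cost(self, pattern):
        return self._node_costs[pattern]

    def edge_cost(self, left, right):
        return self._edge_costs[frozenset((left, right))]


class HeuristicBroker:
    """Hands the optimization core the best applicable heuristic."""

    def __init__(self, pool):
        self._pool = list(pool)  # ordered from most to least preferred

    def register(self, heuristic):
        # Assumption: user-defined techniques take precedence over built-ins.
        self._pool.insert(0, heuristic)

    def select(self, settings, requested=None):
        if requested is not None:  # heuristic explicitly chosen by the user
            for heuristic in self._pool:
                if heuristic.name == requested:
                    return heuristic
            raise KeyError(requested)
        for heuristic in self._pool:
            if heuristic.requires <= settings:
                return heuristic
        raise LookupError("no applicable heuristic in the pool")


broker = HeuristicBroker([PrecomputedCosts({}, {}), VariableCounting()])
chosen = broker.select(settings=set())
print(chosen.name)
```

Keeping the pool ordered by preference makes the selection rule explicit: the first heuristic whose requirements are met wins, unless the user names one directly.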

5 ARQo Extensions

In this section, we briefly introduce the two main heuristics for (joined) triple pattern cost estimation we intend to develop for ARQo.

5.1 Variable Counting

The first heuristic is a simple variable counting technique. The goal is to provide a simple optimization technique even if specialized statistical information about the underlying ontology is not available to ARQo. This heuristic will provide the optimization core with a constant cost of 1.0 for edges. However, the cost of a node will be a function of the number and type of variables defined in the corresponding triple pattern. In fact, a higher execution cost will be assigned to triple patterns with more variables. Moreover, a variable subject leads to a higher execution cost than a variable predicate or object.

5.2 Selectivity Estimation

A second heuristic is an implementation of the Selectivity Estimation Index (SEI) and the Query Pattern Index (QPI), first introduced by Bernstein et al. and described in [1]. The SEI is used to estimate the execution cost of triple patterns, i.e. nodes, whereas the QPI is used to estimate the cost of joins, i.e. edges. Based on specialized statistical indexes about the underlying ontology, the SEI and QPI are expected to result in improved SPARQL static query optimization (as illustrated by the evaluation performed by Bernstein et al. in [1]). Next, we concisely describe the SEI and the QPI and present an approach for combining both to estimate more accurately the selectivity of joined triple patterns defined in SPARQL queries.

SEI: The Selectivity Estimation Index. The SEI is used to estimate the selectivity [8] of a triple pattern. Given a triple pattern T, the formula

    $sel(T) = sel(S) \times sel(P) \times sel(O)$

estimates the selectivity of T. Thus, the selectivity of a triple pattern is calculated as the product of the selectivities of its building blocks, i.e. the subject S, the predicate P and the object O. We refer to Bernstein et al. [1] for a detailed description of the SEI technique.

QPI: The Query Pattern Index. The QPI provides information about the selectivity of joined triple patterns, more precisely, the selectivity of two joined triple patterns. The ontology schema is used to automatically generate the QPI. In fact, the rdfs:domain and rdfs:range properties of rdf:Property instances allow us to identify pairs of rdf:Property instances which can be joined because of a matching class for the rdfs:domain and rdfs:range properties.
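The schema-driven step described for the QPI, finding property pairs whose rdfs:domain or rdfs:range classes match, can be illustrated with a small sketch. It is not the algorithm of Bernstein et al. [1]: the toy schema with its `ex:` names is invented, classes are compared for exact equality (no subclass reasoning), self-joins of a property are omitted, and the selectivity values the QPI would store for each pair are not modelled.

```python
from itertools import combinations

# Hypothetical toy schema: property -> (rdfs:domain, rdfs:range).
SCHEMA = {
    "ex:memberOf": ("ex:Person", "ex:Organization"),
    "ex:subOrganizationOf": ("ex:Organization", "ex:Organization"),
    "ex:advisor": ("ex:Person", "ex:Person"),
}


def class_at(schema, prop, position):
    """Class expected at a pattern position: domain for subject, range for object."""
    domain, range_ = schema[prop]
    return domain if position == "subject" else range_


def joinable_pairs(schema):
    """List (p1, pos1, p2, pos2) where a shared variable could sit at both positions."""
    pairs = []
    for p1, p2 in combinations(sorted(schema), 2):
        for pos1 in ("subject", "object"):
            for pos2 in ("subject", "object"):
                if class_at(schema, p1, pos1) == class_at(schema, p2, pos2):
                    pairs.append((p1, pos1, p2, pos2))
    return pairs


PAIRS = joinable_pairs(SCHEMA)
for pair in PAIRS:
    print(pair)
```

For the toy schema, `ex:memberOf` and `ex:subOrganizationOf` are reported as joinable on the object of the first and the subject of the second, because both positions expect an `ex:Organization`.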

For a detailed discussion of the QPI technique we refer to Bernstein et al. [1]. (The QPI technique is not yet published. However, the authors have submitted both techniques as a contribution to the ISWC.)

Combining the SEI and the QPI. The QPI is essentially a list of pairs of joined triple patterns with variables defined for the subject and the object. (Note that the variables are taken from the set (?x, ?y, ?z), i.e. the joined triple pattern defines a common variable where the class type matches for the rdfs:domain and rdfs:range properties.) However, SPARQL queries do not always define variables for the subject and the object of joined triple patterns. Thus, the QPI enables the retrieval of a selectivity for joined patterns which is an upper bound for the selectivity of the pattern defined in the query. For instance, consider the following SPARQL query (we list only the WHERE clause):

    ?x rdf:type ub:GraduateStudent .
    ?x ub:undergraduateDegreeFrom ?y .

The QPI allows a lookup for the selectivity of

    ?x rdf:type ?y .
    ?x ub:undergraduateDegreeFrom ?z .

which only approximates the real selectivity of the joined pattern defined in the SPARQL query. In fact, the QPI selectivity of the pattern is an upper bound for the selectivity of any possible query pattern which can be built for the two joined patterns. Thus, we know that the selectivity of a SPARQL query pattern is potentially lower. We intend to use the SEI selectivity estimation to estimate the selectivity for a given subject and object. The QPI selectivity may be reduced using the SEI selectivity estimation by a specific formula.

5.3 Other Heuristics

Of course, we may think of many other cost estimation heuristics. For instance, we could extend ARQ with a cache for the selectivity of executed query patterns. This incremental cache could be exploited by a corresponding heuristic which considers the cached selectivities while estimating the cost of (joined) triple patterns. As another example, we could extend ARQ with a cache for result sets of query patterns and implement a heuristic which assigns lower execution costs to triple patterns whose result sets are cached.

As an example of a user-specific setting, if someone provides a simple index of the occurrences of distinct predicates, we may implement a simple heuristic which uses this index to estimate the execution cost of triple patterns.
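The predicate-occurrence example of Section 5.3 could be realized along the following lines. This is a sketch under assumptions: the index contents are invented numbers, the cost is taken to be the fraction of triples carrying the predicate, and a variable predicate is assigned the maximum cost of 1.0 because it may match every triple. None of these choices comes from the paper.

```python
# Hypothetical user-provided index: predicate -> number of triples using it.
PREDICATE_COUNTS = {
    "rdf:type": 500,
    "ub:memberOf": 300,
    "ub:subOrganizationOf": 20,
}


def node_cost(pattern, counts):
    """Estimate a triple pattern's cost from its predicate's share of all triples."""
    total = sum(counts.values())
    predicate = pattern[1]
    if predicate.startswith("?") or total == 0:
        return 1.0  # unbound predicate (or empty index): assume the worst case
    return counts.get(predicate, 0) / total


PATTERNS = [
    ("?X", "rdf:type", "ub:GraduateStudent"),
    ("?X", "ub:memberOf", "?Z"),
    ("?Z", "ub:subOrganizationOf", "?Y"),
    ("?X", "?p", "?Y"),
]

# Cheapest patterns first, as the optimization core would prefer them.
RANKED = sorted(PATTERNS, key=lambda p: node_cost(p, PREDICATE_COUNTS))
for pattern in RANKED:
    print(pattern, round(node_cost(pattern, PREDICATE_COUNTS), 3))
```

Such a heuristic only supplies node costs; edge costs would still have to come from another source, for instance the constant used by variable counting.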

6 Future Work and Conclusions

As future work, we intend to implement the proposed ARQo architecture with its three components based on the description provided in this paper. We will investigate different cost estimation techniques which allow QEP evaluation also for specialized settings, which may include caching of result sets or learning systems which incrementally store selectivity information of query patterns. Finally, we intend to extensively evaluate the performance of ARQ extended with ARQo on different data sets and retrieval tasks.

We believe ARQo will provide ARQ with a robust static query optimization module for basic graph patterns and in-memory models which is extensible and adaptable to specific settings and which will also lead to improved performance for generic ARQ deployments.

References

1. Abraham Bernstein, Christoph Kiefer, and Markus Stocker. OptARQ: A SPARQL Optimization Approach based on Triple Pattern Selectivity Estimation. Technical Report, Department of Informatics, University of Zurich.
2. Dan Brickley and R.V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. Technical report, W3C.
3. Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics, 3(2):158-182.
4. Andreas Harth and Stefan Decker. Optimized Index Structures for Querying RDF from the Web. In 3rd Latin American Web Congress.
5. Joseph B. Kruskal, Jr. On the shortest spanning subtree of a graph and the traveling salesman problem. In Proceedings of the American Mathematical Society, volume 7, pages 48-50.
6. Gregory Piatetsky-Shapiro and Charles Connell. Accurate estimation of the number of tuples satisfying a condition. In SIGMOD '84: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, New York, NY, USA. ACM Press.
7. Eric Prud'hommeaux and Andy Seaborne. SPARQL Query Language for RDF. Technical report, W3C.
8. P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD '79: Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 23-34, New York, NY, USA. ACM Press.
9. Eugene Wong and Karel Youssefi. Decomposition - a strategy for query processing. ACM Trans. Database Syst., 1(3):223-241.
