Processing Regular Path Queries Using Views or What Do We Need for Integrating Semistructured Data?

Similar documents
View-based query processing: On the relationship between rewriting, answering and losslessness

Graph Databases. Advanced Topics in Foundations of Databases, University of Edinburgh, 2017/18

Data Integration 1. Giuseppe De Giacomo. Dipartimento di Informatica e Sistemistica Antonio Ruberti Università di Roma La Sapienza

Data Integration: A Theoretical Perspective

New Rewritings and Optimizations for Regular Path Queries

following syntax: R ::= > n j P j $i=n : C j :R j R 1 u R 2 C ::= > 1 j A j :C j C 1 u C 2 j 9[$i]R j (» k [$i]r) where i and j denote components of r

Rewriting Ontology-Mediated Queries. Carsten Lutz University of Bremen

Structural characterizations of schema mapping languages

Preferentially Annotated Regular Path Queries

Data integration lecture 2

Data integration lecture 3

A Retrospective on Datalog 1.0

Semantic data integration in P2P systems

Conjunctive queries. Many computational problems are much easier for conjunctive queries than for general first-order queries.

XML Research for Formal Language Theorists

Regular Expressions for Data Words


Finite Model Theory and Its Applications

Exercises Computational Complexity

On Reconciling Data Exchange, Data Integration, and Peer Data Management

Logical Aspects of Massively Parallel and Distributed Systems

Structural Characterization of Graph Navigational Languages (Extended Abstract)

The Complexity of Relational Queries: A Personal Perspective

MASTRO-I: Efficient integration of relational data through DL ontologies

Logic as a Query Language: from Frege to XML. Victor Vianu U.C. San Diego

XML Data Exchange. Marcelo Arenas P. Universidad Católica de Chile. Joint work with Leonid Libkin (U. of Toronto)

OWL 2 Profiles. An Introduction to Lightweight Ontology Languages. Markus Krötzsch University of Oxford. Reasoning Web 2012

On the Role of Integrity Constraints in Data Integration

Rewrite and Conquer: Dealing with Integrity Constraints in Data Integration

Provable data privacy

SPARQL with Property Paths

Containment of Data Graph Queries

Who won the Universal Relation wars? Alberto Mendelzon University of Toronto

Ontologies and Databases

10/24/12. What We Have Learned So Far. XML Outline. Where We are Going Next. XML vs Relational. What is XML? Introduction to Data Management CSE 344

Towards Implementing Finite Model Reasoning in Description Logics

Mastro Studio: a system for Ontology-Based Data Management

Finding Equivalent Rewritings in the Presence of Arithmetic Comparisons

Conjunctive Query Containment in Description Logics with n-ary Relations

COMP718: Ontologies and Knowledge Bases

On the Hardness of Counting the Solutions of SPARQL Queries

On the Data Complexity of Consistent Query Answering over Graph Databases

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

Introduction to Database Systems CSE 414

Introduction to Database Systems CSE 414

Introduction to Data Management CSE 344

Counting multiplicity over infinite alphabets

Monadic Datalog Containment on Trees

Representing and Querying XML with Incomplete Information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Answering Pattern Queries Using Views

Learning Path Queries on Graph Databases

Relative Information Completeness

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Reasoning about XML Update Constraints

Data Integration: Logic Query Languages

Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart

Querying Graph Databases 1/ 83

Regular Path Queries on Graphs with Data

Local Constraints in Semistructured Data Schemas

Data Integration: A Logic-Based Perspective

Semantic Acyclicity on Graph Databases

DATABASE THEORY. Lecture 11: Introduction to Datalog. TU Dresden, 12th June Markus Krötzsch Knowledge-Based Systems

SEMANTIC ACYCLICITY ON GRAPH DATABASES

A Framework for Ontology Integration

The Relational Model

Maurizio Lenzerini. Dipartimento di Ingegneria Informatica Automatica e Gestionale Antonio Ruberti

Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving

Semantic Subtyping. Alain Frisch (ENS Paris) Giuseppe Castagna (ENS Paris) Véronique Benzaken (LRI U Paris Sud)

Reasoning over Evolving Graph-structured Data Under Constraints

CSE-6490B Final Exam

Introduction to Database Systems CSE 444

Introduction to Semistructured Data and XML. Overview. How the Web is Today. Based on slides by Dan Suciu University of Washington

RELATIONAL REPRESENTATION OF ALN KNOWLEDGE BASES

Comparison of XPath Containment Algorithms

e-service Composition by Description Logics Based reasoning. 1 Introduction

Structural Characterizations of Schema-Mapping Languages

Efficiently Managing Data Intensive Ontologies

Streaming Tree Transducers

Consistency and Set Intersection

Logic and Databases. Phokion G. Kolaitis. UC Santa Cruz & IBM Research - Almaden

Theorem 2.9: nearest addition algorithm

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.

DL-Lite: Tractable Description Logics for Ontologies

SMO System Management Ontology

Introduction to Semistructured Data and XML

On the Complexity of Query Answering over Incomplete XML Documents

Foundations of Schema Mapping Management

FOUNDATIONS OF DATABASES AND QUERY LANGUAGES

Static Analysis and Verification in Databases. Victor Vianu U.C. San Diego

The Inverse of a Schema Mapping

When Can We Answer Queries Using Result-Bounded Data Interfaces?

Center for Semantic Web Research.

Semantic Optimization of Preference Queries

A TriAL: A navigational algebra for RDF triplestores

Three easy pieces on schema mappings for tree-structured data

Complexity and Composition of Synthesized Web Services

Posets, graphs and algebras: a case study for the fine-grained complexity of CSP s

Schema.org as a Description Logic

Query Containment for Data Integration Systems

Transcription:

Processing Regular Path Queries Using Views or What Do We Need for Integrating Semistructured Data? Diego Calvanese University of Rome La Sapienza joint work with G. De Giacomo, M. Lenzerini, M.Y. Vardi Logic-based Methods for Information Integration Vienna August 23, 2003

Data integration Deals with the problem of providing a uniform access to a collection of data stored in multiple, autonomous, and heterogeneous data sources. Basic problem in: management of distributed information systems data warehousing data re-engineering enterprise knowledge management querying multiple sources on the web e-commerce, e-business, e-government, e- integration of data from distributed scientific experiments D. Calvanese Processing Regular Path Queries Using Views 1

Framework for data integration Query Global Schema A R B C S D T E Mapping U V W u 1 v 1 w 1 u 2 v 2 w 2 Source Schema 1 Source Schema 2 Source 1 Source 2 D. Calvanese Processing Regular Path Queries Using Views 2

Quality in query answering Among the various tasks in data integration, we deal with how to answer queries expressed on the global schema: View-based query processing D. Calvanese Processing Regular Path Queries Using Views 3

Quality in query answering Among the various tasks in data integration, we deal with how to answer queries expressed on the global schema: View-based query processing The data integration system should be designed in such a way that suitable quality criteria are met. Here, we concentrate on: Soundness: the answer to a query includes only what is known to be true Completeness: the answer to a query includes all that is known to be true We aim at getting exactly what is known. But, what is known depends on how the data integration system is modeled D. Calvanese Processing Regular Path Queries Using Views 3-a

Formal framework A data integration system I is a triple G, S, M, where G is the global schema S is the source schema M is the mapping between G and S D. Calvanese Processing Regular Path Queries Using Views 4

Formal framework A data integration system I is a triple G, S, M, where G is the global schema S is the source schema M is the mapping between G and S Semantics of I: which are the (global) databases that satisfy I? We start from a source database D, representing the data at the sources The (global) databases B that satisfy I wrt D are those that: are legal wrt the global schema G, and satisfy the mapping M wrt D D. Calvanese Processing Regular Path Queries Using Views 4-a

Semistructured data Semistructured data are an abstraction for data on the web, structured documents, XML: A semistructured database is an edge-labeled graph bib article book o1 article... book reference reference o15 author o68 title o75...... o37...... reference author o64 reference o42 title author o71 o58 o83... Victor Vianu Regular... Tova Milo firstname Type Inference... lastname o52 o53 Dan Suciu D. Calvanese Processing Regular Path Queries Using Views 5

Semistructured data Semistructured data are an abstraction for data on the web, structured documents, XML: A semistructured database is an edge-labeled graph Queries need to provide the ability to navigate the graph: regular path queries (RPQs) and 2-way regular-path-queries (2RPQs) bib reference o15 author o68 title o75 Victor Vianu Regular... o1 article book reference o37............ reference author... article book reference o42 o58 title author... o64 o71 o83 Tova Milo Type Inference... firstname lastname o52 o53 Dan Suciu Q 1 (x, y) x ((article + book) ref title) y Q 2 (x, y) x (article (ref + ref ) title) y D. Calvanese Processing Regular Path Queries Using Views 5-a

Integrating semistructured data We consider data integration systems I = G, S, M where: The global schema G simply fixes the set of labels of the database The sources in S are binary relations The mapping M is of type local-as-view (LAV): to each source s it associates a 2RPQ view V s over G D. Calvanese Processing Regular Path Queries Using Views 6

Integrating semistructured data We consider data integration systems I = G, S, M where: The global schema G simply fixes the set of labels of the database Example: G = {article, ref, title, author,...} The sources in S are binary relations Example: S = {s 1, s 2, s 3 }, where s 1 stores for each bibliography its articles s 2 stores for each publ. the ones it references directly or indirectly s 3 stores for each publication its title The mapping M is of type local-as-view (LAV): to each source s it associates a 2RPQ view V s over G Example: V s1 (b, a) b (article) a V s2 (p 1, p 2 ) p 1 (ref ) p 2 V s3 (p, t) p (title) t D. Calvanese Processing Regular Path Queries Using Views 6-a

Assumptions on the sources Let D be a source database and B a global database that satisfies I wrt D: sound source: s(d) V s (B) all tuples in the source satisfy V s, but there may be other tuples satisfying V s that are not in the source complete source: s(d) V s (B) all tuples that satisfy V s are in the source, but there may be also tuples in the source not satisfying V s exact source: s(d) = V s (B) the tuples in the source are exactly those that satisfy V s (i.e., both sound and complete) We will assume that sources are sound (unless we explicitly say otherwise) D. Calvanese Processing Regular Path Queries Using Views 7

View-based query processing tasks View-based query answering: compute the set of certain answers to a query over the global schema is the basic basic query processing task View-based query rewriting: reformulate a query over the global schema in terms of the sources provides an indirect means for view-based query answering Query containment and view-based query containment: check whether the answers to one query are contained in the answers to another query, possibly taking into account the views in the mapping allow for establishing quality criteria of the answering process D. Calvanese Processing Regular Path Queries Using Views 8

View-based query answering Given: a semistructured data integration system I = G, S, M a source database D a 2RPQ Q over G a pair of objects (c, d) check whether (c, d) is a certain answer to Q wrt I and D A certain answer is a tuple that is in the answer to Q for every database B that satisfies I wrt D View-based query answering is the basic query processing task in data integration [Levy+al 95, Rajaraman+al 95, Abiteboul+Duschka 98, ICDE 00, PODS 00, LICS 00,... ] D. Calvanese Processing Regular Path Queries Using Views 9

View-based query answering for 2RPQs Technique based on search for a counterexample database: 1. it is sufficient to restrict the attention to counterexamples of a special form (canonical databases) 2. represent canonical databases by means of words 3. construct two-way finite automaton that accepts words encoding canonical counterexample databases 4. check for emptiness of the automaton D. Calvanese Processing Regular Path Queries Using Views 10

View-based query answering for 2RPQs Technique based on search for a counterexample database: 1. it is sufficient to restrict the attention to counterexamples of a special form (canonical databases) 2. represent canonical databases by means of words 3. construct two-way finite automaton that accepts words encoding canonical counterexample databases 4. check for emptiness of the automaton The non-emptiness of the automaton can be rephrased in terms of constraint satisfaction (CSP) tight relationship between view-based query answering and CSP D. Calvanese Processing Regular Path Queries Using Views 10-a

Constraint satisfaction problems Let A and B be relational structures over the same alphabet A homomorphism h is a mapping from A to B such that for every relation R, if (c 1,..., c n ) R(A), then (h(c 1 ),..., h(c n )) R(B). Non-uniform constraint satisfaction problem CSP(B): the set of relational structures A such that there is a homomorphism from A to B. Complexity: CSP(B) is in NP there are structures B for which CSP(B) is NP-hard D. Calvanese Processing Regular Path Queries Using Views 11

CSP and view-based query answering for 2RPQs Consider I = G, S, M and a 2RPQ query Q over G: We can define a relational structure CT Q,M, called constraint template of Q wrt M Given a source database D and two objects c, d, we can define another relational structure CI c,d D over the same alphabet, called the constraint instance CT Q,M can be computed in exponential time in Q and polynomial time in M D. Calvanese Processing Regular Path Queries Using Views 12

CSP and view-based query answering for 2RPQs Consider I = G, S, M and a 2RPQ query Q over G: We can define a relational structure CT Q,M, called constraint template of Q wrt M Given a source database D and two objects c, d, we can define another relational structure CI c,d D over the same alphabet, called the constraint instance CT Q,M can be computed in exponential time in Q and polynomial time in M Theorem: (c, d) is not a certain answer to Q wrt I and D if and only if there is a homomorphism from CI c,d D to CT Q,M, i.e, CI c,d D CSP(CT Q,M) Characterization of view-based query answering for 2RPQs in terms of CSP D. Calvanese Processing Regular Path Queries Using Views 12-a

Complexity of view-based answering for RPQs and 2RPQs From [ICDE 00] for RPQs and [PODS 00] for 2RPQs Assumption on Assumption on Complexity domain sources data expression combined all sound conp conp conp closed all exact conp conp conp arbitrary conp conp conp all sound conp PSPACE PSPACE open all exact conp PSPACE PSPACE arbitrary conp PSPACE PSPACE D. Calvanese Processing Regular Path Queries Using Views 13

Consequence of complexity results + The view-based query answering algorithm provides a set of answers that is sound and complete A conp data complexity does not allow for effective deployment of the query answering algorithm Note that conp-hardness holds already for queries and views that are unions of simple paths (no reflexive-transitive closure) D. Calvanese Processing Regular Path Queries Using Views 14

Consequence of complexity results + The view-based query answering algorithm provides a set of answers that is sound and complete A conp data complexity does not allow for effective deployment of the query answering algorithm Note that conp-hardness holds already for queries and views that are unions of simple paths (no reflexive-transitive closure) Adopt an indirect approach to answering queries to a data integration system, via query rewriting D. Calvanese Processing Regular Path Queries Using Views 14-a

Query rewriting A rewriting R of a query Q is a query over the source alphabet that, when evaluated over a source database D, provides only certain answers for Q We consider rewritings belonging to a certain class C (e.g., 2RPQs) We want rewritings that are maximal among those in C We aim at rewritings that are exact, i.e., equivalent to the query D. Calvanese Processing Regular Path Queries Using Views 15

Query rewriting A rewriting R of a query Q is a query over the source alphabet that, when evaluated over a source database D, provides only certain answers for Q We consider rewritings belonging to a certain class C (e.g., 2RPQs) We want rewritings that are maximal among those in C We aim at rewritings that are exact, i.e., equivalent to the query Example: Given s 1, s 2, s 3 and the mapping V s1 (b, a) b (article) a V s2 (p 1, p 2 ) p 1 (ref ) p 2 V s3 (p, t) p (title) t Consider Q(x, y) x (article (ref + ref ) title) y D. Calvanese Processing Regular Path Queries Using Views 15-a

Query rewriting A rewriting R of a query Q is a query over the source alphabet that, when evaluated over a source database D, provides only certain answers for Q We consider rewritings belonging to a certain class C (e.g., 2RPQs) We want rewritings that are maximal among those in C We aim at rewritings that are exact, i.e., equivalent to the query Example: Given s 1, s 2, s 3 and the mapping V s1 (b, a) b (article) a V s2 (p 1, p 2 ) p 1 (ref ) p 2 V s3 (p, t) p (title) t Consider Q(x, y) x (article (ref + ref ) title) y R 1 (x, y) x (s 1 s 2 s 3 ) y is an RPQ rewriting of Q D. Calvanese Processing Regular Path Queries Using Views 15-b

Query rewriting A rewriting R of a query Q is a query over the source alphabet that, when evaluated over a source database D, provides only certain answers for Q We consider rewritings belonging to a certain class C (e.g., 2RPQs) We want rewritings that are maximal among those in C We aim at rewritings that are exact, i.e., equivalent to the query Example: Given s 1, s 2, s 3 and the mapping V s1 (b, a) b (article) a V s2 (p 1, p 2 ) p 1 (ref ) p 2 V s3 (p, t) p (title) t Consider Q(x, y) x (article (ref + ref ) title) y R 1 (x, y) x (s 1 s 2 s 3 ) y is an RPQ rewriting of Q R 2 (x, y) x (s 1 (s 2 + s 2 ) s 3 ) y is a 2RPQ rewriting of Q D. Calvanese Processing Regular Path Queries Using Views 15-c

Query rewriting A rewriting R of a query Q is a query over the source alphabet that, when evaluated over a source database D, provides only certain answers for Q We consider rewritings belonging to a certain class C (e.g., 2RPQs) We want rewritings that are maximal among those in C We aim at rewritings that are exact, i.e., equivalent to the query Example: Given s 1, s 2, s 3 and the mapping V s1 (b, a) b (article) a V s2 (p 1, p 2 ) p 1 (ref ) p 2 V s3 (p, t) p (title) t Consider Q(x, y) x (article (ref + ref ) title) y R 1 (x, y) x (s 1 s 2 s 3 ) y is an RPQ rewriting of Q R 2 (x, y) x (s 1 (s 2 + s 2 ) s 3 ) y is a 2RPQ rewriting of Q R 3 (x, y) x (s 1 (s 2 + s 2 ) s 3 ) y is a 2RPQ-maximal rewriting of Q that is also exact D. Calvanese Processing Regular Path Queries Using Views 15-d

Complexity of query rewriting for RPQs and 2RPQs We consider RPQ/2RPQ queries and views, and rewritings that are RPQs/2RPQs: Existence of a nonempty rewriting is EXPSPACE-complete The shortest nonempty rewriting may be of double exponential size Existence of an exact rewriting is 2EXPSPACE-complete Upper bounds by automata-based techniques Lower bounds by reductions from bounded tiling problems (from [PODS 99, JCSS 02] for RPQs and [PODS 00] for 2RPQs) D. Calvanese Processing Regular Path Queries Using Views 16

Complexity of query rewriting for RPQs and 2RPQs We consider RPQ/2RPQ queries and views, and rewritings that are RPQs/2RPQs: Existence of a nonempty rewriting is EXPSPACE-complete The shortest nonempty rewriting may be of double exponential size Existence of an exact rewriting is 2EXPSPACE-complete Upper bounds by automata-based techniques Lower bounds by reductions from bounded tiling problems (from [PODS 99, JCSS 02] for RPQs and [PODS 00] for 2RPQs) Note that the complexity is in the size of the query and the views, and not in the size of the data D. Calvanese Processing Regular Path Queries Using Views 16-a

Query answering by rewriting To answer a query Q wrt a data integration system I = G, S, M and a source database D: 1. re-express Q in terms of the sources S, i.e., compute a rewriting of Q 2. directly evaluate the rewriting over D D. Calvanese Processing Regular Path Queries Using Views 17

Query answering by rewriting To answer a query Q wrt a data integration system I = G, S, M and a source database D: 1. re-express Q in terms of the sources S, i.e., compute a rewriting of Q 2. directly evaluate the rewriting over D Comparison with direct approach to query answering: + We can consider rewritings in a class with polynomial data complexity (e.g., 2RPQs) the data complexity for query answering is polynomial +/ We have traded expression complexity for data complexity We may lose completeness (i.e., not obtain all certain answers) D. Calvanese Processing Regular Path Queries Using Views 17-a

Query answering by rewriting To answer a query Q wrt a data integration system I = G, S, M and a source database D: 1. re-express Q in terms of the sources S, i.e., compute a rewriting of Q 2. directly evaluate the rewriting over D Comparison with direct approach to query answering: + We can consider rewritings in a class with polynomial data complexity (e.g., 2RPQs) the data complexity for query answering is polynomial +/ We have traded expression complexity for data complexity We may lose completeness (i.e., not obtain all certain answers) We need to establish the quality of a rewriting: When does the (maximal) rewriting compute all certain answers? What do we gain or lose by considering rewritings in a given class? D. Calvanese Processing Regular Path Queries Using Views 17-b

Query containment We need techniques to compare the results of queries and/or rewritings Basic task is query containment: Q 1 is contained in Q 2 if Q 1 (B) Q 2 (B) for every database B Complexity of containment for queries over semistructured data: Language Complexity RPQs PSPACE [PODS 99] 2RPQs PSPACE [PODS 00] Tree-2RPQs PSPACE [DBPL 01] Conjunctive-2RPQs EXPSPACE [KR 00] Datalog in Unions of C2RPQs 2EXPTIME [ICDT 03] D. Calvanese Processing Regular Path Queries Using Views 18

View-based containment In a data integration setting, traditional containment does not do the right job: we may need to compare a query over the global schema with a query over the sources we must take into account the information in the mapping M, considering also that views are sound D. Calvanese Processing Regular Path Queries Using Views 19

View-based containment In a data integration setting, traditional containment does not do the right job: we may need to compare a query over the global schema with a query over the sources we must take into account the information in the mapping M, considering also that views are sound We need to resort to view-based containment: Compare the results of two queries over the global schema / the sources for all source databases D and all databases B that satisfy M wrt D. for a query over the global schema, the result is the certain answers wrt D for a query over the sources, the result is the evaluation over D D. Calvanese Processing Regular Path Queries Using Views 19-a

Complexity of view-based containment for 2RPQs Let Q G i Q S i be a query over the global schema be a query over the sources Case Complexity (1) Q G 1 M Q G 2 NEXPTIME-complete (2) Q S 1 M Q G 2 PSPACE-complete (3) Q G 1 M Q S 2 NEXPTIME-complete (4) Q S 1 M Q S 2 PSPACE-complete Upper bounds exploit characterization of certain answers via CSP [PODS 03] Allows for: (2) establishing whether a given query over the sources is a rewriting (3) determining whether a rewriting is perfect (i.e, provides all certain answers) (4) comparing rewritings D. Calvanese Processing Regular Path Queries Using Views 20

Conclusions Established decidability and characterized complexity of fundamental query processing tasks for integration of semistructured data: view-based query answering view-based query rewriting query containment and view-based query containment Basic technical tools to establish upper bounds: two-way word automata characterization of certain answers in terms of constraint satisfaction D. Calvanese Processing Regular Path Queries Using Views 21

Conclusions Established decidability and characterized complexity of fundamental query processing tasks for integration of semistructured data: view-based query answering view-based query rewriting query containment and view-based query containment Basic technical tools to establish upper bounds: two-way word automata characterization of certain answers in terms of constraint satisfaction Extend results to exact views Further work Take into account constraints (e.g., DTDs, keys,... ) Identify and study most expressive well-behaved query language for semistructured data D. Calvanese Processing Regular Path Queries Using Views 21-a

Thank you! D. Calvanese Processing Regular Path Queries Using Views 22