Rewriting Ontology-Mediated Queries. Carsten Lutz University of Bremen


Data Access and Ontologies

Today, data is often highly incomplete and very heterogeneous. Examples include web data and large-scale data integration.

Example data: Director(ww), Person(jj), Movie(dbl), directed(jj, dbl)

An ontology is a semantic technology from AI: it adds domain knowledge to the data and interrelates diverging vocabularies.

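Example

Ontology: $\forall x\, \bigl( \mathrm{Director}(x) \rightarrow ( \mathrm{Person}(x) \wedge \exists y\, ( \mathrm{directed}(x,y) \wedge \mathrm{Movie}(y) ) ) \bigr)$

Query: $q(x) = \exists y\, ( \mathrm{Person}(x) \wedge \mathrm{directed}(x,y) \wedge \mathrm{Movie}(y) )$

Database: Person(jj), Movie(dbl), Director(ww), directed(jj, dbl)

Answers: jj, ww. Here ww is an answer only because of the ontology, which makes every director a person who directed some movie.

We will be interested in ontologies formulated in description logics, standardised as the Web Ontology Language OWL.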
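The example above can be replayed with a minimal Python sketch. It assumes facts are stored as tuples and applies the single ontology rule once, inventing a placeholder value (a "labeled null") for the unknown movie. The function names `chase_director_rule` and `answers` are illustrative and not taken from the talk.

```python
from itertools import count

# Facts are tuples (predicate, arg1, ...); the data is the slide's example database.
database = {
    ("Person", "jj"),
    ("Movie", "dbl"),
    ("Director", "ww"),
    ("directed", "jj", "dbl"),
}

def chase_director_rule(facts):
    """Apply Director(x) -> Person(x), directed(x, n), Movie(n) with a fresh null n."""
    fresh = count()
    result = set(facts)
    for fact in sorted(facts):
        if fact[0] == "Director":
            x = fact[1]
            null = f"_null{next(fresh)}"  # stands for the unknown movie
            result |= {("Person", x), ("directed", x, null), ("Movie", null)}
    return result

def answers(facts):
    """Evaluate q(x) = exists y. Person(x) and directed(x, y) and Movie(y)."""
    persons = {f[1] for f in facts if f[0] == "Person"}
    movies = {f[1] for f in facts if f[0] == "Movie"}
    directed = {(f[1], f[2]) for f in facts if f[0] == "directed"}
    return sorted(x for x in persons if any(s == x and o in movies for s, o in directed))

print(answers(database))                       # ['jj']
print(answers(chase_director_rule(database)))  # ['jj', 'ww']
```

Evaluating the query on the raw data misses ww; evaluating it after the rule has been applied returns both answers. Nulls never show up as answers here because they are not labeled Person.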

Ontology-Mediated Query

Ontology-mediated query (OMQ): a triple $Q = (O, \Sigma, q)$ where
- $O$ is an ontology,
- $\Sigma$ is a data signature (schema), possibly full,
- $q$ is a query, e.g. an atomic query (AQ) $A(x)$, a conjunctive query (CQ), or a union of conjunctive queries (UCQ).

Certain answer semantics: $a$ is an answer to $Q$ on a database $D$ iff $D \cup O \models q(a)$.

OMQ language: a pair $(L, Q)$ with $L$ an ontology language and $Q$ a query language, for example (EL, AQ), (ALC, UCQ), etc.

We are now interested in query answering, containment, rewriting, etc.

Query Rewriting

Attaining scalable querying on large-scale data is far from trivial. Existing database systems (SQL, Datalog) are highly optimised, but (mostly) not prepared to deal with ontologies.

Query rewriting (diagram): the query and the ontology are rewritten into a DB query, which the DB system evaluates over the data.

(Non-)Existence of Rewritings

Rewritings into SQL (= FO, for now) need not exist:
- O: $\forall x \forall y\, \bigl( (R(x,y) \wedge A(y)) \rightarrow A(x) \bigr)$
- q: $A(x)$
- The answers are all nodes that reach an A-labeled node along an R-path.

The same is true for Datalog:
- O: every node must be labeled with R or G or B (disjunction!), and any two endpoints of an edge with the same color satisfy D
- q: $\exists x\, D(x)$
- The query expresses non-3-colorability.

But ontologies from practice tend to have a very simple structure.

Challenges: construct rewritings when they exist, characterise their existence, etc.
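For the first example, the answers can still be computed by a recursive Datalog-style program even though no FO-rewriting exists. The following Python sketch evaluates the two rules P(x) <- A(x) and P(x) <- R(x, y), P(y) by naive fixpoint iteration. The predicate name P and the function `reach_a` are illustrative choices, not notation from the talk.

```python
def reach_a(edges, a_nodes):
    """Least fixpoint of: P(x) <- A(x);  P(x) <- R(x, y), P(y)."""
    derived = set(a_nodes)
    changed = True
    while changed:
        changed = False
        for x, y in edges:
            # x reaches an A-labeled node if its R-successor y does
            if y in derived and x not in derived:
                derived.add(x)
                changed = True
    return derived

chain = [(f"n{i}", f"n{i + 1}") for i in range(5)]  # n0 -> n1 -> ... -> n5
print(sorted(reach_a(chain, {"n5"})))
```

The number of iterations grows with the length of the R-path, which is the intuition behind the missing FO-rewriting: a fixed first-order formula can only look a bounded distance ahead.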

The EL-Family of Description Logics (a tgd / datalog world)

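EL Family of Description Logics

The basic description logic EL (existential language). For example:

$\mathrm{Director} \sqsubseteq \mathrm{Person} \sqcap \exists \mathrm{directed}.\mathrm{Movie}$

Often used for medical and bio-ontologies:

$\mathrm{Pericardium} \sqsubseteq \mathrm{Tissue} \sqcap \exists \mathrm{partOf}.\mathrm{Heart}$
$\mathrm{Pericarditis} \doteq \mathrm{Inflammation} \sqcap \exists \mathrm{location}.\mathrm{Pericardium}$
$\mathrm{Inflammation} \sqsubseteq \mathrm{Disease} \sqcap \exists \mathrm{actsOn}.\mathrm{Tissue}$

E.g. the medical ontology SNOMED CT, ~500 ontologies on BioPortal.

Spirit: a little semantics goes a long way.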

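EL Family of Description Logics

Concept formation: $\top \mid A \mid C \sqcap D \mid \exists r.C$

Ontology: a set of inclusions $C \sqsubseteq D$, for example $A \sqcap \exists r.B \sqsubseteq \exists r.A \sqcap \exists s.B$

(Figure: the inclusion drawn as a rule whose body is a node x labeled A with an r-successor labeled B, and whose head gives x an r-successor labeled A and an s-successor labeled B.)

EL ontology = set of tuple-generating dependencies with a single frontier variable and tree-shaped body and head.

ELI: extension of EL with inverse roles (i.e., body and head are now trees in the undirected sense).

(EL, CQ) and (ELI, CQ) have PTime data complexity and universal models.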
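As a worked instance of the tgd view, the example inclusion $A \sqcap \exists r.B \sqsubseteq \exists r.A \sqcap \exists s.B$ can be spelled out as follows. The variable names $y$, $z_1$, $z_2$ are chosen here for illustration; $x$ is the single frontier variable shared by body and head.

```latex
\forall x\, \forall y\, \Bigl( A(x) \wedge r(x,y) \wedge B(y)
  \;\rightarrow\;
  \exists z_1\, \exists z_2\, \bigl( r(x,z_1) \wedge A(z_1) \wedge s(x,z_2) \wedge B(z_2) \bigr) \Bigr)
```

Both sides are trees rooted at $x$: the body has one r-edge to $y$, the head has an r-edge to $z_1$ and an s-edge to $z_2$.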

Computing vs Deciding

Given an OMQ from (EL, CQ) or a related language, we would like to construct an (efficiently executable) FO-rewriting when it exists and report failure otherwise.

First aim: characterise and decide the existence of a rewriting, without necessarily computing it.

This allows us to study complexity and to abstract away from representation issues.

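Unraveling Tolerance

An OMQ $(O, \Sigma, A(x))$ is unraveling tolerant if for every $\Sigma$-database $D$:

$D \cup O \models A(a)$ iff $D^u_a \cup O \models A(a)$,

where $D^u_a$ is the unraveling of $D$ at $a$.

(Figure: a database $D$ with edges labeled r, s, t, and its tree-shaped unraveling $D^u_a$.)

Theorem [Lutz, Wolter KR12]
Every OMQ from (ELI, AQ) is unraveling tolerant.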
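To make the notion of unraveling concrete, here is a depth-bounded Python sketch. It assumes one common definition: elements of the unraveling are paths that start at the chosen individual, follow role edges forwards or backwards, and never immediately walk back along the edge they arrived by. The true unraveling is in general infinite, so the `depth` cut-off, the `"r-"` encoding of inverse steps, and the name `unravel` are assumptions made for illustration only.

```python
def unravel(facts, root, depth):
    """Depth-bounded unraveling of a database at `root`.

    facts: set of ("A", a) concept facts and ("r", a, b) role facts.
    Elements of the result are paths: tuples alternating individuals and roles.
    Assumes role names do not themselves end in "-".
    """
    concept = [(f[0], f[1]) for f in facts if len(f) == 2]
    steps = []  # (source, role, target); "r-" marks an edge traversed backwards
    for f in facts:
        if len(f) == 3:
            steps.append((f[1], f[0], f[2]))
            steps.append((f[2], f[0] + "-", f[1]))

    def inverse(role):
        return role[:-1] if role.endswith("-") else role + "-"

    result = set()
    frontier = [(root,)]
    for level in range(depth + 1):
        next_frontier = []
        for path in frontier:
            here = path[-1]
            # copy the concept labels of the individual this path ends in
            result |= {(name, path) for name, a in concept if a == here}
            if level == depth:
                continue
            for src, role, tgt in steps:
                if src != here:
                    continue
                # do not immediately walk back along the edge we arrived by
                if len(path) >= 3 and path[-2] == inverse(role) and path[-3] == tgt:
                    continue
                child = path + (role, tgt)
                if role.endswith("-"):
                    result.add((role[:-1], child, path))
                else:
                    result.add((role, path, child))
                next_frontier.append(child)
        frontier = next_frontier
    return result

triangle = {("r", "a", "b"), ("r", "b", "c"), ("r", "c", "a"), ("A", "c")}
tree = unravel(triangle, "a", 2)
print(len([f for f in tree if f[0] == "r"]))  # number of edges in the unraveled tree
```

On the cyclic triangle, the individual c is reached along two different paths and therefore appears as two separate tree nodes, both labeled A.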

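Characterizing Non-Rewritability

Unraveling tolerance enables a characterization of FO-rewritability.

Theorem [Bienvenu, Lutz, Wolter IJCAI13]
An OMQ $(O, \Sigma, A(x))$ in (ELI, AQ) is not FO-rewritable iff there are $\Sigma$-databases $D_1, D_2, D_3, D_4, \dots$ and $D'_1, D'_2, D'_3, D'_4, \dots$ such that for all $i \geq 1$:

$D_i \cup O \models A(a_0)$, but $D'_i \cup O \not\models A(a_0)$.

(Figure: the databases $D_1, \dots, D_4$ and $D'_1, \dots, D'_4$, annotated with the numbers 1 to 4.)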

Decidability and Complexity

Only two more steps to decidability and tight upper bounds:
- bound the depth of tree-databases to consider (single exponential) via a pumping argument
- use tree automata to check the existence of a tree-database exceeding the bound

For CQs, tree databases are not quite trees anymore.

Theorem [Bienvenu, Lutz, Wolter IJCAI13; Bienvenu, Hansen, Lutz, Wolter IJCAI16]
Complexity of FO-rewritability:

OMQ language                     Complexity
(EL, AQ), (EL, CQ), (ELI, AQ)    ExpTime-complete
(ELI, CQ)                        2ExpTime-complete

Constructing FO-Rewritings: Backwards Chaining

Proposed in [König, Leclère, Mugnier, Thomazo RR12] for tgds, here adapted to (EL, AQ).

Ontology:
$\exists r.A \sqcap \exists r.B \sqsubseteq A_0$
$\exists r.\exists s.\top \sqsubseteq A_0$
$\exists s.B \sqsubseteq B$

Query: $A_0(x)$

(Figure: tree-shaped queries produced by backwards chaining from $A_0$, with r- and s-edges and node labels A and B.)

Termination for positive cases is guaranteed; general termination is achievable via a pumping argument [Hansen, Lutz, Seylan, Wolter IJCAI15].

Yields a UCQ-rewriting (cf. Rossman's homomorphism theorem).

But the UCQ representation of the rewriting quickly grows out of bounds.
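The backwards chaining steps can be traced by hand. The following derivation is a sketch that assumes the slide's ontology reads $\exists r.A \sqcap \exists r.B \sqsubseteq A_0$, $\exists r.\exists s.\top \sqsubseteq A_0$, $\exists s.B \sqsubseteq B$ (a reconstruction from the garbled transcript); the names $q_0, \dots, q_3$ are introduced here for illustration.

```latex
\begin{align*}
q_0(x) &= A_0(x) \\
q_1(x) &= \exists y\, \exists z\, \bigl( r(x,y) \wedge A(y) \wedge r(x,z) \wedge B(z) \bigr)
  && \text{from } \exists r.A \sqcap \exists r.B \sqsubseteq A_0 \\
q_2(x) &= \exists y\, \exists z\, \bigl( r(x,y) \wedge s(y,z) \bigr)
  && \text{from } \exists r.\exists s.\top \sqsubseteq A_0 \\
q_3(x) &= \exists y\, \exists z\, \exists w\, \bigl( r(x,y) \wedge A(y) \wedge r(x,z) \wedge s(z,w) \wedge B(w) \bigr)
  && \text{from } q_1 \text{ and } \exists s.B \sqsubseteq B
\end{align*}
```

Under this reading, $q_2$ maps homomorphically into $q_3$ (send $y \mapsto z$, $z \mapsto w$), so $q_3$ is subsumed by $q_2$ and can be discarded. The same holds for every further unfolding of $B$, which suggests that the process stops with the UCQ $q_0 \vee q_1 \vee q_2$.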

Constructing FO-Rewritings Efficiently

Backwards chaining can be realized in a decomposed calculus [Hansen, Lutz, Seylan, Wolter IJCAI15].

It implements structure sharing and generates a non-recursive Datalog rewriting.

Digression: Linear Datalog Rewritability

Since SQL3 (1999), linear recursion is available (unlike in FO). This suggests linear Datalog as a target language for rewritings.

Again admits elegant characterisations:

Theorem [Lutz, Sabellek, submitted]
An OMQ $(O, \Sigma, A(x))$ in (EL, AQ) is not LDLog-rewritable iff there are tree-shaped $\Sigma$-databases $D_1, D_2, \dots$ such that
- $D_i \cup O \models A(a_0)$ and $D_i$ is $\subseteq$-minimal with this property,
- $D_i$ contains a full binary tree of depth $i$ as a minor.

From this (via other steps): decidability and ExpTime-completeness, LDLog-rewritable iff in NL data complexity, and other interesting things.

The ALC-Family of Description Logics (a modal logic world)

ALC Family of Description Logics

ALC extends EL with negation, disjunction, and universal quantification (attributive language with complement).

ALC concept formation: $\top \mid \bot \mid A \mid \neg C \mid C \sqcap D \mid C \sqcup D \mid \exists r.C \mid \forall r.D$

For example: $\mathrm{Director} \sqsubseteq \mathrm{Person} \sqcap \exists \mathrm{directed}.(\mathrm{Movie} \sqcup \mathrm{TVseries})$

These features are costly:
- (ALC, AQ) and (ALC, CQ) are coNP-complete in data complexity
- there are no universal models

Again there are many extensions, e.g. with inverse roles: ALCI.

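No Unraveling Tolerance

ALC is NOT unraveling tolerant:

Ontology:
$\exists x.\exists y.P \sqcap \exists y.\exists x.P \sqsubseteq A_0$
$\exists x.\exists y.\neg P \sqcap \exists y.\exists x.\neg P \sqsubseteq A_0$

Query: $A_0(x)$

(Figure: a database with x- and y-edges in which a node is marked "P?" and $A_0$ is derived, shown next to its unraveling.)

How can we find useful characterizations / algorithms?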

Recap: Constraint Satisfaction Problems

CSPs emerged in AI and can be viewed as generalized coloring problems.

Several equivalent definitions; here: homomorphism problems.

A template is a finite relational structure $T$. CSP($T$) is:
- Given: a finite relational structure (i.e., a database) $S$
- Question: $S \rightarrow T$? I.e., is there a homomorphism from $S$ to $T$?

(Figure: an example input $S$ and the template $T$ with elements R, G, B.)
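The homomorphism view can be illustrated with a brute-force Python sketch for structures with a single binary relation. It tries every mapping from the nodes of $S$ to the nodes of $T$, so it is exponential and meant only to make the definition concrete. The helper names `has_homomorphism` and `symmetric` are illustrative, and the template is assumed to be the triangle on R, G, B, for which CSP($T$) is 3-colorability.

```python
from itertools import product

def has_homomorphism(source_edges, template_edges):
    """Brute-force test for a homomorphism from S to T, both given as sets of edges.

    Nodes that occur in no edge are ignored; they can be mapped anywhere.
    """
    s_nodes = sorted({v for e in source_edges for v in e})
    t_nodes = sorted({v for e in template_edges for v in e})
    t_edges = set(template_edges)
    for image in product(t_nodes, repeat=len(s_nodes)):
        h = dict(zip(s_nodes, image))
        # h is a homomorphism if every edge of S is mapped to an edge of T
        if all((h[u], h[v]) in t_edges for u, v in source_edges):
            return True
    return False

def symmetric(edges):
    """Close a set of edges under reversal (undirected graph)."""
    return set(edges) | {(v, u) for u, v in edges}

K3 = symmetric({("R", "G"), ("G", "B"), ("B", "R")})
cycle5 = symmetric({(i, (i + 1) % 5) for i in range(5)})
K4 = symmetric({(i, j) for i in range(4) for j in range(i + 1, 4)})

print(has_homomorphism(cycle5, K3))  # True: a 5-cycle is 3-colorable
print(has_homomorphism(K4, K3))      # False: K4 needs four colors
```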

OMQ and CSP

A BAQ is a query of the form $\exists x\, A(x)$.

Theorem [Bienvenu, ten Cate, Lutz, Wolter PODS13]
Every OMQ
- from (ALCI, BAQ) is equivalent to the complement of a CSP,
- from (ALCI, AQ) is equivalent to the complement of a multi-template CSP with a single constant.
(The converses are actually also true!)

The construction incurs an (unavoidable) exponential blowup.

Theorem [Larose, Loten, Tardif LMCS07]
FO-definability of (co)CSPs is NP-complete.

FO-Rewritability

Theorem [Bienvenu, ten Cate, Lutz, Wolter PODS13]
FO-rewritability in (ALCI, BAQ) and (ALCI, AQ) is NExpTime-complete.

Characterization: an OMQ $(O, \Sigma, \exists x\, A(x))$ is not FO-rewritable iff there are $\Sigma$-databases $D_1, D_2, D_3, D_4, \dots$ and $D'_2, D'_3, D'_4, \dots$ such that for all $i \geq 1$:

$D_i \cup O \not\models \exists x\, A(x)$, but $D'_i \cup O \models \exists x\, A(x)$.

Steps:
- bound the size of $D_1$ (actually $D_1 = T^2$!)
- establish a bound on $i$ by pumping
- do some further magic (since the bound is double exponential)

Shape of FO-Rewritings

Corollary of [Atserias07EuJComb, RossmanJACM08]
If an OMQ from (ALCI, BAQ) is FO-rewritable, then it is UCQ-rewritable.

This can be improved further:
- UCQ-rewritability implies monadic Datalog (MDLog)-rewritability
- MDLog-rewritability is equivalent to unraveling tolerance [FederVardi98]
- one can thus replace every CQ in a UCQ-rewriting with all its tree-shaped identifications

Theorem
If an OMQ from (ALCI, BAQ) is FO-rewritable, then it is tUCQ-rewritable.

Wanted: a more practical procedure.

For non-Boolean queries, this is not true: cycles through the answer variable can occur, but no other cycles.

Datalog-Rewritability

Theorem [BartoKozikFOCS09, BartoJLC16]
1. Datalog-rewritability of coCSPs is decidable and NP-complete.
2. Whenever a coCSP is Datalog-rewritable, there is a width two rewriting.

Theorem [Bienvenu, ten Cate, Lutz, Wolter PODS13]
Datalog-rewritability in (ALCI, BAQ) and (ALCI, AQ) is NExpTime-complete.

Canonical Datalog program of width two [FederVardiSIAMJComp98]:
- derives everything that any width two rewriting could ever derive
- is a rewriting whenever there is one

Unfortunate: it is of double exponential size, even in the best case.

From AQs to UCQs

We replace atomic queries in OMQs with unions of conjunctive queries.

Theorem [Bienvenu, ten Cate, Lutz, Wolter PODS13]
Every OMQ from (ALCI, UCQ) is equivalent to a monadic disjunctive Datalog (MDDLog) program; the converse also holds.

The translation involves a double exponential blowup.

Beyond CSP in expressive power, but CSP is still a valuable tool.

From MDDLog to coCSP

An MDDLog program is simple if each rule body contains a single EDB atom and this atom contains all body variables, exactly once.

[FederVardiSIAMJComp98]:
- Each MDDLog program can be translated into a simple one of the same complexity (signature change, exponential blowup).
- Each simple MDDLog program is equivalent to the complement of a CSP (exponential blowup).

Important: this cannot happen in structures of high girth!

Reducing Rewritability

Translation of $\Pi$ to a simple program $\Pi_S$:
- a UCQ-rewriting of $\Pi_S$ yields a UCQ-rewriting of $\Pi$
- a UCQ-rewriting of $\Pi$ yields a UCQ-rewriting of $\Pi_S$ that is
  - unconditionally complete (that is, whenever $D \models \Pi_S$, the rewriting is also true on $D$)
  - sound on inputs of girth > rule diameter of $\Pi$

The same is true for monadic Datalog and for Datalog.

But if there is such a flawed UCQ-rewriting, there is also a good one:

Girth Lemma
For all coCSPs and $k > 0$: UCQ-rewritability on inputs of girth $> k$ implies UCQ-rewritability.

Can be proved using a combinatorial lemma due to Erdős / Feder and Vardi.

FO-Rewritability: Results

Theorem [Feier, Kuusisto, Lutz ICDT17; Bourhis, Lutz KR16]
FO-rewritability is decidable and 2NExpTime-complete in MDDLog and in (ALCI, UCQ).

Note: the 2-exponential succinctness gap does not materialise in the complexity.

The result also holds for non-Boolean queries (non-trivial, involves blowups).

The approach also serves to analyse the shape of rewritings:

Theorem [Feier, Kuusisto, Lutz ICDT17]
In (ALCI, BUCQ), every FO-rewritable OMQ has a UCQ-rewriting in which every CQ has tree-width $(1, k)$, $k = \max\{2, q\}$.

Again not quite true for non-Boolean queries (tree-width with parameters).

Datalog-Rewritability

It is unclear whether the girth lemma holds for Datalog-rewritability.

Observations:
- it does hold for monadic Datalog-rewritability, thus we obtain decidability for that (between 2NExpTime and 3ExpTime)
- CSPs constructed from MDDLog programs that have equality, that is, there is a binary EDB eq such that for all IDBs $P$:
  $P(x) \wedge \mathrm{eq}(x,y) \rightarrow P(y)$ and $P(y) \wedge \mathrm{eq}(x,y) \rightarrow P(x)$

Theorem [Feier, Kuusisto, Lutz ICDT17]
For MDDLog programs that have equality, Datalog-rewritability is 2NExpTime-complete.

Every MDDLog program can be extended with equality. It is unclear whether this preserves Datalog-rewritability (for CSPs it does).

Future Directions

Natural next questions:
- Is Datalog-rewritability of MDDLog programs decidable?
- Find practically feasible procedures for computing rewritings for OMQs from the ALC family.

More general question: there are many querying-based classes of problems for which people study complexity classification and rewriting, e.g. consistent query answering, deletion propagation, etc. Can we understand better how they interrelate?