Rewriting Ontology-Mediated Queries
Carsten Lutz
University of Bremen
Data Access and Ontologies
Today, data is often highly incomplete and very heterogeneous.
Examples include web data and large-scale data integration.
Example facts: Director(ww), Person(jj), Movie(dbl), directed(jj, dbl)
Ontologies are a semantic technology from AI: they add domain knowledge to the data and interrelate diverging vocabularies.
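Example
Ontology: $\forall x\,(\mathsf{Director}(x) \rightarrow (\mathsf{Person}(x) \wedge \exists y\,(\mathsf{directed}(x,y) \wedge \mathsf{Movie}(y))))$
Query: $q(x) = \exists y\,(\mathsf{Person}(x) \wedge \mathsf{directed}(x,y) \wedge \mathsf{Movie}(y))$
Database: Person(jj), Movie(dbl), Director(ww), directed(jj, dbl)
Answers: jj, ww (for ww, the ontology yields Person(ww) and a directed-edge to some Movie)
We will be interested in ontologies formulated in description logics, standardised as the web ontology language OWL.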
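The example above can be replayed in a few lines of Python. This is only a minimal sketch under stated assumptions: facts are encoded as tuples, the single ontology rule is hard-coded, and the names `DB`, `chase`, `answers` and the `_null_` prefix are hypothetical choices made for illustration. Because the rule is not recursive, one pass over the Director facts suffices here; this is not a general chase procedure.

```python
# Facts are tuples: (predicate, arg) or (predicate, arg1, arg2).
DB = {
    ("Person", "jj"),
    ("Movie", "dbl"),
    ("Director", "ww"),
    ("directed", "jj", "dbl"),
}

def chase(facts):
    """Apply Director(x) -> Person(x) & exists y. directed(x, y) & Movie(y).

    The rule is not recursive, so a single pass over the input is enough.
    """
    result = set(facts)
    for fact in facts:
        if fact[0] == "Director":
            x = fact[1]
            null = f"_null_{x}"  # fresh labelled null for the unknown movie
            result |= {("Person", x), ("directed", x, null), ("Movie", null)}
    return result

def answers(facts):
    """Evaluate q(x) = exists y. Person(x) & directed(x, y) & Movie(y)."""
    persons = {f[1] for f in facts if f[0] == "Person"}
    movies = {f[1] for f in facts if f[0] == "Movie"}
    return {
        f[1]
        for f in facts
        if f[0] == "directed"
        and f[1] in persons
        and f[2] in movies
        and not f[1].startswith("_null_")  # answers must be database constants
    }

print(sorted(answers(DB)))         # query on the raw data only
print(sorted(answers(chase(DB))))  # certain answers with the ontology
```

Evaluating the query on the raw data misses ww, while evaluating it after applying the rule returns both jj and ww, matching the answers listed on the slide.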
Ontology-Mediated Query
Ontology-mediated query (OMQ): a triple $Q = (\mathcal{O}, \Sigma, q)$ where
- $\mathcal{O}$ is an ontology,
- $\Sigma$ is a data signature (schema), possibly full,
- $q$ is a query, e.g. an atomic query (AQ) $A(x)$, a conjunctive query (CQ), or a union of conjunctive queries (UCQ).
Certain answer semantics: $a$ is an answer to $Q$ on $D$ iff $D \cup \mathcal{O} \models q(a)$.
OMQ language: a pair $(\mathcal{L}, \mathcal{Q})$ with $\mathcal{L}$ an ontology language and $\mathcal{Q}$ a query language, for example (EL, AQ), (ALC, UCQ), etc.
We are now interested in query answering, containment, rewriting, etc.
Query Rewriting
Attaining scalable querying on large-scale data is far from trivial.
Existing database systems (SQL, Datalog) are highly optimised, but (mostly) not prepared to deal with ontologies.
Query rewriting: [Diagram: the query and the ontology are rewritten into a DB query, which the DB evaluates over the data]
(Non-)Existence of Rewritings
Rewritings into SQL (= FO, for now) need not exist:
$\mathcal{O}$: $\forall x \forall y\,((R(x,y) \wedge A(y)) \rightarrow A(x))$
$q$: $A(x)$
= all nodes that reach an $A$-labeled node along an $R$-path
The same is true for Datalog:
$\mathcal{O}$: every node must be labeled with R or G or B (disjunction!), and any two endpoints of an edge with the same color satisfy $D$
$q$: $\exists x\, D(x)$
= non-3-colorability
But ontologies from practice tend to have a very simple structure.
Challenges: construct rewritings when they exist, characterise their existence, etc.
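To make the first example concrete, the following sketch computes the certain answers to A(x) by iterating the rule A(x) <- R(x, y), A(y) to a fixpoint. The function names `certain_answers_A` and `chain` and the edge-list encoding are assumptions made for illustration only. The point is that the number of rounds grows with the length of the R-path, which gives an intuition for why no single fixed FO query can capture all cases.

```python
def certain_answers_A(r_edges, a_nodes):
    """Least fixpoint of A(x) <- R(x, y), A(y), seeded with the stored A facts.

    Returns the set of derived A nodes and the number of productive rounds.
    """
    derived = set(a_nodes)
    rounds = 0
    while True:
        new = {x for (x, y) in r_edges if y in derived} - derived
        if not new:
            return derived, rounds
        derived |= new
        rounds += 1

def chain(n):
    """R-path v0 -> v1 -> ... -> vn where only the last node is labeled A."""
    edges = [(f"v{i}", f"v{i + 1}") for i in range(n)]
    return edges, {f"v{n}"}

for n in (3, 10):
    edges, labeled = chain(n)
    derived, rounds = certain_answers_A(edges, labeled)
    print(n, len(derived), rounds)
```

On a chain of length n, every node ends up in the answer and exactly n rounds are needed, so the required join depth is not bounded independently of the data.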
The EL-Family of Description Logics (a tgd / datalog world)
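EL Family of Description Logics
The basic description logic is EL (existential language).
For example: $\mathsf{Director} \sqsubseteq \mathsf{Person} \sqcap \exists \mathsf{directed}.\mathsf{Movie}$
Often used for medical and bio-ontologies:
$\mathsf{Pericardium} \sqsubseteq \mathsf{Tissue} \sqcap \exists \mathsf{partOf}.\mathsf{Heart}$
$\mathsf{Pericarditis} \equiv \mathsf{Inflammation} \sqcap \exists \mathsf{location}.\mathsf{Pericardium}$
$\mathsf{Inflammation} \sqsubseteq \mathsf{Disease} \sqcap \exists \mathsf{actsOn}.\mathsf{Tissue}$
E.g. the medical ontology SNOMED CT, ~500 ontologies on BioPortal
Spirit: a little semantics goes a long way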
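EL Family of Description Logics
Concept formation: $\top$, $A$, $C \sqcap D$, $\exists r.C$
Ontology: set of concept inclusions $C \sqsubseteq D$
Example: $A \sqcap \exists r.B \sqsubseteq \exists r.A \sqcap \exists s.B$, which as a dependency reads $A(x) \wedge r(x,y) \wedge B(y) \rightarrow \exists y_1 \exists y_2\,(r(x,y_1) \wedge A(y_1) \wedge s(x,y_2) \wedge B(y_2))$
An EL ontology is a set of tuple-generating dependencies with a single frontier variable and tree-shaped body and head.
ELI: extension of EL with inverse roles (i.e., body and head are now trees in the undirected sense)
(EL, CQ) and (ELI, CQ) have PTime data complexity and universal models.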
Computing vs Deciding
Given an OMQ from (EL, CQ) or a related language, we would like to construct an (efficiently executable) FO-rewriting when it exists and report failure otherwise.
First aim: characterise and decide the existence of a rewriting, without necessarily computing it.
This allows us to study complexity and to abstract away from representation issues.
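Unraveling Tolerance
An OMQ $(\mathcal{O}, \Sigma, A(x))$ is unraveling tolerant if for every $\Sigma$-database $D$: $D \cup \mathcal{O} \models A(a)$ iff $D^u_a \cup \mathcal{O} \models A(a)$, where $D^u_a$ is the unraveling of $D$ at $a$.
[Figure: a database $D$ and its tree-shaped unraveling $D^u_a$, with edges labeled r, s, t]
Theorem [L_WolterKR12]: Every OMQ from (ELI, AQ) is unraveling tolerant.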
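The unraveling of a database at an element can be sketched as follows. This is a deliberately simplified illustration and not the exact construction from the cited work: it follows edges in the forward direction only (the ELI setting also involves inverse roles), it cuts the generally infinite unraveling at a given depth, and the function name `unravel` and the tuple encoding of facts are assumptions. Elements of the unraveling are paths starting at the chosen root.

```python
def unravel(unary, binary, root, depth):
    """Forward unraveling of a database at `root`, cut off at `depth`.

    unary:  list of (concept, element) facts
    binary: list of (role, source, target) facts
    Returns (paths, edges, labels); each element of the result is a path,
    i.e. a tuple (root, role1, elem1, role2, elem2, ...).
    """
    paths = [(root,)]
    frontier = [(root,)]
    edges = set()
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            for (role, src, dst) in binary:
                if src == path[-1]:
                    extended = path + (role, dst)
                    edges.add((role, path, extended))
                    next_frontier.append(extended)
        paths.extend(next_frontier)
        frontier = next_frontier
    # A path inherits the unary facts of the element it ends in.
    labels = {(c, p) for p in paths for (c, e) in unary if e == p[-1]}
    return paths, edges, labels

# A diamond: two different routes from a lead to the same element d.
binary = [("x", "a", "b"), ("y", "a", "c"), ("y", "b", "d"), ("x", "c", "d")]
unary = [("P", "d")]
paths, edges, labels = unravel(unary, binary, "a", 2)
copies_of_d = [p for p in paths if p[-1] == "d"]
print(len(paths), len(copies_of_d))
```

In the diamond-shaped input, the element d is reachable along two routes, so the unraveling contains two separate copies of it. Unraveling tolerance says that, for the OMQ at hand, this duplication never changes the answer at the root.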
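Characterizing Non-Rewritability
Unraveling tolerance enables a characterization of FO-rewritability.
Theorem [BienvenuL_WolterIJCAI13]: An OMQ $(\mathcal{O}, \Sigma, A(x))$ in (ELI, AQ) is not FO-rewritable iff there are $\Sigma$-databases $D_1, D_2, D_3, \dots$ and $D'_1, D'_2, D'_3, \dots$ such that for all $i \geq 1$: $D_i \cup \mathcal{O} \models A(a_0)$, but $D'_i \cup \mathcal{O} \not\models A(a_0)$
[Figure: databases $D_1, \dots, D_4$ together with $D'_1, \dots, D'_4$]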
Decidability and Complexity
Only two more steps to decidability and tight upper bounds:
- bound the depth of tree-databases to consider (single exponential) via a pumping argument
- use tree automata to check the existence of a tree-database exceeding the bound
For CQs, tree databases are not quite trees anymore.
Theorem [BienvenuL_WolterIJCAI13, BienvenuHansenL_WolterIJCAI16]: Complexity of FO-rewritability:
(EL, AQ), (EL, CQ), (ELI, AQ): ExpTime-complete
(ELI, CQ): 2ExpTime-complete
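Constructing FO-Rewritings: Backwards Chaining
Proposed in [KönigLeclereMugnierThomazoRR12] for tgds, here adapted to (EL, AQ):
Ontology: $\exists r.A \sqcap \exists r.B \sqsubseteq A_0$, $\exists r.\exists s.\top \sqsubseteq A_0$, $\exists s.B \sqsubseteq B$
Query: $A_0(x)$
[Figure: tree-shaped queries produced by backwards chaining from $A_0$, over the roles r, s and the concept names A, B]
Termination for positive cases is guaranteed; general termination is achievable via a pumping argument [HansenL_SeylanWolterIJCAI15].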
Constructing FO-Rewritings: Backwards Chaining (continued)
Backwards chaining yields a UCQ-rewriting (cf. Rossman's homomorphism theorem).
But the UCQ representation of the rewriting quickly grows out of bounds.
Constructing FO-Rewritings Efficiently
Backwards chaining can be realized in a decomposed calculus [HansenL_SeylanWolterIJCAI15] that implements structure sharing and generates a non-recursive Datalog rewriting.
Digression: Linear Datalog Rewritability
Since SQL3 (1999), linear recursion is available in SQL (unlike in FO).
This suggests linear Datalog (LDLog) as a target language for rewritings.
It again admits elegant characterisations:
Theorem [L_SabellekSubmitted]: An OMQ $(\mathcal{O}, \Sigma, A(x))$ in (EL, AQ) is not LDLog-rewritable iff there are tree-shaped $\Sigma$-databases $D_1, D_2, \dots$ such that
- $D_i \cup \mathcal{O} \models A(a_0)$ and $D_i$ is $\subseteq$-minimal with this property
- $D_i$ contains a full binary tree of depth $i$ as a minor
From this (via other steps) we obtain decidability and ExpTime-completeness, LDLog-rewritable iff in NL data complexity, and other interesting things.
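As an illustration of linear recursion in SQL, the sketch below evaluates the linear Datalog program A'(x) <- A(x), A'(x) <- R(x, y), A'(y) with a recursive common table expression, run through Python's standard `sqlite3` module. The table and column names (`R`, `A`, `derivedA`, `src`, `dst`, `node`) and the sample data are assumptions chosen for this example; the recursion is linear because the recursive part refers to `derivedA` only once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R(src TEXT, dst TEXT)")
con.execute("CREATE TABLE A(node TEXT)")
con.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("d", "e")],
)
con.execute("INSERT INTO A VALUES ('c')")

# Linear recursion: each recursive step joins R with derivedA exactly once.
rows = con.execute(
    """
    WITH RECURSIVE derivedA(node) AS (
        SELECT node FROM A
        UNION
        SELECT R.src FROM R JOIN derivedA ON R.dst = derivedA.node
    )
    SELECT node FROM derivedA ORDER BY node
    """
).fetchall()
reachable = [row[0] for row in rows]
print(reachable)
con.close()
```

With the sample data, the nodes a, b and c reach the A-labeled node c along an R-path, while d and e do not.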
The ALC-Family of Description Logics (a modal logic world)
ALC Family of Description Logics
ALC (attributive language with complement) extends EL with negation, disjunction, and universal quantification.
ALC concept formation: $\top$, $\bot$, $A$, $\neg C$, $C \sqcap D$, $C \sqcup D$, $\exists r.C$, $\forall r.D$
For example: $\mathsf{Director} \sqsubseteq \mathsf{Person} \sqcap \exists \mathsf{directed}.(\mathsf{Movie} \sqcup \mathsf{TVseries})$
These features are costly:
- (ALC, AQ) and (ALC, CQ) are coNP-complete in data complexity
- there are no universal models
Again many extensions, e.g. with inverse roles: ALCI
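No Unraveling Tolerance
ALC is NOT unraveling tolerant:
Ontology: $\exists x.\exists y.P \sqcap \exists y.\exists x.P \sqsubseteq A_0$, $\exists x.\exists y.\neg P \sqcap \exists y.\exists x.\neg P \sqsubseteq A_0$
Query: $A_0(x)$
[Figure: a database with edges labeled x and y and elements marked "P?", shown next to its unraveling]
How can we find useful characterizations / algorithms?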
Recap: Constraint Satisfaction Problems
CSPs emerged in AI and can be viewed as generalized coloring problems.
Several equivalent definitions exist; here: homomorphism problems.
A template is a finite relational structure $T$. CSP($T$) is:
Given: a finite relational structure (i.e., a database) $S$
Question: $S \rightarrow T$? I.e., is there a homomorphism from $S$ to $T$?
For example: [Figure: an input graph $S$ and a template $T$ with the nodes R, G, B]
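The homomorphism view of CSPs can be made concrete with a brute-force check. The sketch below is an illustration under simplifying assumptions: structures are restricted to a single binary relation (directed edge lists), the search simply enumerates all maps and is exponential, and the names `has_homomorphism`, `complete_graph` and `K3` are hypothetical. With the complete graph on R, G, B as the template, the check decides 3-colorability of the input.

```python
from itertools import product

def has_homomorphism(s_nodes, s_edges, t_nodes, t_edges):
    """Return True iff some map h: S -> T sends every edge of S to an edge of T."""
    t_edge_set = set(t_edges)
    s_nodes = list(s_nodes)
    for image in product(t_nodes, repeat=len(s_nodes)):
        h = dict(zip(s_nodes, image))
        if all((h[u], h[v]) in t_edge_set for (u, v) in s_edges):
            return True
    return False

def complete_graph(nodes):
    """All ordered pairs of distinct nodes, i.e. a symmetric, loop-free clique."""
    return [(u, v) for u in nodes for v in nodes if u != v]

COLORS = ["R", "G", "B"]
K3 = complete_graph(COLORS)  # template for 3-colorability

triangle_nodes = [1, 2, 3]
k4_nodes = [1, 2, 3, 4]
print(has_homomorphism(triangle_nodes, complete_graph(triangle_nodes), COLORS, K3))
print(has_homomorphism(k4_nodes, complete_graph(k4_nodes), COLORS, K3))
```

A triangle maps homomorphically to the template, whereas the complete graph on four nodes does not, since it is not 3-colorable.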
OMQ and CSP
A BAQ is a query of the form $\exists x\, A(x)$.
Theorem [BienvenuTenCateL_WolterPODS13]: Every OMQ
- from (ALCI, BAQ) is equivalent to the complement of a CSP
- from (ALCI, AQ) is equivalent to the complement of a multi-template CSP with a single constant
(The converses are actually also true!)
The construction incurs an (unavoidable) exponential blowup.
Theorem [LaroseLotenTardifLMCS07]: FO-definability of (co)CSPs is NP-complete.
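FO-Rewritability
Theorem [BienvenuTenCateL_WolterPODS13]: FO-rewritability in (ALCI, BAQ) and (ALCI, AQ) is NExpTime-complete.
Characterization: An OMQ $(\mathcal{O}, \Sigma, \exists x\, A(x))$ is not FO-rewritable iff there are $\Sigma$-databases $D_1, D_2, D_3, \dots$ and $D'_1, D'_2, D'_3, \dots$ such that for all $i \geq 1$: $D_i \cup \mathcal{O} \not\models \exists x\, A(x)$, but $D'_i \cup \mathcal{O} \models \exists x\, A(x)$
[Figure: databases $D_1, \dots, D_4$ together with $D'_2, D'_3, D'_4$]
Proof steps: bound the size of $D_1$ (actually $D_1 = T^2$!), establish a bound on $i$ by pumping, and do some further magic (since the bound is double exponential).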
Shape of FO-Rewritings
Corollary of [Atserias07EuJComb, RossmanJACM08]: If an OMQ from (ALCI, BAQ) is FO-rewritable, then it is UCQ-rewritable.
This can be improved further:
- UCQ-rewritability implies monadic Datalog (MDLog)-rewritability
- MDLog-rewritability is equivalent to unraveling tolerance [FederVardi98]
- one can thus replace every CQ in a UCQ-rewriting with all its tree-shaped identifications
Theorem: If an OMQ from (ALCI, BAQ) is FO-rewritable, then it is tUCQ-rewritable.
Wanted: a more practical procedure.
For non-Boolean queries, this is not true: cycles through the answer variable can occur, but no other cycles.
Datalog-Rewritability
Theorem [BartoKozikFOCS09, BartoJLC16]:
1. Datalog-rewritability of coCSPs is decidable and NP-complete
2. Whenever a coCSP is Datalog-rewritable, there is a width two rewriting
Theorem [BienvenuTenCateL_WolterPODS13]: Datalog-rewritability in (ALCI, BAQ) and (ALCI, AQ) is NExpTime-complete.
Canonical Datalog program of width two [FederVardiSIAMJComp98]:
- derives everything that any width two rewriting could ever derive
- is a rewriting whenever there is one
Unfortunate: it is of double exponential size, even in the best case.
From AQs to UCQs
We replace atomic queries in OMQs with unions of conjunctive queries.
Theorem [BienvenuTenCateL_WolterPODS13]: Every OMQ from (ALCI, UCQ) is equivalent to a monadic disjunctive Datalog (MDDLog) program; the converse also holds.
The translation involves a double exponential blowup.
This goes beyond CSP in expressive power, but CSP is still a valuable tool.
From MDDLog to coCSP
An MDDLog program is simple if each rule body contains a single EDB atom and this atom contains all body variables, each exactly once.
[FederVardiSIAMJComp98]:
- Each MDDLog program can be translated into a simple one of the same complexity (signature change, exponential blowup)
- Each simple MDDLog program is equivalent to the complement of a CSP (exponential blowup)
Important: this cannot happen in structures of high girth!
Reducing Rewritability
Translation of $\Pi$ to a simple program $\Pi_S$:
- a UCQ-rewriting of $\Pi_S$ yields a UCQ-rewriting of $\Pi$
- a UCQ-rewriting of $\Pi$ yields a UCQ-rewriting of $\Pi_S$ that is
  - unconditionally complete (that is, every input accepted by $\Pi_S$ is also accepted by the rewriting)
  - sound on inputs of girth > rule diameter of $\Pi$
The same is true for monadic Datalog and for Datalog.
But if there is such a flawed UCQ-rewriting, there is also a good one:
Girth Lemma: For all coCSPs and $k > 0$: UCQ-rewritability on inputs of girth $> k$ implies UCQ-rewritability.
This can be proved using a combinatorial lemma due to Erdős / FederVardi.
FO-Rewritability: Results
Theorem [FeierKuusistoL_ICDT17, BourhisL_KR16]: FO-rewritability is decidable and 2NExpTime-complete in MDDLog and in (ALCI, UCQ).
Note: the 2-exponential succinctness gap does not materialise in the complexity.
The result also holds for non-Boolean queries (non-trivial, involves blowups).
The approach also serves to analyse the shape of rewritings:
Theorem [FeierKuusistoL_ICDT17]: In (ALCI, BUCQ), every FO-rewritable OMQ has a UCQ-rewriting in which every CQ has tree-width $(1, k)$, $k = \max\{2, |q|\}$.
Again not quite true for non-Boolean queries (tree-width with parameters).
Datalog-Rewritability
It is unclear whether the girth lemma holds for Datalog-rewritability.
Observations: it does hold for
- monadic Datalog-rewritability, thus we obtain decidability for that (between 2NExpTime and 3ExpTime)
- CSPs constructed from MDDLog programs that have equality, that is, there is a binary EDB relation eq such that for all IDBs $P$: $P(x) \wedge \mathsf{eq}(x,y) \rightarrow P(y)$ and $P(y) \wedge \mathsf{eq}(x,y) \rightarrow P(x)$
Theorem [FeierKuusistoL_ICDT17]: For MDDLog programs that have equality, Datalog-rewritability is 2NExpTime-complete.
Every MDDLog program can be extended with equality.
It is unclear whether this preserves Datalog-rewritability (for CSPs it does).
Future Directions
Natural next questions:
- Is Datalog-rewritability of MDDLog programs decidable?
- Find practically feasible procedures for computing rewritings for OMQs from the ALC family.
More general question: There are many querying-based classes of problems for which people study complexity classification and rewriting, e.g. consistent query answering, deletion propagation, etc. Can we understand better how they interrelate?