Graph Databases. Advanced Topics in Foundations of Databases, University of Edinburgh, 2017/18

Similar documents
Querying Graph Databases 1/ 83

Processing Regular Path Queries Using Views or What Do We Need for Integrating Semistructured Data?

Schema Mappings and Data Exchange for Graph Databases

Regular Path Queries on Graphs with Data

Containment of Data Graph Queries

Regular Expressions for Data Words

View-based query processing: On the relationship between rewriting, answering and losslessness

Structural Characterization of Graph Navigational Languages (Extended Abstract)

Querying Graph Databases

Center for Semantic Web Research.

Publications of Pablo Barceló by type

A Querying Graphs with Data

ANDREAS PIERIS JOURNAL PAPERS

a standard database system

A TriAL: A navigational algebra for RDF triplestores

On the Hardness of Counting the Solutions of SPARQL Queries

a standard database system

Semantic data integration in P2P systems

ATFD: material for presentations, projects, and essays

Semantic Acyclicity on Graph Databases

SPARQL with Property Paths

TriAL for RDF: Adapting Graph Query Languages for RDF Data

following syntax: R ::= > n j P j $i=n : C j :R j R 1 u R 2 C ::= > 1 j A j :C j C 1 u C 2 j 9[$i]R j (» k [$i]r) where i and j denote components of r

Monadic Datalog Containment on Trees

New Rewritings and Optimizations for Regular Path Queries

Three easy pieces on schema mappings for tree-structured data

FOUNDATIONS OF DATABASES AND QUERY LANGUAGES

Approximation Algorithms for Computing Certain Answers over Incomplete Databases

On the Complexity of Query Answering over Incomplete XML Documents

On the Data Complexity of Consistent Query Answering over Graph Databases

Logical Foundations of Relational Data Exchange

SEMANTIC ACYCLICITY ON GRAPH DATABASES

Data integration lecture 3

Model checking pushdown systems

Logical Aspects of Massively Parallel and Distributed Systems

Conjunctive queries. Many computational problems are much easier for conjunctive queries than for general first-order queries.

Counting multiplicity over infinite alphabets

Foundations of SPARQL Query Optimization

Conjunctive Query Containment in Description Logics with n-ary Relations

Certain Answers as Objects and Knowledge

XML Research for Formal Language Theorists

The Property Graph Database Model

Classical DB Questions on New Kinds of Data

Recursion in SPARQL. Juan L. Reutter, Adrián Soto, and Domagoj Vrgoč. PUC Chile and Center for Semantic Web Research

Incomplete Data: What Went Wrong, and How to Fix It. PODS 2014 Incomplete Information 1/1

The Logic of the Semantic Web. Enrico Franconi Free University of Bozen-Bolzano, Italy

XPath for DL-Lite Ontologies

MASTRO-I: Efficient integration of relational data through DL ontologies

Posets, graphs and algebras: a case study for the fine-grained complexity of CSP s

Answering Pattern Queries Using Views

A Retrospective on Datalog 1.0

Learning Queries for Relational, Semi-structured, and Graph Databases

Theory of Computations Spring 2016 Practice Final

The Complexity of Relational Queries: A Personal Perspective

Structural characterizations of schema mapping languages

Reasonable Highly Expressive Query Languages

Foundations of Schema Mapping Management

Preferentially Annotated Regular Path Queries

Database Theory VU , SS Introduction: Relational Query Languages. Reinhard Pichler

Introduction to Finite Model Theory. Jan Van den Bussche Universiteit Hasselt

The Relational Model

Context-Free Path Queries on RDF Graphs

Relative Information Completeness

Data Integration: A Theoretical Perspective

Reasoning over Evolving Graph-structured Data Under Constraints

Decidable Problems. We examine the problems for which there is an algorithm.

On the Codd Semantics of SQL Nulls

Finite Model Theory and Its Applications

Formalizing MongoDB Queries

Data integration lecture 2

Provable data privacy

A Formal Study of Practical Regular Expressions

Lexical Analysis. Implementation: Finite Automata

On Reconciling Data Exchange, Data Integration, and Peer Data Management

DATABASE THEORY. Lecture 11: Introduction to Datalog. TU Dresden, 12th June Markus Krötzsch Knowledge-Based Systems

Safe Stratified Datalog With Integer Order Does not Have Syntax

Some Complexity Results for Stateful Network Verification

SMO System Management Ontology

Streaming Tree Transducers

Semantic Optimization of Preference Queries

Schema Mappings and Data Exchange

Peter T. Wood. Third Alberto Mendelzon International Workshop on Foundations of Data Management

Chapter 3 Complexity of Classical Planning

On the Complexity of Enumerating the Answers to Well-designed Pattern Trees

CS256 Applied Theory of Computation

Query Containment for Data Integration Systems

Evaluating an Automata Approach to Query Containment

Outline. Language Hierarchy

Property Path Query in SPARQL 1.1

COMP4418 Knowledge Representation and Reasoning

Automata Theory for Reasoning about Actions

2 A topological interlude

Formal Languages and Automata

Theory of Computations Spring 2016 Practice Final Exam Solutions

Thompson groups, Cantor space, and foldings of de Bruijn graphs. Peter J. Cameron University of St Andrews

Logical reconstruction of RDF and ontology languages

CSE 544: Principles of Database Systems

Scale Independence: Using Small Data. to Answer. Queries onbig Data. Floris Geerts. University of Antwerp 1/61

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems

A Characterization of the Chomsky Hierarchy by String Turing Machines

Transcription:

Graph Databases Advanced Topics in Foundations of Databases, University of Edinburgh, 2017/18

Graph Databases and Applications Graph databases are crucial when topology is as important as the data Several modern applications Semantic Web and RDF Social networks Knowledge graphs etc. β 1 5 γ α β 2 α 3 4

Graph Databases vs. Relational Databases Simply use standard relational databases β 1 5 γ α β 2 α 3 4 Graph id_o label id_t 1 α 3 1 β 5 1 γ 2 2 β 5 2 α 4 Problems: We need to navigate the graph recursion is needed We can use Datalog performance issues (complexity mismatch, basic static analysis task are undecidable)

Graph Data Model Different applications gave rise to different graph data models But, the essence is the same finite, directed, edge labeled graphs

Graph Data Model An graph database G over a finite alphabet Λ is a pair (V, E) finite set of node ids set of edges of the form v α u where u,v 2 V and α 2 Λ α Path in G: π = v 1 α 1 v 2 2 v 3 α v k k v k+1 The label of π is λ(π) = α 1 α 2 α 3...α k 2 Λ*

Graph Database: Example A graph database representation of a fragment of DBLP journal:jacm journal Jacm:HopcroftT74 creator :Robert_E_Tarjan conf:focs series infocs:focs8 partof Focs:HopU67a creator :John_E_Hopcroft inpods:89 partof Pods:Ullman89 creator :Jeffrey_Ullman conf:pods inpods:83 partof Pods:FaginUV83 creator :Ronald_Fagin :Moshe_Y_Vardi

Regular Path Queries (RPQs) Basic building block of graph queries First studied in 1989 An RPQ is a regular expression over a finite alphabet Λ Given a graph database G = (V,E) over Λ and RPQ Q over Λ Q(G) = {(v,u) v,u 2 V and there is a path π from v to u such that λ(π) 2 L(Q)}

RPQs With Inverses (2RPQs) Extension of RPQs with inverses two-way RPQs First studied in 2000 2RPQs over Λ = RPQs over Λ = Λ [ {α α 2 Λ} Given a graph database G = (V,E) over Λ and 2RPQ Q over Λ Q(G) = Q(G ) α obtained from G by adding u v for each v α u

Querying Graph Database Compute the pairs (c,d) such that author c has published in conference or journal d journal:jacm journal Jacm:HopcroftT74 creator :Robert_E_Tarjan conf:focs series infocs:focs8 partof Focs:HopU67a creator :John_E_Hopcroft inpods:89 partof Pods:Ullman89 creator :Jeffrey_Ullman conf:pods inpods:83 partof Pods:FaginUV83 creator :Ronald_Fagin (creator ((partof series) [ journal)) :Moshe_Y_Vardi

Querying Graph Database Compute the pairs (c,d) such that author c has published in conference or journal d d journal:jacm journal Jacm:HopcroftT74 creator c :Robert_E_Tarjan conf:focs series infocs:focs8 partof Focs:HopU67a creator :John_E_Hopcroft inpods:89 partof Pods:Ullman89 creator :Jeffrey_Ullman conf:pods inpods:83 partof Pods:FaginUV83 creator :Ronald_Fagin (creator ((partof series) [ journal)) :Moshe_Y_Vardi

Querying Graph Database Compute the pairs (c,d) such that author c has published in conference or journal d journal:jacm journal Jacm:HopcroftT74 creator :Robert_E_Tarjan conf:focs series infocs:focs8 partof Focs:HopU67a creator :John_E_Hopcroft d inpods:89 partof Pods:Ullman89 creator :Jeffrey_Ullman conf:pods inpods:83 partof Pods:FaginUV83 creator c :Ronald_Fagin (creator ((partof series) [ journal)) :Moshe_Y_Vardi

Evaluation of 2RPQs EVAL(2RPQ) Input: a graph database G, a 2RPQ Q, two nodes v,u of G Question: (v,u) 2 Q(G)? It boils down to the problem: RegularPath Input: a graph database G over Λ, a regular expression Q over Λ, two nodes v,u of G Question: is there a path π from v to u in G such that λ(π) 2 L(Q)

Complexity of RegularPath Theorem: RegularPath can be solved in time O( G Q ) Proof Idea: by exploiting nondeterministic finite automata (NFA) Compute in linear time from Q an equivalent NFA A Q Compute in linear time an NFA A G obtained from G by setting v and u as initial and finite states, respectively There is a path π from v to u in G such that λ(π) 2 L(Q) iff L(A G ) \ L(A Q ) is non-empty Non-emptiness can be checked in time O( A G A Q ) = O( G Q ) A graph database can be naturally seen as an NFA nodes are states edges are transitions

Complexity of 2RPQs We immediately get that: Theorem: EVAL(2RPQ) can be solved in time O( G Q ) Regarding the data complexity (i.e., Q is fixed): Theorem: EVAL Q (2RPQ) is in NLOGSPACE (by exploiting the previous automata construction)

Limitation of RPQs RPQs are not able to express arbitrary patterns over graph databases (e.g., compute the pairs (c,d) that are coauthors of a conference paper) We need to enrich RPQs with joins and projections Conjunctive regular path queries (CRPQs) C2RPQs if we add inverses

C2RPQs: Example Compute the pairs (c,d) that are coauthors of a conference paper journal:jacm journal Jacm:HopcroftT74 creator :Robert_E_Tarjan conf:focs series infocs:focs8 partof Focs:HopU67a creator :John_E_Hopcroft inpods:89 partof Pods:Ullman89 creator :Jeffrey_Ullman conf:pods inpods:83 partof Pods:FaginUV83 creator :Ronald_Fagin :Moshe_Y_Vardi

C2RPQs: Example Compute the pairs (c,d) that are coauthors of a conference paper Q(x,u) :- (x, creator, y), (y, partof series, z), (y, creator, u) journal:jacm journal Jacm:HopcroftT74 creator :Robert_E_Tarjan conf:focs series infocs:focs8 partof Focs:HopU67a creator :John_E_Hopcroft inpods:89 partof Pods:Ullman89 creator :Jeffrey_Ullman conf:pods z inpods:83 partof Pods:FaginUV83 creator u :Ronald_Fagin y x (: Moshe_Y_Vardi, :Ronald_Fagin) :Moshe_Y_Vardi

C2RPQs: Formal Definition A C2RPQ over an alphabet Λ is a rule of the form Q(z) :- (x 1, Q 1, y 1 ),, (x n, Q n, y n ) where x i, y i are variables, Q i is a 2RPQ over Λ, z are the output variables from {x 1, y 1,, x n, y n } Remark: C2RPQs are more expressive than 2RPQs (previous example)

Evaluation of C2RPQs To evaluate a C2RPQ of the form Q(z) :- (x 1, Q 1, y 1 ),, (x n, Q n, y n ) we simply need to evaluate the conjunctive query Q(z) :- Q 1 (x 1, y 1 ),, Q n (x n, y n ) where each Q i stores the result of evaluating the 2RPQ Q i

Complexity of C2RPQs Theorem: EVAL(C2RPQ) is NP-complete Proof Hints: Upper bound: polynomial time reduction to EVAL(CQ) Lower bound: inherited from CQs over graphs Regarding the data complexity (i.e., Q is fixed): Theorem: EVAL Q (C2RPQ) is in NLOGSPACE

Basic Graph Query Languages: Recap Two-way regular path queries (2RPQs) Can be evaluated in linear time in combined complexity, and in NLOGSPACE in data complexity Conjunctive 2RPQs (C2RPQs) Evaluation is NP-complete in combined complexity, and in NLOGSPACE in data complexity

Towards Tractable C2RPQs Recall acyclic conjunctive queries 1 2 3 8 4 7 5 6 {8,9} {1,3,4,5,6,7,8} {9,10,11} 13 9 {1,2,3} {11,12} 12 11 10 {12,13}

Acyclic C2RPQs A C2RPQ is acyclic if its underlying CQ is acyclic Q :- (x, Q 1, x), (x, Q 2, y), (y, Q 3, x) x y Q :- (x, Q 1, y), (y, Q 2, z), (z, Q 3, x) x z y Equivalently, the underlying graph does not contain cycles of length 3

Complexity of Acyclic C2RPQs Theorem: EVAL(AC2RPQ) can be solved in time O( G 2 Q 2 ) {Q 2 C2RPQ Q is acyclic} Proof Idea: recall that we can reduce EVAL(C2RPQ) to EVAL(CQ)

Simple Path Semantics Simple Path: No node is repeated In this case, EVAL(2RPQ) boils down to the problem: RegularSimplePath Input: a graph database G over Λ, a regular expression Q over Λ, two nodes v,u of G Question: is there a simple path π from v to u in G such that λ(π) 2 L(Q)

Simple Path Semantics Theorem: RegularSimplePath is NP-complete Theorem: RegularSimplePath Q is NP-complete (data complexity) RegularSimplePath (0 0)* Is there a simple directed path of even length? NP-complete NP-complete data complexity means impractical

Containment of Graph Queries CONT(L) Input: two queries Q 1 2 L and Q 2 2 L Question: Q 1 µ Q 2? (i.e., Q 1 (G) µ Q 2 (G) for every graph database G?)

Containment of Graph Queries Theorem: CONT(RPQ) is PSPACE-complete Proof Hint: exploit containment of regular expressions Theorem: CONT(2RPQ) is PSPACE-complete Proof Hint: exploit containment of two-way automata, while the lower bound is inherited from RPQs Theorem: CONT(C2RPQ) is EXPSPACE-complete Proof Hint: exploit containment of two-way automata, while the lower bound is by reduction from a tiling problem

Limitations of CRPQs Compute the pairs (c,d) that are linked by a path labeled in {α n β n n 0} π 1 π 2 v w u such that λ(π 1 ) 2 L(α*) and λ(π 2 ) 2 L(β*) and λ(π 1 ) = λ(π 2 ) Not expressible using CRPQs. We need: To define complex relationships among labels of paths To include paths in the output of a query

Comparing Paths With Regular Relations Regular languages for n-ary relations n-ary regular relations: set of n-tuples (w 1,,w n ) of words over an alphabet Λ Accepted by a synchronous automaton over Λ n The input strings are written in the n-tapes Shorter strings are padded with the symbol # not in Λ At each step, the automaton simultaneously reads the next symbol on each tape, terminating when it reads # on each tape

Synchronous Automata w 1 = α α β... α β γ w 2 = α β α... α w 3 = β β...... w n = α β β... α γ

Synchronous Automata w 1 = α α β... α β γ w 2 = α β α... α # # w 3 = β β #... # # #... w n = α β β... α γ #

Synchronous Automata w 1 = α α β... α β γ w 2 = α β α... α # # w 3 = β β #... # # #... w n = α β β... α γ #

Synchronous Automata w 1 = α α β... α β γ w 2 = α β α... α # # w 3 = β β #... # # #... w n = α β β... α γ #

Synchronous Automata w 1 = α α β... α β γ w 2 = α β α... α # # w 3 = β β #... # # #... w n = α β β... α γ #

Regular Relations: Examples All regular languages regular relations of arity 1 Path equality: w 1 = w 2 Length comparison: w 1 = w 2, w 1 < w 2, w 1 w 2 Prefix: w 1 is a prefix of w 2

Extended CRPQs With Regular Relations (REG) An ECRPQ(REG) is a rule obtained from a CRPQ as follows annotate each Q(z) :- (x 1, Q 1, y 1 ),, (x n, Q n, y n ) pair (x i,y i ) with a path variable π i output some of π i s as a tuple π in the output Q(z) :- (x 1, π 1, y 1 ),, (x n, π n, y n ) Q(z) :- (x 1, π 1, y 1 ),, (x n, π n, y n ), ^j S j (π j ) Q(z,π) :- (x 1, π 1, y 1 ),, (x n, π n, y n ), ^j S j (π j ) compare labels of paths in π j w.r.t. S j 2 REG

Evaluation of EC2RPQ(REG) Q(z,π) :- (x 1, π 1, y 1 ),, (x n, π n, y n ), ^j S j (π j ) Same as CRPQs, but Each π i is mapped to a path ρ i in the graph database For each j, if π j = (π j1,...,π jk ) ) (λ(ρ j1 ),...,λ(ρ jk )) 2 S j

Example of ECRPQ(REG) Compute the pairs (c,d) that are linked by a path labeled in {α n β n n 0} π 1 π 2 v w u such that λ(π 1 ) 2 L(α*) and λ(π 2 ) 2 L(β*) and λ(π 1 ) = λ(π 2 ) Q(x,y) :- (x, π 1, z), (z, π 2, y), α*(π 1 ), β*(π 2 ), Equal_Length(π 1,π 2 )

ECRPQ(REG) vs. CRPQs Q(z) :- (x 1, Q 1, y 1 ),, (x n, Q n, y n ) Q(z) :- (x 1, π 1, y 1 ),, (x n, π n, y n ), Q 1 (π 1 ),, Q n (π n )

Complexity of ΕC2RPQ(REG) Theorem: It holds that EVAL(ECRPQ(REG)) is PSPACE-complete EVAL Q (ECRPQ(REG)) is in NLOGSPACE (data complexity) CONT(ECRPQ(REG)) is undecidable

Beyond Regular Relations Subsequences w 1 is a subsequence of w 2, i.e., w 1 can be obtained from w 2 by deleting some letters Subword: w 3 w 1 w 4 = w 2 we can exploit rational relations (RAT) - ECRPQ(RAT)

Path Query Languages: Recap CRPQs do not allow to compare labels of paths and export paths This has led to the introduction of ECRPQ(REG) Preserves data tractability But containment becomes undecidable We can go beyond REG ECRPQ(RAT) Undecidability of query evaluation We obtain data tractability if we restrict the syntax

Querying Graphs With Data So far queries talk about the topology of the data However, graph databases contain data data graphs We have query languages that can talk about data paths (obtained by replacing each node in a path by its value)

Associated Papers Mariano P. Consens, Alberto O. Mendelzon: Low Complexity Aggregation in GraphLog and Datalog. Theor. Comput. Sci. 116(1): 95-116 (1993) One of the papers introducing (C)RPQs Pablo Barcelo: Querying graph databases. PODS 2013: 175-188 Renzo Angles, Claudio Gutierrez: Survey of graph database models. ACM Comput. Surv. 40(1) (2008) Two surveys of graph languages, two are more theoretical, one more practical

Associated Papers Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi: Rewriting of Regular Expressions and Regular Path Queries. J. Comput. Syst. Sci. 64(3): 443-465 (2002) Introducing two-way queries Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi: Reasoning on regular path queries. SIGMOD Record 32(4): 83-92 (2003) Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi: Containment of Conjunctive Regular Path Queries with Inverse. KR 2000: 176-185 Static analysis of regular path queries Leonid Libkin, Wim Martens, Domagoj Vrgoc: Querying graph databases with XPath. ICDT 2013: 129-140 Adding data values to (C)RPQs

Associated Papers Pablo Barcelo, Leonid Libkin, Anthony Widjaja Lin, Peter T. Wood: Expressive Languages for Path Queries over Graph-Structured Data. ACM Trans. Database Syst. 37(4): 31 (2012) Extending RPQs with regular relations Pablo Barcelo, Diego Figueira, Leonid Libkin: Graph Logics with Rational Relations. Logical Methods in Computer Science 9(3) (2013) Extending RPQs with rational relations Dominik D. Freydenberger, Nicole Schweikardt: Expressiveness and Static Analysis of Extended Conjunctive Regular Path Queries. AMW 2011 Resolving some of the questions on the containment of path queries Jelle Hellings, Bart Kuijpers, Jan Van den Bussche, Xiaowang Zhang: Walk logic as a framework for path query languages on graph databases. ICDT 2013: 117-128 A different approach to expanding the power of path languages

Associated Papers Pablo Barcelo, Leonid Libkin, Juan L. Reutter: Querying Regular Graph Patterns. Journal of the ACM 61(1): 8:1-8:54 (2014) Incomplete information in graph databases and querying it Wenfei Fan, Xin Wang, Yinghui Wu: Querying big graphs within bounded resources. SIGMOD Conference 2014: 301-312 Wenfei Fan: Graph pattern matching revised for social network analysis. ICDT 2012: 8-21 Two papers on making graph queries scalable