Copyright 2016 by Sahibi Miranshah. All Rights Reserved


ABSTRACT

MIRANSHAH, SAHIBI. Integrating a Path Operator in Apache Jena for Generalized Graph Pattern Matching. (Under the direction of Dr. Kemafor Anyanwu Ogan.)

The emergence of large heterogeneous networks, largely driven by the W3C standards for representing metadata and relationships on the Web, has introduced the need for more flexible querying methods than the standard pattern matching paradigm. For such large graphs, it will be common for users to know only a part of the structure they are interested in finding; in some cases, finding the structure would in fact be the goal of the query. Such scenarios require flexible matching paradigms in which it is possible to match both fixed structure patterns and variable structure patterns. The present infrastructure for supporting such generalized queries exists only in a very limited form. In this thesis we explore some approaches for achieving this. There are two common approaches to implementing generalized graph pattern matching queries, i.e. queries that can match not just edges but paths between source and destination node bindings. The first is the traditional graph pattern matching approach, which roughly corresponds to relational query processing and evaluates the query using relational join expressions. The second is to use graph traversal algorithms, which convert the graph pattern to a regular expression and execute its finite automaton on the dataset graph. Both approaches are limited in the expressiveness of their queries and in their performance on datasets as large and diverse as the Semantic Web. In our research we use a hybrid approach that decomposes the generalized query into parts: traditional graph pattern matching is used to evaluate one part of the query, and paths are evaluated using a path algebraic algorithm.
We then integrate support for such queries into Apache Jena, the popular open-source Semantic Web framework for Java, by extending the Jena API to support compilation and execution of such queries and by integrating a path operator. The implementation was evaluated using multiple datasets, and the overhead in query compilation time was minimal compared to queries that do not perform path matching.

Copyright 2016 by Sahibi Miranshah. All Rights Reserved

Integrating a Path Operator in Apache Jena for Generalized Graph Pattern Matching

by
Sahibi Miranshah

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science

Computer Science

Raleigh, North Carolina
2016

APPROVED BY:

Dr. Raju Vatsavai
Dr. Guoliang Jin
Dr. Kemafor Anyanwu Ogan (Chair of Advisory Committee)

DEDICATION

To my parents.

BIOGRAPHY

Sahibi Miranshah received her Bachelor's degree in Computer Science and Engineering from Dr. B.R. Ambedkar National Institute of Technology, Jalandhar, India. She enrolled at North Carolina State University in 2014 for her Master of Science degree in Computer Science.

ACKNOWLEDGEMENTS

I would like to thank my advisor Dr. Kemafor Anyanwu Ogan for her guidance in my thesis work. Her knowledge, consistent motivation, humble attitude and passion for research and innovation made my first steps in academic research an exciting and fulfilling experience. I would like to thank Dr. Raju Vatsavai and Dr. Guoliang Jin for being part of my graduate committee. I would like to thank my colleagues in the Semantic Computing Research Lab, Sidan Gao, HyeongSik Kim, Shalki Shrivastava and Avimanyu Mukhopadhyay, for their support. I would like to thank Sidan for her invaluable suggestions, feedback and fruitful research-oriented discussions. Finally, I would like to thank my family and friends who have supported me in all my endeavors, without which this would not have been possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 INTRODUCTION
    Structure of the Thesis
Chapter 2 BACKGROUND
    RDF
        RDF Model
        RDF Serialization Formats
    SPARQL
        Querying Data With SPARQL
    Graph Pattern Matching
        General Definition of Graph Pattern Matching
        Types of Graph Patterns
        Data Models and Graph Pattern Matching
        Systems supporting the Graph Pattern Matching problem
    Generalized Graph Pattern Matching
        SPARQ2L
        Solve Algorithm
    Apache Jena
        Jena Architecture
Chapter 3 APPROACH
    Generic SPARQL Query Engine
    Extending Jena
        ARQ Query Processing
        Main Query Engine
        Custom Query Engine
        Algebra Extensions
        Expression Functions and Property Functions
    Query Plan Generation
    Architecture of Extended Jena
    Integrating the Path Operator
        Path Operator OpFindAllPaths
        Integration of the Path Operator
    User Workflow
        Path Query Syntax

    Related Work
Chapter 4 EVALUATION
    Query Compilation Time
    Query Execution Time
    Usability
    List of Queries
Chapter 5 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY
APPENDIX
    Appendix A Code

LIST OF TABLES

Table 4.1 Compile Times of queries with and without path operator
Table 4.2 Execution Times of queries with and without path operator

LIST OF FIGURES

Figure 2.1 The graph describes two people, and each one has properties name, gender and knows
Figure 2.2 The graph uses the FOAF ontology, describes two people, and each one has properties name, gender and knows
Figure 2.3 The algorithm SOLVE to solve the single source path expression problem [Tar81]
Figure 2.4 The algorithm ELIMINATE is used to pre-compute the path sequence for an input graph G whose vertices are numbered from 1 to n [Tar81]
Figure 2.5 The Jena architecture illustrating interaction between the APIs [Arc]
Figure 3.1 Data graph to illustrate query types
Figure 3.2 The intuition behind the integration of the path operator is to decompose the path query and use the existing Jena framework to execute graph pattern matching, and use a path algebraic approach to extract paths
Figure 3.3 A generic SPARQL query processing, optimization and execution framework
Figure 3.4 Phases of ARQ query processing
Figure 3.5 Query plan for a simple example query without the path operator integrated into the query engine
Figure 3.6 Query plan for the example query after integrating the path operator into the query engine
Figure 3.7 Query plan for the example query after integrating the path operator into the query engine
Figure 3.8 The diagram shows the architecture of the Path-Extraction Enhanced Jena API
Figure 3.9 The flowchart shows how the Jena API can be used to execute SPARQL queries with or without path computation, by using an object of class ExtendedModel and by specifying boolean arguments that reflect the user's choice of whether to compute paths or not
Figure 4.1 Graph to illustrate the comparison of compile time in queries with and without path computations
Figure 4.2 Graph to illustrate the comparison of execution time in queries with and without path computations

CHAPTER 1

INTRODUCTION

With the exponential growth in data being contributed to the Semantic Web and its representation as large heterogeneous graphs, and the reliance of a large number of applications on the ability to query this unstructured data, there is an urgent need for flexible querying paradigms. Traditional graph pattern matching infrastructure does not allow path extraction queries with variable structure patterns. In circumstances like intelligence analysis, crime investigations and social network analysis, there is a need to find the connections between certain groups of entities. For example, a query that asks for bindings of the variable ?edge in a graph pattern "PersonX" ?edge "NC State" can match the edge variable for the triple "PersonX" "worksin" "NC State". To be able to perform generalized graph pattern matching, a term we use to refer to queries that match paths between source and destination node pairs, we need a more flexible querying paradigm. For example, airport security officers might need to query

for the relation between a PersonX on the FBI watchlist and JFK airport, which would be represented by the path pattern: "PersonX" ??path "JFK". This path query would look for paths like: "PersonX" "passengerin" "Flight D143". "Flight D143" "arrivesat" "JFK". or "PersonX" "passengerin" "Flight D143". "Flight D143" "departsfrom" "JFK". Path queries might also contain constraints; for example, the above query might have a constraint that the path should contain the edge label "arrivesat", in which case the second path would not be a valid result.

RDF (Resource Description Framework), a collection of W3C specifications, is a popular data model used to represent data in the Semantic Web. SPARQL, the official W3C recommendation for querying RDF graphs, does not provide path extraction capability. The aim of our research is to integrate a path operator into the existing open source Apache Jena framework and its ARQ query engine for SPARQL. There are a few common approaches that could be used to implement the generalized graph pattern matching paradigm. The traditional graph pattern matching approach, which roughly corresponds to relational query processing, would use joins to execute path queries, but this approach would have high overheads, recursion would be limited and querying would be cumbersome. The other approach would be to specify the path query using a regular expression, which can be compiled into a deterministic finite automaton and executed to find the path bindings. These approaches are limited in terms of their expressiveness, efficiency and performance on large disk-resident graphs.
In our research we use a hybrid approach: we break the path query down into fixed and variable structure patterns, use the existing ARQ query engine infrastructure to execute the fixed pattern, and use a path algebraic algorithm to evaluate the variable structure and output the set of paths between the source and destination node pairs, with or without constraints. We then test the integrated system for accuracy and compare compile times between queries with and without our path operator integration.
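As a rough sketch of this decomposition idea (illustrative Python only, not the Jena/ARQ implementation; the tuple-based query format, the `??` path-variable convention and the function name here are assumptions for exposition):

```python
# Split a generalized query into its fixed structure patterns, which go to
# ordinary graph pattern matching, and its variable structure (path)
# patterns, which go to the path-extraction step. Illustrative only.

def split_query(patterns):
    """Path patterns are marked by a predicate position starting with '??'."""
    fixed = [t for t in patterns if not t[1].startswith("??")]
    paths = [t for t in patterns if t[1].startswith("??")]
    return fixed, paths

query = [
    ("PersonX", "worksin", "?place"),  # fixed structure: a normal triple pattern
    ("PersonX", "??path", "JFK"),      # variable structure: a path to extract
]

fixed_part, path_part = split_query(query)
# fixed_part would be handed to the traditional pattern-matching engine;
# path_part would be handed to the path algebraic algorithm.
```

The point of the split is that each half can then be evaluated by the engine best suited to it, with the bindings from the fixed part feeding the path-extraction step.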

1.1 Structure of the Thesis

The thesis is structured into five chapters. Chapter 2 discusses the background of the research, comprising the RDF data model, the SPARQL query language, the traditional graph pattern matching paradigm and what generalized graph pattern matching implies. It also introduces the popular open source Apache Jena framework for SPARQL query processing and its architecture. The chapter points to the SPARQ2L research publication that proposes a grammar extension enabling generalized graph pattern matching. It also discusses the Solve algorithm [Tar81] given by Tarjan, which we use as a basis to implement our path operator, and the prefix-solve algorithm [GA13], an implementation framework that allows efficient querying of graphs on disk. Chapter 3 discusses the approach used for the integration of the path operator, the extension points used in the existing Jena ARQ query engine and the implementation details. It also describes how compilation of path queries takes place, the path query syntax that we use, and comparisons with existing related work. Chapter 4 discusses the experiments conducted, their evaluation and results. It also gives the datasets and queries we used to perform the experiments. Chapter 5 gives the conclusion and scope for future work. The implemented code is given in Appendix A.

CHAPTER 2

BACKGROUND

With the evolution of technology and the exponential growth of the World Wide Web, there is a huge volume of data being generated from social media, news, research data, repositories etc. A huge amount of it is stored as graphs, including hypertext data, data from social networking sites, and unstructured and semi-structured data, especially because of the high expressive power of graphs to represent complex structures of data and its interconnections. Structured data (i.e. data in relational databases, spreadsheets) has its own advantages due to the fixed structure of the data model used. It is easier to enter, store, visualize, query and manage, and has been tried and tested for over 40 years. On the other hand, modern applications and their use of data in the form of images, videos, webpages, PowerPoint presentations, blogs, PDF files, wikis, word processing documents and e-mails make it harder to store, query, manage or analyze by fitting it into neat columns and tables. The lack of structure in such data makes managing it a time and energy consuming task. Thus the need for solutions to manage semi-structured (for example, e-mails) and unstructured (for example, videos,

blogs) data arises. Unstructured data refers to information that is not organized in a pre-defined manner. The irregularity and ambiguity in the data make it difficult to understand and process using traditional programs and techniques, as compared to relational data. There are several techniques, such as data mining and natural language processing, to find patterns in or interpret unstructured data. Some structure can be imparted to the data by using tags or metadata, making it semi-structured data. Unstructured data can be stored in document databases (for example, MongoDB), graph databases (for example, Neo4j) or key-value stores.

2.1 RDF [W3c]

The World Wide Web was originally built for human usage and consumption, and even though everything on it is machine-readable, this data is not machine-understandable. Due to the sheer volume of the information on the Web, it is not possible to manage it manually, but at the same time it is hard to automate anything on the Web. The solution is to use metadata to describe the data contained on the Web. Metadata is "data about data" or, in the context of this specification, "data describing Web resources". Resource Description Framework (RDF) is a foundation for processing metadata, formalized by the World Wide Web Consortium. It provides interoperability between applications that use and exchange machine-understandable information on the Web. RDF enables the automated processing of Web resources. RDF can be used in a vast number of application areas, for example in resource discovery to implement better search engine capabilities, in cataloging for listing the content and content relationships available at a particular Web site, by intelligent software agents to aid knowledge sharing and exchange, in content rating, for characterizing the intellectual property rights of Web pages, and for describing the privacy preferences of a user and the privacy policies of a website.
RDF with digital signatures is the foundation for building the "Web of Trust" for electronic commerce, collaboration, and other applications. The RDF syntax uses the Extensible Markup Language (XML). RDF has a class system, much like many object-oriented programming and modeling systems, to define the metadata. A collection of classes is called a schema. Classes are systematized in a hierarchy, and offer extensibility through subclass refinement.

RDF Model

The base of RDF is a model for representing named properties and property values. The RDF model is based on well-established principles from various data representation communities. RDF properties are essentially attributes of resources and correspond to traditional attribute-value pairs; they also represent relationships between resources. An RDF model can therefore resemble an entity-relationship diagram. In terms of object-oriented design, resources correspond to objects and properties correspond to instance variables. The RDF data model is a syntax-neutral representation of RDF expressions: two RDF expressions are equivalent if and only if their data model representations match. The basic data model consists of three object types:

Resources: All things being described by RDF expressions are resources. A resource may be an entire webpage, a part of a webpage, a whole collection of pages or an object that is not directly accessible via the web. Resources are always named by URIs (Uniform Resource Identifiers) plus optional anchor IDs.

Properties: A property is a specific feature, characteristic, attribute, or relation used to describe a resource. Each property has a definite meaning and defines its acceptable values, the types of resources it can describe, and its relationship with other properties.

Statements: A particular resource together with a named property and the value of that property for that resource is an RDF statement. These three individual parts of a statement are called, respectively, the subject, the predicate, and the object; together they are known as a subject-predicate-object triple. The object of a statement can be another resource or a literal.
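As a small illustration of this model (a minimal Python sketch; the `ex:` resource and property names are invented for exposition, not from any real vocabulary):

```python
# Statements as (subject, predicate, object) triples; a set of statements
# forms a labeled, directed multi-graph. Names are illustrative only.
statements = [
    ("ex:page1", "ex:creator", "ex:john"),   # object is another resource
    ("ex:page1", "ex:title", "Home Page"),   # object is a literal
    ("ex:john",  "ex:name",  "John Smith"),
]

# Every statement decomposes into its subject, predicate and object.
subject, predicate, obj = statements[0]

def properties_of(resource, triples):
    """Collect the attribute-value pairs describing one resource."""
    return {(p, o) for s, p, o in triples if s == resource}
```

Here `properties_of("ex:page1", statements)` collects the two property-value pairs describing that resource, mirroring the attribute-value view of properties described above.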
Since RDF is an abstract model for data representation, the exact encoding of data varies depending on the serialization format. The RDF model's simplicity and its ability to represent diverse data have led to its widespread use in knowledge management

applications. An RDF database is basically a collection of triples or statements that represent a labeled, directed multi-graph. An example is shown in Figure 2.1.

Figure 2.1 The graph describes two people, and each one has properties name, gender and knows

RDF Serialization Formats

Following are the popular RDF serialization formats [W3c]:

Turtle: Terse RDF Triple Language is a compact, human-friendly and concrete syntax for RDF defined by the W3C. Turtle is an extension of N-Triples which includes the most useful and appropriate features of Notation 3 while staying within the RDF model. The recommended XML syntax for RDF, RDF/XML, has certain constraints imposed by XML and the use of XML Namespaces that prevent it from encoding all RDF graphs; these constraints do not apply to Turtle. A Turtle document represents an RDF graph in a compact textual form. It comprises a sequence of directives, triple-generating statements or blank lines. Comments can be written after # and continue to the end of the line. Turtle represents the subject-predicate-object triple by grouping the three units of information together and factoring out the common portions.

N-Triples: This is an easy, simple-to-parse, line-based, plain text format created for representing the correct answers for parsing RDF/XML test cases as part of the RDF Core Working Group. It was designed to be a fixed subset of N3, and hence N3 tools such as cwm and Euler can be used to read and process it. It is simpler than Notation 3 and Turtle, hence easier for software to parse and generate. It is recommended, but not required, that N-Triples content is stored in files with an .nt suffix to distinguish them from N3. For example, the graph in Figure 2.1 can be represented in N-Triples format as:

_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid1 <http://xmlns.com/foaf/0.1/name> "Emma" .
_:genid1 <http://xmlns.com/foaf/0.1/gender> "Female" .
_:genid1 <http://xmlns.com/foaf/0.1/knows> _:genid2 .
_:genid2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid2 <http://xmlns.com/foaf/0.1/name> "Rachel" .
_:genid3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid3 <http://xmlns.com/foaf/0.1/name> "Rachel" .
_:genid3 <http://xmlns.com/foaf/0.1/gender> "Female" .
_:genid3 <http://xmlns.com/foaf/0.1/knows> _:genid4 .
_:genid4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid4 <http://xmlns.com/foaf/0.1/name> "Emma" .

N-Quads: This is an easy-to-parse, line-based language. N-Quads statements are a sequence of RDF terms defining the subject, predicate, object and graph label of an RDF triple and the graph it is part of in a dataset. The terms may be separated by white space, and the sequence is terminated by a "." and a new line. N-Quads is a superset of N-Triples used to serialize multiple RDF graphs.

JSON-LD: JSON-LD is a JSON-based format to serialize Linked Data. The syntax was designed to integrate easily into deployed systems that already use JSON, and supplies a smooth upgrade path from JSON to JSON-LD. It is mainly intended as a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines.

Notation 3: This is a non-standard serialization which is a compact and readable alternative to RDF's XML syntax, but is also extended to allow greater expressiveness. It is a superset of Turtle with additional features, such as support for RDF-based rules. N3 extends the RDF data model by adding formulae, variables, logical implication and functional predicates, and by providing a textual syntax alternative to RDF/XML.

RDF/XML: This is an XML-based syntax defined by the W3C, and was the first standard serialization format for RDF. The RDF graph has nodes and labeled directed arcs that link pairs of nodes. This graph is represented as a set of RDF triples where each triple contains a subject node, predicate and object node. Nodes can be IRIs, literals, or blank nodes. Predicates are IRIs and can be interpreted either as a relationship between the two nodes or as defining an attribute value for some subject node. To encode the graph in XML, the nodes and predicates have to be represented in XML terms, i.e. element names, attribute names, element contents and attribute values. For example, the graph in Figure 2.1 can be represented in RDF/XML format as:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Emma</foaf:name>
    <foaf:gender>Female</foaf:gender>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Rachel</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
  <foaf:Person>
    <foaf:name>Rachel</foaf:name>
    <foaf:gender>Female</foaf:gender>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Emma</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>

2.2 SPARQL

SPARQL is the query language for the Semantic Web. "Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL", explained Tim Berners-Lee, W3C Director. SPARQL can be used to articulate queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL queries hide the details of data management, which results in lower costs and increased robustness of data integration on the Web. [Jen][W3c] SPARQL was standardized by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium (W3C), and is one of the fundamental technologies of the Semantic Web. SPARQL queries can comprise triple patterns, optional patterns, conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs. SPARQL has been implemented for multiple programming languages. There exist tools like ViziQuer that allow semi-automatic construction of SPARQL queries for a SPARQL endpoint, as well as tools that translate SPARQL queries to other query languages like SQL and XQuery. Using SPARQL, users can write queries against data that follows the RDF specification,

or what can loosely be called "key-value" data. The database is essentially a set of subject-predicate-object triples. SPARQL is a "data-oriented" query language, in that it only queries the information held in the models; there is no inference in the query language itself. SPARQL takes the description of what the application wants, in the form of a query, and returns the results in the form of a set of bindings or an RDF graph. In terms of a SQL relational database, RDF data can be considered a table with three columns: the subject, the predicate and the object column. Unlike relational databases, the data type of the column values is not required to be homogeneous; the object column is heterogeneous, and the data type of each cell value is implied, or specified in the ontology, by the predicate value. Again comparing to SQL relations, the RDF data would be a table with all triples for a given subject represented as a row, with the subject as the primary key, each possible predicate as a column and the object as the value in the cell. However, SPARQL/RDF becomes easier and more powerful for columns that could contain multiple values (for example, e-mail ids) for the same key, and where the column itself could be a joinable variable in the query, rather than directly specified. SPARQL provides a full set of analytic query operations such as JOIN, SORT and AGGREGATE for data that does not require a separate schema definition, but whose schema is intrinsically part of the data. The schema information is often provided externally to allow different datasets to be joined in an unambiguous manner. Additionally, SPARQL provides specific graph traversal syntax for data that can be thought of as a graph. Below is an example that demonstrates a simple query leveraging the ontology definition "foaf", also called the "friend-of-a-friend" ontology. The query returns the names and homepages of every person in the dataset.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?homepage
WHERE {
    ?person a foaf:Person .
    ?person foaf:name ?name .
    ?person foaf:homepage ?homepage .
}

This query results in a join of all the triples with a matching subject, where the type

predicate ("a") is a person (foaf:Person) and the person has one or more names (foaf:name) and homepages (foaf:homepage). The result of the join is a set of rows with bindings for the variables ?person, ?name and ?homepage. The query returns only ?name and ?homepage; we chose not to return ?person because ?person is often a complex URI rather than a human-friendly string. Some of the people may have multiple homepages, so in the returned set a ?name row may appear multiple times, once for each homepage. This query can be called a federated query if distributed to multiple SPARQL endpoints (i.e. services that accept SPARQL queries and return results), computed, and the results gathered. In either case of a federated or a local query, additional triple definitions in the query could allow joins to different subject types, such as cars. For example, we could use such a query to return a list of names and homepages for people who drive cars with high fuel efficiency.

Querying Data With SPARQL

SPARQL queries RDF graphs, which are essentially sets of triples but can be serialized in any format, for example RDF/XML, Turtle or N-Triples. The example graph given below uses the FOAF ontology, describes two people, and each one has properties name, gender and knows. As triples, the graph would be represented as:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

_:genid1 rdf:type foaf:Person .
_:genid1 foaf:name "Emma" .
_:genid1 foaf:gender "Female" .
_:genid1 foaf:knows _:genid2 .
_:genid2 rdf:type foaf:Person .
_:genid2 foaf:name "Rachel" .
_:genid3 rdf:type foaf:Person .

Figure 2.2 The graph uses the FOAF ontology, describes two people, and each one has properties name, gender and knows.

_:genid3 foaf:name "Rachel" .
_:genid3 foaf:gender "Female" .
_:genid3 foaf:knows _:genid4 .
_:genid4 rdf:type foaf:Person .
_:genid4 foaf:name "Emma" .

Basic Patterns

A basic pattern is a set of triple patterns, which matches when all the triple patterns match with the same value used each time a variable with a given name appears. The following query involves two triple patterns, each ending in a ".". The variable x has to be the same for each triple pattern match.

SELECT ?name
WHERE {
    ?x <http://xmlns.com/foaf/0.1/gender> "Female" .
    ?x <http://xmlns.com/foaf/0.1/name> ?name .
}

The solutions are:

name

"Emma"
"Rachel"

Filters

Graph matching allows finding patterns in the graph; filters allow the values in a solution to be restricted. Following are a few comparisons with filters.

String Matching: SPARQL provides an operation based on regular expressions to test strings, which includes SQL "LIKE"-style tests. The syntax is:

regex(?y, "pattern" [, "flags"])

The flags argument is optional. The flag "i" indicates that a case-insensitive pattern match is needed. For example, the following query finds names containing "E" or "e".

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?b
WHERE {
    ?a foaf:name ?b .
    FILTER regex(?b, "e", "i")
}

with the results:

b
"Emma"
"Rachel"

Testing Values: There are requirements when the application wants to filter on the value of a variable. In the graph given above, assume that there exists an extra field for age. Following is an extract of the data with the age property:

<...> a foaf:Person ;
    foaf:name "Rachel" ;
    foaf:age 35 ;
    foaf:gender "Female" ;
    foaf:knows ( <...> [ foaf:name "Emma" ] ) .

<...> a foaf:Person ;
    foaf:name "Emma" ;
    foaf:age 5 ;
    foaf:gender "Female" ;
    foaf:knows ( <...> [ foaf:name "Rachel" ] ) .

A query to find the names of people who are younger than 40 is:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?var
WHERE {
    ?var foaf:age ?age .
    FILTER (?age < 40)
}

The solution is:

var
<...>
<...>

2.3 Graph Pattern Matching [GT14]

RDF data intrinsically represents a labeled, directed multi-graph in which every node represents a subject or object and every edge represents a predicate. A directed

edge exists from the subject to the object for every subject-predicate-object triple in the data. As the size of the database increases, the size and complexity of this graph increase, and it becomes inefficient to find patterns in the data by visualizing it. Graph pattern matching refers to the problem of finding specific patterns in a given graph. The problem is popular in graph processing because it has widespread application in plagiarism detection, intelligence analysis, social networking, biology, chemistry, knowledge discovery and numerous other areas. The graph pattern matching problem can be implemented in various data models including the Resource Description Framework (RDF), property graphs and the relational data model.

General Definition of Graph Pattern Matching

Graph pattern matching is also referred to as sub-graph isomorphism, which is an NP-complete problem. The inputs of the subgraph pattern matching problem are:

- a graph G with nodes V and edges E, where nodes and edges are labeled with strings. The label of a node v is label(v) and the label of an edge e is label(e).
- a query pattern, which can also be viewed as a graph P = (Vp, Ep). The nodes and edges of graph P describe conditions that a subgraph of G must satisfy in order to be a match. Generally, the query pattern P is a conjunction of smaller patterns that together impose requirements on nodes and their neighborhoods in the data graph G.

The graph pattern matching problem is to find all possible subgraphs of the given graph G that match a given pattern P. More precisely, a match is defined by:

- a structural match or isomorphism between P and a candidate solution, and
- conditions on labels of specific nodes and edges in a candidate solution.

We referred to a study conducted in research paper [GT14] of how a wide spectrum of systems handle and support the graph pattern matching problem.
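Before turning to that study, the conjunctive matching defined above can be made concrete with a toy matcher (a Python sketch, not a real SPARQL engine; the data reuses the FOAF example from Section 2.2, and a FILTER-like regex step is shown at the end):

```python
# A toy conjunctive pattern matcher: every triple pattern must match, with
# the same value used wherever a variable name repeats. Illustrative only.
import re

DATA = [
    ("_:g1", "rdf:type", "foaf:Person"),
    ("_:g1", "foaf:name", "Emma"),
    ("_:g1", "foaf:gender", "Female"),
    ("_:g3", "rdf:type", "foaf:Person"),
    ("_:g3", "foaf:name", "Rachel"),
    ("_:g3", "foaf:gender", "Female"),
]

def match(patterns, data, binding=None):
    """Yield variable bindings satisfying all (s, p, o) patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    (s, p, o), rest = patterns[0], patterns[1:]
    for triple in data:
        b = dict(binding)
        ok = True
        for term, value in zip((s, p, o), triple):
            if term.startswith("?"):
                if b.setdefault(term, value) != value:
                    ok = False       # variable already bound to another value
                    break
            elif term != value:
                ok = False           # constant does not match
                break
        if ok:
            yield from match(rest, data, b)

# All female persons' names, then a FILTER-regex-like restriction:
solutions = list(match([("?x", "foaf:gender", "Female"),
                        ("?x", "foaf:name", "?name")], DATA))
filtered = [s for s in solutions if re.search("e", s["?name"], re.I)]
```

The shared variable `?x` is what forces the join between the two patterns, just as in the basic-pattern example earlier in this chapter.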
The approach in the study was to take the popular LUBM benchmark, model it across various domains

(relational, RDF, property graph), and execute the benchmark queries on the corresponding systems. The systems were evaluated using a large data instance on a single machine (the largest dataset being LUBM-8000, which contains over 1 billion RDF triples). The paper concluded that the graph pattern matching problem can be expressed in different data models, such as the Resource Description Framework, property graphs and the relational model, and can be evaluated using systems based on all these data models; but contrary to popular belief and various vendors' claims, modern native graph stores do not necessarily offer a competitive advantage over traditional relational and RDF stores, even for the graph-specific problem of pattern matching.

Types of Graph Patterns

Nodes and Edges: This is the most basic graph pattern and consists of a single triple pattern that satisfies given conditions. No join is necessary in this case. The example query given below requests all undergraduate students.

Query:
select ?x where {
    ?x isa "UndergraduateStudent"
}

Neighborhoods and Stars: Neighborhoods are the second most important type of graph pattern. Neighborhood patterns match the nodes that are adjacent and the edges that are incident to a given node. An example query to find neighbors could ask for all graduate students that attend a certain course; all the nodes adjacent to the two nodes labeled GraduateStudent and GraduateCourse0 would be answers to the query. Star patterns are neighborhoods with multiple patterns around a central node.

Query:
select ?name ?email ?telephone where {
    ?x worksfor ?y .
    ?y id "University0" .
    ?x name ?name .

  ?x email ?email .
  ?x telephone ?telephone
}

Triangles: Triangle patterns look for three nodes adjacent to each other. An example is the following query:

select ?x ?y ?z where {
  ?x undergraduatedegreefrom ?y .
  ?x memberof ?z .
  ?z suborganizationof ?y
}

Matching triangles is especially challenging from the point of view of query optimization, as potentially many intermediate results need to be processed.

Fixed and Variable Length Paths: Paths of fixed or variable length are a special case of graph patterns. Paths of fixed length occur, for example, when the names of students who attend a given course are queried. Variable length paths are commonly used to match hierarchical relationships. In the example query given below, all direct and transitive paths between variables ?y and ?z with edge label sub-organizationof must be matched. Answering a path query requires a series of joins (in triple stores or RDBMSs) or some form of breadth-first expansion on the graph (in native graph stores), depending on the internal graph representation of a particular system. SPARQL supports path queries through property paths. Property paths can represent paths between two graph nodes, which may be a trivial single edge path of length 1, or an arbitrary length path. SPARQL uses property path expressions, which are similar to regular expressions, to represent paths. During query execution, matches for the path expression are evaluated. The limitation of property paths is that they do not return the paths themselves; only bindings for source and destination nodes can be obtained using them.

Example Query:
select ?x ?y where {

  ?x worksfor ?y .
  ?y isa Department .
  ?y sub-organizationof* ?z
}

Data Models and Graph Pattern Matching

RDF: RDF and the SPARQL query language are the typical data representation and querying formats in the domain of the Semantic Web. RDF views the data as a collection of triples, each describing a statement using a subject-predicate-object format. An object can either be a literal value, or a URI which refers to a subject. The RDF representation naturally describes a graph where subject and object are nodes, and the predicate is the label of an edge between them.

Property Graph: The property graph model extends the RDF concept and allows nodes and edges in the graph to have an arbitrary number of properties or key/value pairs, in addition to labels. The study conducted in [GT14] uses a natural mapping between the RDF and PG models. Whenever a subject-predicate-object triple has a literal value for the object, the predicate and object are modeled as a property-value pair of the subject node. If the object is a URI, the triple describes an edge between the subject and object nodes, labeled with the predicate. This way, not all the triples in the RDF graph become edges in the PG model, which makes it a conceptually less verbose model.

Cypher Query Language: Neo4j's Cypher is a declarative query language to query data in the property graph model. A Cypher query allows specifying a subgraph in the MATCH clause, with constants and variables in place of the nodes and edges to be matched, as in a SPARQL query. Similar to SPARQL's FILTER clause, additional constraints on properties of nodes and edges can be expressed in the WHERE clause. Unlike in SPARQL, relationships between nodes can have multiple properties. It is also possible to query the paths between nodes.

Query by API: Querying by API is generally performed when the graph store does not support any declarative language, for example in the Sparksee system.
In this case, all the cases of graph pattern matching, i.e. matching neighborhoods,

triangles and paths, are performed via multiple calls to API functions that return nodes/edges of a given label and immediate neighbors of a given node.

Relational Model: Graphs and pattern matching problems can arise and be solved in the relational domain as well. In the case of the relational data model, it is assumed that every node belongs to one particular type (e.g., Student) with a fixed set of statically determined properties. These node types are translated into relations, and the relationships between nodes are stored in mapping tables. After the schema for the entire dataset is defined, indexes can be created to speed up lookups and joins. SQL queries for graph pattern matching are very verbose due to the fact that relationships are stored as separate tables; a single hop lookup in the graph conceptually decodes into two joins. The advantage of using SQL and the relational model for graph pattern matching is that it allows us to leverage decades of development in transactional processing, query optimization and system tuning. Relational database management systems (RDBMSs) can become an especially lucrative option for "hybrid" datasets, in which only part of the data is a graph, while some information comes in tables.

These are the ways in which graph pattern matching problems can be conceptually modeled in the different data models.

Systems Supporting the Graph Pattern Matching Problem

RDF Databases:

Virtuoso: Virtuoso is an existing relational store that models the RDF graph as a single table and translates SPARQL queries into SQL.

TripleRush: TripleRush is a research RDF database that represents the triple data as partially evaluated read-optimized patterns, matches a given query's graph pattern against those in parallel, and then builds the results from the matched parts. This can be correlated to join indices.

Relational Databases:

Virtuoso: Virtuoso is an RDF as well as a relational store. It can be used to find graph pattern matches using SQL queries. SQL queries can be written using domain knowledge to simplify matching the variable length paths, but to avoid making such assumptions about the data, one would need to rely on recursive SQL features.

Graph Databases:

Neo4j: Neo4j is an open-source native graph database. It supports functionality similar to traditional RDBMSs, such as full transactional support, a declarative query language (Cypher), and availability and scalability through a distributed version. A major advantage of Neo4j is its intuitive way of modeling and querying graph-shaped data. It stores edges as doubly linked lists, and properties are stored separately, referencing the nodes with the corresponding properties.

Sparksee: Sparksee is a proprietary native graph database. It is a disk-based system that depends on B+-trees and compressed bitmap indexes to store nodes and edges with their properties. Sparksee uses custom API functions to provide access to data. The API contains a set of primitive operations on nodes and edges, like adding and deleting nodes or extracting neighborhoods. The system additionally provides native implementations of core graph algorithms such as connected component detection, shortest paths and different traversals. Some of the use-cases for Sparksee include various types of graph analysis, such as cluster and outlier detection.

2.4 Generalized Graph Pattern Matching

We refer to the problem of generalized graph pattern matching as a general form of subgraph extraction query in which we are able to retrieve specific information about the structure of the data entities in the graph by involving variables and constraints in the query. Such functionality is of immense use as it provides a wider range of data exploration and querying.
[Brü08] talks about generalized graph matching in relational databases for data mining and information retrieval purposes. It extends the idea of subgraph isomorphism to include

queries with don't-care symbols, variables and constraints. The representation of relational data in a graph model makes it possible to represent not only the values of the entities but also to explicitly model structural relations between different parts of an object. Typically, data mining aims to extract a subgraph of the underlying data graph, and the concept of subgraph isomorphism is applied for such information retrieval. Subgraph isomorphism is a formal concept for checking subgraph equality, but it intuitively indicates that a smaller graph is part of a larger graph. Assuming that a query is represented by an attributed graph q (the query graph), and the database graph is G, the knowledge mining system gives the result of the query as a binary decision (true or false) depending on whether the query graph q is contained in the database graph G. This kind of subgraph isomorphism approach to data extraction has some limitations, some of which are relevant only to relational databases. Firstly, the database graph has a large number of attributes that might be irrelevant for a particular query. Secondly, the result of such a query is always a binary decision and cannot be used to extract data points or paths between the entities. Thirdly, this approach does not allow the query to impose constraints on the attributes of the query graph to model restrictions or dependencies. The generalized subgraph isomorphism approach discussed in [Brü08] overcomes these limitations. Thus [Brü08] uses a generalized graph pattern matching approach to enhance the querying capability, leading to a powerful and flexible graph pattern matching framework apt for general graph-based data mining. In our research we refer to generalized graph pattern matching with respect to unstructured data.
With an increasing number of organizations contemplating the value of adopting Semantic Web technologies, the deciding factor becomes the degree to which the needs of their applications are supported. There is a considerable amount of effort being made to develop storage and querying facilities for the Semantic Web, but there is still a wide gap between the present support and the needs of specific types of applications. For example, investigative and analytical applications in areas such as national security and business intelligence need to identify connections or relationships between entities in the data. In such cases, structural queries like "Find the relationship between P, Q and R" come into play; and this is where we intend to use the generalized graph pattern matching paradigm. Such queries specify some anchor points in the data graph and aim to "extract" a data subgraph connecting these anchor points. The queries might include some constraints on the type of connections to be included or omitted in the result subgraph. A

few example analysis tasks that use such queries are given below:

Example 1: Flight and Airport Risk Assessment (adapted from [Any07]): To determine potential threats to flight and airport safety, security officials would query for and investigate all high risk passengers scheduled for a flight. Find the relationships between passengers scheduled for flights to Washington DC, who purchased their tickets by cash or purchased their tickets less than a day before departure, and have links to flight training.

Example 2: Local Threat Assessment: To determine threats to local safety, for example by civilians as in the San Bernardino shooting incident, the local security officials would like to regularly query and investigate high risk residents in the area. Find the relationships between people who recently purchased weapons or are frequent visitors at a shooting range, and have been in touch with any person suspected of terrorism.

Example 3: Background Check for Potential Hires: To determine whether a candidate is a potential risk to the organization, the management would want to perform a background check to find out about any criminal behavior. Find the relationships between candidates in the organization's interview process, and entities in the government watchlist.

All the example queries aim to retrieve paths or subgraphs connecting specific nodes, where the paths are subject to some constraints, for example, "associated with flight training" or "have been in contact with potential terrorist suspects". Such queries can have very important applications where network analysis plays a significant role, as in the case of planning, anti-money laundering or detecting patent infringements.
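The "find the relationships between P and Q that involve X" style of query above can be sketched as a constrained path enumeration over a data graph. This is an illustrative toy, not the evaluation strategy of this thesis or of [Any07]; all node and predicate names are hypothetical.

```python
# Toy subgraph-extraction sketch: enumerate simple paths between two anchor
# nodes, keeping only paths that pass through a required set of nodes.
# Illustrative only; names and data are hypothetical.

def constrained_paths(triples, src, dst, require=frozenset(), max_len=6):
    """Paths from src to dst as lists of (predicate, node) hops, filtered so
    that every node in `require` appears on the path."""
    adj = {}
    for s, p, o in triples:
        adj.setdefault(s, []).append((p, o))
    results = []

    def dfs(node, visited, path):
        if node == dst:
            if require <= visited:          # constraint on intermediate nodes
                results.append(list(path))
            return                          # stop: simple paths ending at dst
        if len(path) >= max_len:
            return
        for pred, nxt in adj.get(node, []):
            if nxt not in visited:          # simple paths only
                visited.add(nxt)
                path.append((pred, nxt))
                dfs(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    dfs(src, {src}, [])
    return results

# "Find the relationships between P and Q that involve X":
triples = [("P", "knows", "X"), ("X", "knows", "Q"), ("P", "knows", "Q")]
via_x = constrained_paths(triples, "P", "Q", require=frozenset({"X"}))
```

With the constraint, only the path through X survives; without it, the direct edge P-Q is returned as well, which mirrors how the example queries restrict the kinds of connections admitted into the result subgraph.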
Unfortunately, to the best of our knowledge, the existing Semantic Web query languages do not support the expression of such queries about constrained path relationships.

SPARQ2L

SPARQ2L [Any07] is an initial proposal to extend SPARQL to be able to do generalized graph pattern matching. It addresses the need for applications in analytical domains to be able to find connections between a set of entities and to query the structure of the data. Such queries aim to extract relationships or paths between the entities, and will often specify constraints that the qualifying paths must satisfy. Most present day Semantic Web

query languages, including SPARQL, do not support the ability to extract paths based on queries about arbitrary path structures in the data.

The SPARQ2L Language

SPARQ2L proposes a query language which extends SPARQL with path variables and path variable constraint expressions. The language supports three kinds of constraints:

Constraints on Nodes and Edges: constrain paths based on the presence or absence of specific nodes and edges.

Cost-based constraints: constrain paths based on their cost, in weighted graphs.

Structure-based constraints: constrain paths depending on their structural properties, for example, being a simple path, or the presence of a pattern in the path.

The SPARQ2L language introduces the concepts of path variables and path filter expressions to be able to express generalized graph pattern matching queries. It extends the query patterns supported to include RDF path patterns, which generalize standard SPARQL graph pattern expressions to include triple patterns with path variables in the predicate position. A few terms used in SPARQ2L are:

RDF Term: collectively refers to I, L and B, which are pairwise disjoint infinite sets of IRIs, literals and blank nodes respectively.

RDF Triple: a 3-tuple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where s is the subject, p is the predicate and o is the object.

Directed Edge-Labeled Graph: a graph G = (V, E, λ, Σ) such that E ⊆ V × V, and λ is a function from E to a set of labels Σ, i.e. λ : E → Σ.

RDF Triple Graph: the directed edge-labeled graph G = ({s, o}, {(s, o)}, λ, {p}), with λ((s, o)) = p, for an RDF triple (s, p, o).

RDF Database Graph: the directed edge-labeled graph formed from the union of the triple graphs for t1, t2, ..., tn, for a set of triples t1, t2, ..., tn.

RDF Path: an RDF path from node x to node y in an RDF database graph G is a sequence of triples (x, p1, o1), (s2, p2, o2), ..., (sk, pk, y) such that oi = si+1 for i = 1, 2, ..., k-1.

Simple RDF Path: an RDF path such that for all i, j, i ≠ j implies oi ≠ oj, i.e., no node is repeated on the path.

Let V_N (regular variables) and V_P (path variables) be two pairwise disjoint sets of variables that are also disjoint from I ∪ L ∪ B.

Triple Pattern: a tuple in (I ∪ L ∪ V_N) × (I ∪ V_N) × (I ∪ L ∪ V_N). The set of all triple patterns is T.

Path Triple Pattern: a triple pattern with a path variable in the predicate position.

Path Pattern Expression: like a SPARQL triple pattern, except that it can contain path variables in the predicate position. It can consist of a SPARQL graph pattern, a path triple pattern and some built-in path filter conditions.

Regular Expression over a set X: if x ∈ X, then x, (x)*, (x)+ and (x)? are regular expressions. If x and y are regular expressions, then x·y and x|y are also regular expressions.

T-Regular Expression: a regular expression over a triple pattern, or an extended regular expression of the form ([s, .], p, [., o])+ where (s, p, o) is a triple pattern. An extended regular expression matches a path such that the subject of the first triple in the path is s, the object of the last triple is o, the symbol . matches arbitrary intermediate nodes on the path, and all the predicates on the path are p. R(T) is the set of regular and extended regular expressions over T.

SPARQ2L defines a Path Built-in Condition as an expression built from I ∪ V_P ∪ L ∪ R(T), logical operators (¬, ∧), comparison operators (=, <) and path built-in functions, where the path built-in functions are:

containsAny: (V_P, 2^I) → Boolean
containsAll: (V_P, 2^I) → Boolean
containsPattern: (V_P, R(T)) → Boolean

isSimple: V_P → Boolean
cost: V_P → ℝ

Given a path variable ??p ∈ V_P, a constant c, a set of IRIs C ⊆ I and a T-regular expression TP, the path built-in conditions are: cost(??p) = c, containsAny(??p, C), containsAll(??p, C), containsPattern(??p, TP), and isSimple(??p). If BC1 and BC2 are path built-in conditions, then (¬BC1) and (BC1 ∧ BC2) are path built-in conditions.

SPARQ2L defines a Path Pattern Expression recursively as:

- a 3-tuple q ∈ (I ∪ V_N ∪ L) × V_P × (I ∪ V_N ∪ L), called a path triple pattern, is a path pattern.
- if GP is a SPARQL graph pattern and PP is a path pattern, then (PP AND GP) is a path pattern.
- if PP is a path pattern and F is a path built-in condition, then (PP PATHFILTER F) is a path pattern.

A SPARQL graph pattern can be semantically defined using a function [[·]] whose input is a pattern expression and whose output is a set of mappings, where a mapping µ is a partial function from V_N to RDFT, with RDFT = I ∪ L ∪ B. dom(µ) is the subset of V_N on which µ is defined. 2^RDFT is the set of possible tuples over RDFT.

SPARQ2L defines a pmapping ω as a partial function from (V_P ∪ V_N) to (2^RDFT ∪ RDFT) such that ω(v_p ∈ V_P) = p ∈ 2^RDFT and ω(v_n ∈ V_N) ∈ RDFT. For a path triple pattern tp, ω(tp) is the tuple formed by substituting any variables v_n ∈ V_N and v_p ∈ V_P in tp according to ω. dom(ω), the domain of ω, is a subset of V_P ∪ V_N. A mapping µ is compatible with a pmapping ω if, whenever x ∈ dom(µ) ∩ dom(ω), µ(x) = ω(x). The join of a set of mappings Ω and a set of pmappings Θ is defined as Ω ⋈ Θ = {µ ∪ ω | µ ∈ Ω, ω ∈ Θ are compatible}.

A Path Pattern Solution is the solution of a path pattern PP over an RDF dataset D over RDFT, denoted [[PP]]_D, where tp is a path triple pattern whose variables are given by var(tp) and GP is a graph pattern. It is defined recursively as:

[[tp]]_D = {ω | dom(ω) = var(tp) and ω(tp) forms a path in D}

[[(PP AND GP)]]_D = [[PP]]_D ⋈ [[GP]]_D

A pmapping ω satisfies a built-in condition F, written ω ⊨ F, for path patterns with PATHFILTER expressions, as follows. Here I denotes a subset of the set of IRIs and tr is a T-regular expression.

- F is containsAny(??p, I), ??p ∈ dom(ω), and I ∩ ω(??p) ≠ ∅.
- F is containsAll(??p, I), ??p ∈ dom(ω), and I ⊆ ω(??p).
- F is containsPattern(??p, tr), ??p ∈ dom(ω), and ground(tr) is a subpath of ω(??p).
- F is isSimple(??p), ??p ∈ dom(ω), and for x, y ∈ ω(??p), x ≠ y.
- F is (¬F1), F1 is a built-in condition, and ω ⊭ F1.
- F is (F1 ∧ F2), F1 and F2 are built-in conditions, and ω ⊨ F1 and ω ⊨ F2.

A few example queries using the SPARQ2L grammar are:

Non-Simple Path Query: Find any feedback loops that involve the compound Methionine.

SELECT ??p
WHERE {
  ?x ??p ?x .
  ?z compound:name "Methionine" .
  PathFilter(containsAny(??p, ?z))
}

Path Query with Terminal Node Constraints: Is EmployeeA connected in any way to entities on the government watchlist?

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x foaf:name "EmployeeA" .
  ?y rdf:type sec:government_watchlist .
}

Path Query with Constraint on Intermediate Nodes: Find the paths of influence of the Mycobacterium Tuberculosis (MTB) organism on PI3K signaling pathways.

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x bio:name "MTB Surface Molecule" .
  ?y rdf:type bio:cellular_response_event .
  ?z rdf:type bio:pi3k_enzyme .
  PathFilter(containsAny(??p, ?z))
}

Path Query with Path Length Constraint: Find all close connections (< 4 hops) between SalesPersonA and CIO-Y.

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x foaf:name "SalesPersonA" .
  ?y company:is_cio ?z .
  ?z company:name "CompanyY" .
  PathFilter( cost(??p) < 4 )
}

Path Query with Path Pattern Constraint: Find social relationships between potential jurors and a defendant.

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x foaf:name "DefendantX" .
  ?y foaf:name "JurorY" .
  PathFilter( containsPattern(??p, ([?a, .] foaf:knows [., ?b])+) )
}

The SPARQ2L Query Evaluation Framework

SPARQ2L also proposes a novel query evaluation framework for solving path problems using efficient algebraic techniques. This allows path queries to be efficient even on disk-resident RDF graphs. The framework uses an algebraic technique, similar to Gaussian elimination for solving a system of linear equations by LU decomposition, to solve path problems. To solve a system of equations using Gaussian elimination, a matrix representing

the system of linear equations Mx = b is decomposed into two triangular matrices L and U. The system Ly = b (frontsolving) is solved first, and then y is substituted into Ux = y to solve for the vector x. The triangular systems L and U can be reused to solve for different values of b. This allows the computationally dominant LU decomposition phase to be reused for different problem instances. A variety of path problems can be solved by interpreting the sum and product operations appropriately. Solving a path problem instance using the triangular matrices involves processing each triangular matrix in a specific order. The SPARQ2L framework focuses on indexing and storing the contents of these matrices so that the system can skip processing submatrices that are irrelevant to a query.

The SPARQ2L system for multi-paradigm RDF querying includes support for pattern matching queries, path queries and keyword queries. The first step is to load RDF Schema and data documents into internal graph data structures. Then different preprocessing steps are performed on the data, which produce relevant indexes on the data for each of the querying paradigms, for example, Pattern Matching Indexes stored in the Pattern Match Store and the Path Index stored in the Path Store. The Query Processor Module comprises three different kinds of query processors, one for each type of query. A query, however, may be processed by multiple processors; for example, a path query may have some constraints that involve standard graph pattern matching. The data preprocessing phase for path query processing involves construction, labeling and indexing of a graph's path sequence. The LU decomposition phase is used to compute partial path summaries, which means that for certain pairs of nodes, some of the paths connecting the nodes are computed at this phase.
The path summaries are a concise representation of path information, as opposed to an enumerated listing of paths. Assuming that we have the triples (x, p1, y), (x, p2, y), (y, p3, z) represented as labeled edges in an RDF graph, the paths from x to z can be summarized as ((p1 | p2) · p3). A triple of such a regular expression and the source and destination nodes is referred to as a P-Expression, e.g. (((p1 | p2) · p3), x, z). These p-expressions can be treated as strings for discussion purposes, but the SPARQ2L approach uses a more efficient implementation, where p-expressions are represented using a binary encoding scheme that enables the path filtering step for path constraint evaluation to be performed efficiently using bit operations.

The LU decomposition phase of the preprocessing requires that the RDF graph G be ordered, G_α = (G, α), where α : {1, 2, ..., N} → V(G), so that α(i) maps to some node v in G, i.e.

v ∈ V(G). Conversely, α⁻¹(v) maps a node in G to an integer between 1 and N. At the end of the LU decomposition, the elements of M satisfy one of the following conditions, for u, v ∈ V(G):

- M[α⁻¹(u), α⁻¹(v)] for α⁻¹(u) ≥ α⁻¹(v) contains a p-expression representing exactly the paths from u to v that do not contain any intermediate vertex w such that α⁻¹(w) > α⁻¹(v).
- M[α⁻¹(u), α⁻¹(v)] for α⁻¹(u) < α⁻¹(v) contains a p-expression representing exactly the paths from u to v that do not contain any intermediate vertex w such that α⁻¹(w) < α⁻¹(u).

Preprocessing begins by initializing M[i, j] for 1 ≤ i, j ≤ N with a p-expression representing a union of the set of edges between the nodes α(i), α(j). This union p-expression is then systematically updated to represent other paths that satisfy the above constraints. A naive algorithm for the LU decomposition phase runs in O(N³) time.

The path sequence for G is the sequence of p-expressions (Xi, ui, vi) with α⁻¹(ui) ≤ α⁻¹(vi), in increasing order of α⁻¹(ui), followed by the sequence of p-expressions (Xi, ui, vi) with α⁻¹(ui) > α⁻¹(vi), in decreasing order of α⁻¹(ui).

Support for efficient evaluation of path queries on disk-based databases requires an effective disk storage model for graphs. A path sequence has the Single-Scan-Path-Preserving property, which means that for any given node u in G, it is possible to compute complete path information for u by aggregating the partial path fragments during a single scan of the path sequence. This makes it possible to index the sequence using a B+ tree and then process queries using modified range queries. To minimize the width of the range retrieved to process each query, the framework clusters p-expressions on the path sequence based on their likelihood of being relevant or irrelevant for the same class of queries.
This approach minimizes the number of disk requests and disk-seek operations needed when evaluating queries. A more fragmented organization of relevant and irrelevant p-expressions, on the other hand, would lead to queries touching many small relevant clusters scattered across the sequence, and hence many more disk-seek operations. This clustering is achieved logically by using a graph numbering or labeling scheme that assigns numbers in contiguous intervals to groups of related nodes, and thus to their associated p-expressions. The framework uses a graph labeling mechanism to identify groups of related nodes, and develops an effective sequential representation for a graph which associates key values derived from the hierarchical labeling with elements of a path sequence.
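The union and concatenation steps that build these path summaries can be sketched over string p-expressions. This is an illustration only; the actual SPARQ2L implementation uses the binary encoding mentioned above rather than strings.

```python
# Toy p-expressions: (regular-expression string, source, destination).
# Union combines parallel paths; concatenation chains consecutive ones.

def pe_union(pe1, pe2):
    (r1, s1, d1), (r2, s2, d2) = pe1, pe2
    assert (s1, d1) == (s2, d2), "union needs identical endpoints"
    return ("(" + r1 + " | " + r2 + ")", s1, d1)

def pe_concat(pe1, pe2):
    (r1, s1, d1), (r2, s2, d2) = pe1, pe2
    assert d1 == s2, "concatenation needs matching endpoints"
    return ("(" + r1 + " . " + r2 + ")", s1, d2)

# The example from the text: triples (x, p1, y), (x, p2, y), (y, p3, z).
xy = pe_union(("p1", "x", "y"), ("p2", "x", "y"))
xz = pe_concat(xy, ("p3", "y", "z"))
```

Composing the two parallel x-to-y edges and then the y-to-z edge yields the summary ((p1 | p2) . p3) for all x-to-z paths, without enumerating the paths individually.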

In the path query processing phase, the Path Finder evaluates a query by successively retrieving the relevant p-expressions from disk and composing them into larger p-expressions that make up the solution. The Path Finder achieves this using the Path-Solve algorithm, which begins by initializing a matrix that keeps track of the composed p-expressions and is filled as the algorithm proceeds.

2.5 Solve Algorithm

To implement support for generalized graph pattern matching, we need to implement physical and logical operators integrated into the Jena framework. The physical operator comprises the algorithm that works on the data graph and the query and extracts the required information. We refer to a graph-theoretical framework [Tar81], which has been extended in [GA13] into an implementation framework for databases that allows graphs on disk to be queried efficiently.

Tarjan introduces an algorithm [Tar81] to find single-source path expressions, i.e. a regular expression P(s, v) for each vertex v which represents the set of all paths in a directed graph G = (V, E) from a source vertex s to v, such that σ(P(s, v)) contains all paths from s to v. The algorithm works by dividing G into components, computing path expressions on the components by Gaussian elimination, and combining the solutions. The algorithm's time complexity is O(m α(m, n)), plus the time to compute path expressions within the components, where n is the number of vertices in G, m is the number of edges in G, and α is a functional inverse of Ackermann's function. If G is a reducible flow graph and each component of G is a single vertex, then the method requires O(m α(m, n)) time in total.

The Gaussian elimination method consists of two steps: LU decomposition, where a matrix A is decomposed into L (lower triangular) and U (upper triangular); and frontsolving (Ly = b) and backsolving (Ux = y) to solve the system of linear equations Ax = b.
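The two steps can be illustrated on a tiny 2x2 numeric system. This is illustrative only; the path-algebra analogue replaces numeric sum and product with union and concatenation of path expressions.

```python
# Solve A x = b via LU decomposition, frontsolving and backsolving,
# on a 2x2 example with plain floats (no pivoting; illustrative only).
A = [[4.0, 3.0],
     [6.0, 3.0]]
b = [10.0, 12.0]

# LU decomposition: A = L U, with L unit lower triangular, U upper triangular.
l21 = A[1][0] / A[0][0]
U = [[A[0][0], A[0][1]],
     [0.0, A[1][1] - l21 * A[0][1]]]

# Frontsolving: L y = b
y = [b[0], b[1] - l21 * b[0]]

# Backsolving: U x = y
x2 = y[1] / U[1][1]
x1 = (y[0] - U[0][1] * x2) / U[0][0]
```

Reusing L and U for a different right-hand side b repeats only the cheap frontsolve and backsolve steps, which is exactly the kind of reuse SPARQ2L exploits for different path problem instances.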
Given a path sequence (P1, v1, w1), (P2, v2, w2), ..., (Pl, vl, wl), the single-source path expression problem for any source s can be solved by the following propagation algorithm SOLVE, where a path sequence for a directed graph G is a sequence (P1, v1, w1), (P2, v2, w2), ..., (Pl, vl, wl) such that (i) Pi is an unambiguous path expression of type (vi, wi) for 1 ≤ i ≤ l; (ii) if vi = wi, then Λ ∈ σ(Pi), for 1 ≤ i ≤ l; (iii) for any non-empty path p in G, there is a unique

sequence of indices 1 ≤ i1 < i2 < ... < ik ≤ l and a unique partition of p into non-empty paths p = p1, p2, ..., pk such that pj ∈ σ(Pij) for 1 ≤ j ≤ k.

Figure 2.3 The algorithm SOLVE to solve the single-source path expression problem. [Tar81]

SOLVE is a generalization of the frontsolving-backsolving step in Gaussian elimination, and its running time is O(n + l). To solve a single-source path expression problem on a graph G, SOLVE is applied once after constructing the path sequence. To solve an all-pairs path expression problem, SOLVE is applied n times, once for each possible source, after constructing the path sequence. Step 1 of the Gaussian elimination method is used to construct the path sequence of a graph, using the algorithm ELIMINATE. For dense graphs the time complexity is O(n³ + m) and the space complexity is O(n²).

The SOLVE algorithm is a propagation algorithm that evaluates path expressions for a single source. To evaluate path expressions for multiple sources, the algorithm needs to be iterated for each source; in that case the cost of the algorithm grows by a factor of n, where n is the number of nodes. The general purpose algorithm [Tar81] for solving any path problem on a given graph is extended, and an implementation framework provided, in [GA13], which utilizes sharing of computations based on suffix equivalence. Suffix equivalence among subqueries exploits the fact that multiple subqueries with different prefixes can share a suffix, and hence share the computation of the shared suffixes, which allows prefix path computations to share common suffix path computations. Even though this approach does not change the theoretical complexity, it results in orders of magnitude

better performance than existing graph navigational techniques due to reduced I/O. The existing graph navigational techniques require the decomposition of MSMD queries into multiple single-source or single-destination path subqueries, each of which is solved independently, and typically generate very poor I/O access patterns for large, disk-resident graphs; for MSMD path queries, such poor access patterns may be repeated if common graph exploration steps exist across subqueries.

Figure 2.4 The algorithm ELIMINATE is used to pre-compute the path sequence for an input graph G whose vertices are numbered from 1 to n. [Tar81]

As part of the pre-computation step we use for the path operator, the input RDF file is read and converted to a graph model instance with nodes and edges, which is stored as an adjacency list. The algorithm then computes strongly connected components for the graph and creates path sequences for each strongly connected component. These path sequences are then combined to create one path sequence for the entire graph, using the ELIMINATE algorithm [Tar81]. B+ trees are used to index the path sequence information, using the labeling scheme discussed in [Any07]. Path expressions in the path sequence are clustered together based on their likelihood of being relevant for similar queries. This minimizes the disk request and seek operations needed to evaluate path queries. The path sequence information is then stored at the metadata location, and can be used at query execution time to find path expressions.
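The propagation step of SOLVE described in this section can be sketched as follows. P-expressions are plain strings here, with None standing for the empty set and the empty string for the null path Λ; this is an illustrative sketch of [Tar81]'s SOLVE, not the thesis implementation.

```python
# An illustrative sketch of Tarjan's SOLVE propagation over a path sequence.
# None stands for the empty set, '' for the null path Λ.

def solve(path_sequence, nodes, s):
    """Compute a path expression P[v] for every vertex v, for source s."""
    P = {v: None for v in nodes}
    P[s] = ""                               # P(s, s) := Λ
    for expr, v, w in path_sequence:
        if P[v] is None:                    # nothing known from s to v yet
            continue
        if v == w:
            P[v] = _concat(P[v], expr)      # loop at v: P(s,v) := P(s,v) . expr
        else:
            P[w] = _union(P[w], _concat(P[v], expr))
    return P

def _concat(a, b):
    if a == "":
        return b
    if b == "":
        return a
    return "(" + a + " . " + b + ")"

def _union(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return "(" + a + " | " + b + ")"

# Path sequence for the chain a -e1-> b -e2-> c, with source a:
expressions = solve([("e1", "a", "b"), ("e2", "b", "c")], ["a", "b", "c"], "a")
```

A single left-to-right pass over the path sequence suffices because of the Single-Scan-Path-Preserving property; for multiple sources, the same pass is simply repeated per source, which is the n-fold cost noted above.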

2.6 Apache Jena [Arc] [Jen]

Apache Jena is a free and open-source Java framework, composed of different APIs that interact to process RDF data, for building Semantic Web and Linked Data applications. The Jena API enables developers to extract data from and write to RDF graphs, which are represented as an abstract "model". Jena supports the serialization of RDF graphs to:

- a relational database
- RDF/XML
- Turtle
- Notation 3

Jena Architecture

Jena, at its core, stores information as RDF triples in directed graphs, and allows application code to add, remove, store, manipulate and publish that information. Jena comprises a number of major subsystems with clearly defined interfaces between them. Jena's RDF API is used to access RDF triples and graphs and their various components. The API has support for adding and removing triples to graphs and for basic graph pattern matching. Jena allows the developer's code to read in RDF from external sources, files or URLs, and to serialize a graph in correctly formatted text form. Both input and output support most of the commonly used RDF syntaxes. Typical abstractions in this API are:

- Resource, representing an RDF resource
- Literal, representing data values
- Statement, representing an RDF triple
- Model, representing the whole graph
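For example, a single RDF statement serialized in Turtle, one of the supported syntaxes (the example.org resource here is a hypothetical name, not from the thesis):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/alice>  foaf:name  "Alice" .
```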

Figure 2.5 The Jena architecture illustrating interaction between the APIs. [Arc]

Jena supports a rich programming interface to Model, but internally it stores the RDF graph in a simpler abstraction named Graph. This allows Jena to use a variety of different storage strategies that conform to the Graph interface equivalently. Jena can store a graph as an in-memory store, as a persistent store using a custom disk-based tuple index, or in an SQL database. The Graph interface can be extended to connect other stores, such as LDAP, to Jena, by writing an adapter that allows calls from the Graph API to work on that store.

Jena's inference API supports a vital feature of Semantic Web applications: the semantic rules of RDF, RDFS and OWL can be used to infer information that is not explicitly stated in the graph. For example, if Z is a descendant of Y, and Y is a descendant of X, then by implication Z is a descendant of X. Jena's inference API enables applications to use these entailed triples just as if they had been added to the store explicitly. The inference API provides various rule engines to perform this, using either the built-in rule sets for OWL and RDFS or custom application rules. The inference API can also be connected to an external reasoner to perform inference with specialized reasoning algorithms.

The Jena SPARQL API handles SPARQL, the RDF query language, for both query and update. Jena conforms to all published standards, and tracks the revisions and updates in the under-development areas of the standard.

The Jena Ontology API supports both ontology languages for RDF: RDFS and OWL. Ontologies, which are formal logical descriptions or models of entities and their interactions in some area of real life, are essential to many Semantic Web applications. Ontologies can be shared with other developers and researchers, which makes them a good basis for building linked-data applications.
The API supports methods that know about the richer representation forms available to applications through OWL and RDFS. The Java API enables applications to access all of the above capabilities. Fuseki, a data publishing server, can present and update RDF models over the web using SPARQL and HTTP, which is a common requirement of modern applications. Jena also has other components, such as command-line tools and specialized indexes for text-based lookup. Jena ARQ is the query engine for Jena that supports the SPARQL RDF query language.
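The descendant entailment above could be expressed as a custom rule in the style of Jena's rule syntax, roughly as follows (the ex: prefix and the descendantOf property are hypothetical, and the exact rule-file syntax should be checked against the inference documentation):

```
[descTransitive:
  (?z ex:descendantOf ?y) (?y ex:descendantOf ?x)
  -> (?z ex:descendantOf ?x)]
```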

CHAPTER 3

APPROACH

Existing graph data models can mostly be classified into two categories: data models based on property graphs, and the RDF data model. Property graph data models assign properties to nodes and edges, whereas properties in RDF are represented as supplementary nodes. Most current graph algebra proposals have been defined in the context of RDF, and are often only slight extensions of the relational algebra, since RDF triples can easily be mapped to relations [HG16]. Cyganiak's report [Cyg05] outlines the transformation from SPARQL to relational algebra, an abstract intermediary language used to express and analyze queries. This transformation makes existing work on query planning and optimization available to SPARQL queries and also enables SPARQL support in database-backed RDF stores. Querying in this case would be performed by a relational language on the relational data model representation D(G) of the graph database G instead of on G itself. The disadvantage is that the most popular queries in graph databases, i.e. navigational queries, cannot be expressed easily by relational languages like SQL, where recursion is limited and must be represented by joins, which can be cumbersome and time-consuming [BB13].

There are various possible approaches to executing generalized graph pattern matching queries on graph databases. Regular Path Queries (RPQs) are common in querying graphs, where the user might want to find pairs (x, y) of nodes such that there exists a path from x to y whose sequence of edge labels matches some pattern specified in the query [Woo12]. An RPQ is a regular expression R over the node or edge labels of a graph G, whose result is the set of all acyclic paths in G whose concatenated labels match R. Evaluating regular path queries on graph databases is an NP-hard problem. The expressiveness of RPQs can be illustrated by an example. Assume a graph of researchers, with nodes labeled P (Professor) or T (Student), and directed edges labeled S (Supervised work) or J (Joint work). The query P (J P) (J P)? would find all paths between a professor and direct or indirect co-workers. The query (P S) (P S)+ (P T) would find all paths between a professor and their doctorate descendants. Assume a few more nodes labeled N (Nobel prize) and A (Sigmod award), and edges labeled H (Honored) to connect the researchers to the awards; then the query (P S)+ P H N would find the doctorate predecessors of all Nobel prize winners, and the query (P S)+ P H (N | A) would find the doctorate predecessors of any prize winner [KL12].

Path queries have been studied in detail for XML, where the predominant approach is to use automata. In this approach both the graph and the query are represented as automata, and their intersection automaton is the subgraph specified by the query. To harness this approach the graph needs to be translated into a DFA, which may need exponential construction time and exponential space. Research shows that automata-based regular path query evaluation works well for XML trees, but its space consumption is huge on general graphs.
Moreover, automata-based approaches do not use specific properties of the data graph to decrease query execution time [KL12]. This approach is used by many graph databases, such as Neo4j, but it has limitations: it is not a very flexible solution and requires restrictions to be imposed on the nodes and edges in the query. The methodology has also been extended in some research works [Bon15] to learn path queries, defined by regular expressions, from user examples on graphs.

A typical query Q may want to find students who study both Database Systems and Operating Systems. Such a query is a simple Conjunctive Query (CQ) that returns a set of nodes as the answer [Woo12]. Using the format illustrated in [Woo12], this query can be expressed as

ans(x) <- (x, studies, Database Sys), (x, studies, Operating Sys)
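In SPARQL, the same conjunctive query would be written as a basic graph pattern, roughly as follows (the : prefix and the exact resource names are hypothetical placeholders for the dataset's vocabulary):

```sparql
SELECT ?x
WHERE {
  ?x :studies :Database_Sys .
  ?x :studies :Operating_Sys .
}
```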

Figure 3.1 Data graph to illustrate query types.

where x is a variable and studies, Database Sys and Operating Sys are constants. An RPQ over the same graph could be a query to find pairs of student x and place y such that x lives in y, or x lives in a place that is located in y. This query can be represented using the regular expression livesIn . locatedIn*. Conjunctive Regular Path Queries (CRPQs) are a combination of CQs and RPQs; for example, the above two queries can be combined to form a query that is represented as follows:

ans(x, y) <- (x, studies, Database Sys), (x, studies, Operating Sys), (x, (livesIn . locatedIn*), y)

Support for executing conjunctive queries is available in existing systems, and Jena supports these queries using SPARQL and graph pattern matching. Support for querying paths using regular expressions, and thus for CRPQs, was introduced in SPARQL 1.1 [Woo12]. Generalized graph pattern matching is supported in SPARQL through Property Paths, but their limitation is that the paths themselves are not returned.
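The role of the regular expression in an RPQ can be illustrated with a toy Java check that matches a path's concatenated edge labels against the pattern. This string-matching view deliberately ignores the graph-search aspect that makes real RPQ evaluation hard; the labels are from the example above.

```java
import java.util.regex.Pattern;

// Toy check: does a path's edge-label sequence match an RPQ-style regex?
public class RpqSketch {
    // livesIn . locatedIn* : one livesIn edge, then any number of locatedIn
    static final Pattern LIVES = Pattern.compile("(livesIn)(locatedIn)*");

    static boolean matches(Pattern p, String... edgeLabels) {
        return p.matcher(String.join("", edgeLabels)).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(LIVES, "livesIn"));                       // true
        System.out.println(matches(LIVES, "livesIn", "locatedIn", "locatedIn")); // true
        System.out.println(matches(LIVES, "locatedIn"));                     // false
    }
}
```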

We propose the idea of generalized graph pattern matching as queries where specifying the edges is not required, so that a general query like ?x ??p ?y would find all paths between bindings of nodes x and y. In path queries we would also like to include the paths themselves in the output of the query. In our research, we address these issues.

Figure 3.2 The intuition behind the integration of the path operator is to decompose the path query, use the existing Jena framework to execute graph pattern matching, and use a path algebraic approach to extract paths.

In our approach, we decompose the query into parts. One part, which is analogous to CQs, can be mapped to fixed patterns and easily solved by the existing graph pattern matching mechanism. We use SPARQL to express our queries and the existing Jena framework to generate the query plan for this part of the query. The other part of the query is the graph traversal problem. We use the prefix-solve algorithm [GA13] to find paths between a given pair of source and destination node sets, and also enable support for constrained path querying. For example, Fig. 3.2 illustrates how the query

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x foaf:name "EmployeeA" .
  ?y rdf:type sec:government_watchlist .
}

is decomposed into two parts. The left branch illustrates the triple patterns, which are evaluated using the existing graph pattern matching support for SPARQL. The right branch illustrates the path pattern ?x ??p ?y, for which we implement the path operator that uses a path algebraic algorithm to compute path expressions. The output of the query consists of all paths between the bindings for variables x and y.

In our research we explore how to use the relational query infrastructure supported by the open-source Semantic Web framework Jena, extending it so that the traditional relational execution infrastructure can be used for the fixed conjunctive part of the query, and integrating a new operator into the Jena API to evaluate navigational queries on disk-resident databases. The significant advantage of this is that it uses a path aggregation mechanism rather than path enumeration, i.e. explicitly enumerating paths, which is cumbersome in the case of infinite-length path sets.

3.1 Generic SPARQL Query Engine [MR12]

A generic query engine generally comprises Query Processing, Optimization and Execution steps, as illustrated by the following flowchart, which depicts the internal structure and transformation of the SPARQL query during the evaluation process. The query engine takes the following steps to evaluate a query:

Query Processing: A query written in a high-level language is scanned to identify language tokens, for example keywords, attribute names and relation names; parsed to check the syntax of the query and ensure that it conforms to the grammar rules; and validated to check whether the attribute and relation names are valid. During the query processing step, the parser identifies SPARQL keywords and checks for any syntax errors. The SPARQL query order is also verified; for example, the order of SELECT, WHERE and FILTER operations is checked. The validation step checks the RDF

Figure 3.3 A generic SPARQL query processing, optimization and execution framework.

attributes within the SPARQL query. The query is then converted into a query tree, or its equivalent operator tree, which is the internal tree data structure representation of the query. This tree represents the execution sequence of the operations in the query and can be used for optimization purposes based on the SPARQL algebra. The query can also be represented using a query graph, which is a graph data structure. The query graph or tree is then optimized by the processor and an execution plan is created. The query code generator then generates the code to execute this plan, which is run by the runtime DB processor to generate query results.

Query Optimization: This step selects the most suitable strategy to process a query. Since SPARQL is a declarative language, the query engine needs to choose the most efficient way to evaluate a query. Querying capabilities are provided by all RDF repositories, but some require manual interaction to decrease the execution time of the query. For ontological data based on graph models, user queries are executed without optimizations, but for large ontologies, optimization techniques are important to keep the query execution time reasonable. The main idea of query optimization is reordering or rewriting the query (for example, rewriting FILTER variables) using an approach such as selectivity estimation, which shows how reordering triple patterns based on their selectivity affects query execution performance. An efficient query plan can be obtained by using the query graph model and transformation rules to rewrite a query into a semantically equivalent one.

Query Execution: After converting the query to the QEP (Query Execution Plan), which is the code generated from the optimized operator tree, the query engine evaluates the query using any of the available open-source tools, for example Twinkle and Jena ARQ.
3.2 Extending Jena

The Apache Jena documentation [Doc] describes the mechanisms that can be used to extend and modify query execution within the Jena ARQ query engine. Through these techniques, ARQ can be used to query different graph implementations and to provide different query

evaluation and optimization strategies for specific circumstances. A number of ways are available to extend ARQ to integrate custom code into a query, a few of which are custom filter functions and property functions that allow the addition of application-specific code. ARQ can be extended at the basic graph matching or algebra level for high SPARQL performance, but Jena itself can also be extended by supplying a new implementation of the Graph interface. This can be used to encapsulate specific specialized storage and to wrap non-RDF sources so that they resemble RDF. To extend ARQ at the query execution level, developers need to work with the ARQ source code for specific details and to find code to reuse. Jena provides some examples of how to do this.

ARQ Query Processing

The ARQ query engine performs a sequence of actions on a query: parsing, algebra generation, execution building, high-level optimization, low-level optimization and evaluation. Modifications can take place at any step, whether it is parsing or the conversion from the parse tree to the algebra form, which is a fixed algorithm defined by the SPARQL standard. Extensions may alter the algebra form by transforming it from one algebra expression to another, including introducing new operators.

Parsing: This step turns a query string into a Query object. The class Query represents the abstract syntax tree (AST) for the query and provides methods to create the AST. The Query object also provides methods to serialize the query to a string. The string produced is more or less the original query with the same syntactic elements, without comments, and formatted with whitespace for readability. A Query object is not modified once created, is not modified by query execution, and can therefore be reused. Building a query programmatically is not the preferred way, and the AST is not normally an extension point.

Algebra Generation: To generate the SPARQL algebra expression for the query string, represented as a Query object, ARQ applies the algorithm in the SPARQL specification for translating a SPARQL query string, which includes removing joins involving the identity pattern (the empty graph pattern). Then various transformations, such as the identification of property functions, can be done. For example, the query:

SELECT ?x ?y WHERE { ?x <

is translated to

(project (?x ?y) (bgp (triple ?x <

using SSE syntax to represent the internal data structure for the algebra.

High-Level Optimization and Transformations: ARQ provides a collection of transformations that can be applied to the algebra, for example replacing equality filters with a more efficient graph pattern and an assignment. When ARQ is extended, the query processor for the custom storage layout can choose which optimizations are appropriate and can provide its own algebra transformations. The Transformer class is used to execute transform code; it converts an algebra operation into other algebra operations by applying the transform to each operation in the algebra expression tree. Transform is an interface with one function signature for each operation type, returning a replacement for the operator instance it is called on. Transformations proceed in a bottom-up fashion over the expression tree. Algebra expressions are considered immutable, so any change made in one part of the tree results in a copy of the tree above it, which is automated by the TransformCopy class. Another helper base class is TransformBase, which provides the identity operation for each transform operation. SSE syntax is used to print out operations. Static methods in WriterOp provide output to several output objects such as java.io.OutputStream, and the Java toString method is overridden to provide pretty printing.

Low-Level Optimization and Evaluation: Low-level optimization is the step carried out by the custom storage layer, where the order in which to evaluate basic graph patterns is chosen. It can be done dynamically as part of evaluation. Evaluation of a query

includes the execution of the algebra expression, as modified by any transformations applied, to yield a stream of pattern solutions. ARQ makes extensive use of iterators internally, and evaluates an operation by feeding the stream of results from the previous stage into the evaluation where possible. It commonly takes each intermediate result one at a time: using QueryIterRepeatApply, for each binding it substitutes the variables of the pattern with those in the incoming binding and evaluates to a query iterator of all results for this incoming row. The result may be the empty iterator. It is also possible not to manipulate the incoming stream at all but only pass it on to sub-operations.

Query Engines and Query Engine Factories: All the ARQ query processing steps, from algebra generation to query evaluation, are carried out when a query is executed via QueryExecution.execSelect or another QueryExecution exec operation. Storage-specific operations might be carried out when the query execution is created. A query engine works together with a QueryExecution created by the QueryExecutionFactory to provide the evaluation of a query pattern, while QueryExecutionBase supports the mechanism for the different result types and does not need to be modified by extensions to query execution. ARQ supports three query engine factories: the main query engine factory, one to execute a query remotely, and one for a reference query engine. SDB and TDB register their own query engine factories, extended from the main query engine, during sub-system initialization. After choosing a query engine factory, the create method is called to return a Plan object for the execution, whose principal operation is to get the QueryIterator for the query.

Main Query Engine

The main query engine contains various basic graph pattern matching implementations and works with general-purpose datasets directly.
It evaluates patterns on each graph in turn and includes optimizations for the standard Jena implementation of in-memory graphs. High-level optimizations are implemented by a sequence of transformations, and a custom implementation of a query engine can reuse some or all of these transformations.

Figure 3.4 Phases of ARQ query processing.

The main query engine evaluates expressions as the client consumes each query solution, hence working as a streaming engine. It sets up the execution by creating the initial conditions: a partial solution of one row, either without any bound variables or with any initial variable bindings. After this the main query engine calls the algorithm to execute the query, i.e. QC.execute. If an extension wishes to reuse some of the main query engine by specifying its own OpExecutor, it should call QC.execute to evaluate a sub-operation; this finds the currently active OpExecutor factory, creates an OpExecutor object and invokes it to evaluate one algebra operation. The main query engine can be extended at two points:

Stage generators: To evaluate basic graph patterns while reusing the rest of the engine, provide a custom StageGenerator. The advantage of this option is that it is more self-contained and requires less knowledge of the internal evaluation of the other SPARQL algebra operators.

OpExecutor: This option is for executing any algebra operator specially. A StageGenerator provides matching for a basic graph pattern; it is invoked by the standard OpExecutor to match a basic graph pattern, and the results are used for the rest of the evaluation. An OpExecutor performs each step of evaluation in the main query engine, and a new one is created from a factory at each step. The factory is registered in the execution context.
A specialized OpExecutor can be implemented by inheriting from the standard one and overriding only those algebra operators it wants to deal with, including inspecting the execution and choosing to pass up to the super-class based on the details of the operation.

Custom Query Engine

A custom query engine allows an extension to choose the datasets it wants to handle and to intercept query execution during the setup of the execution, so it can modify the algebra expression, introduce its own algebra extensions, choose which high-level optimizations to apply, and transform the expression into quad form. Execution may proceed with the normal algorithm, a custom OpExecutor, a custom StageGenerator, or a combination of all three extension mechanisms.

Algebra Extensions

New operators can be added to the algebra by making the new operator a sub-class of the OpExt class. To insert the new operator into the expression to be evaluated, a custom query engine is used to intercept evaluation initialization. The eval method is called whenever the evaluation of a query requires the evaluation of a sub-class of OpExt. SDB uses this mechanism to introduce an operator that is implemented in SQL.

Expression Functions and Property Functions

Implementations of additional operations can be provided by filter functions, which help to filter bindings for variables using specified conditions. After implementing a custom filter function, it needs to be installed in the function registry, which is a mapping from URI to a factory class for functions. Property functions, on the other hand, are used to match triples without using the usual graph matching, but by executing code determined by the property URI. Property functions can be used for inferencing and rule processing.

In our research we use the algebra extension mechanism to implement the path operator. After an analysis of the available extension methods, we conclude that extending the algebra is the most apt approach to extending the ARQ query engine to include path processing, because it enables the new operator to use the existing framework to perform pattern matching while implementing its own algorithm to find paths. Whereas custom filter functions and property functions would only allow filtering bindings based on a condition or a property URI, and extending the main query engine or implementing a custom query engine would mean redoing what has already been implemented and tested, extending the algebra gives us the ability to reuse the query engine and its support for graph pattern matching, and to build on it to provide enhanced querying capability.
We have used the OpExt class as the base class to implement a new operator called OpFindAllPaths. The eval method of the new class executes the operation to find all paths, after receiving and processing the results of the operators below it in the query plan tree.
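Conceptually, the role of such an extension operator can be shown with a plain-Java toy: an operator that sits on top of the plan tree and post-processes the bindings produced by the subtree. This is not the Jena OpExt API; the types, the variable names ?x, ?y and ??p, and the stand-in path function are all invented for illustration.

```java
import java.util.*;
import java.util.function.BiFunction;

// Toy operator tree: each Op turns a stream (here, a list) of solutions
// into another stream of solutions.
public class PathOpSketch {
    interface Op { List<Map<String, String>> eval(); }

    // Leaf producing fixed bindings, standing in for BGP matching.
    record Bgp(List<Map<String, String>> rows) implements Op {
        public List<Map<String, String>> eval() { return rows; }
    }

    // Custom top operator: consumes the sub-plan's results, then attaches
    // a path computed between the ?x / ?y bindings (faked by a function).
    record FindAllPaths(Op sub, BiFunction<String, String, String> paths) implements Op {
        public List<Map<String, String>> eval() {
            List<Map<String, String>> out = new ArrayList<>();
            for (Map<String, String> row : sub.eval()) {
                Map<String, String> r = new HashMap<>(row);
                r.put("??p", paths.apply(row.get("?x"), row.get("?y")));
                out.add(r);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Op plan = new FindAllPaths(
            new Bgp(List.of(Map.of("?x", "A", "?y", "B"))),
            (s, d) -> s + "->" + d);       // stand-in for real path extraction
        System.out.println(plan.eval());
    }
}
```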

3.3 Query Plan Generation

As mentioned in the sections above, the query engine compiles the query to an equivalent operator tree or query plan, which is then optimized and evaluated against the database by the query engine. For example, the SPARQL query

PREFIX foaf: <
PREFIX rdf: <

SELECT ?name
WHERE {
  ?person rdf:type foaf:Person .
  ?person foaf:name ?name
}

would be expressed equivalently by the query plan in Fig. 3.5, using the notation explained in [MR12]. The query plan denotes the order of operations during the execution phase of the query, where the operations are carried out in a bottom-up fashion. To introduce a new path operator into this query tree, we need to study where to insert the operator in the query plan tree so that the execution of the other operators is not affected and our operator is also able to use the results generated by the operators below it. In our study we find that the generalized path pattern matching operator (OpFindAllPaths) can be inserted at the top of the query plan tree for specific types of queries, and the extended algebra can then be used to execute the query.

For a given Generalized Graph Pattern GP, where GP is composed of a graph pattern G and a set of path variables P, we can compile GP by decomposing it: compiling G using the relational graph pattern matching approach, and extending the output plan by inserting the path operator expression on top of the operator tree. This approach utilizes the query compilation done by Jena for the SPARQL query, and includes the path operator in the generated algebra for the operator tree. It also does not adversely affect the algebra generated for the query before the path operator is appended. The approach is valid for queries with single path expressions or single path variables.
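Using the SSE notation introduced in Section 3.2, the name query above corresponds roughly to an algebra expression of this shape (prefixes elided, and the exact printed form may differ between ARQ versions):

```
(project (?name)
  (bgp
    (triple ?person rdf:type foaf:Person)
    (triple ?person foaf:name ?name)))
```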
For the given query, which asks whether EmployeeA is connected in any way to the government watchlist, the query would ideally look like:

Figure 3.5 Query plan for a simple example query, without the path operator integrated into the query engine.

SELECT ??paths
WHERE {
  ?x ??paths ?y .
  ?x foaf:name "EmployeeA" .
  ?y rdf:type sec:government_watchlist .
}

The operator tree for the query would look like Fig. 3.6.

Figure 3.6 Query plan for the example query after integrating the path operator into the query engine.

Another example query, which asks how members of the Smith family are related to members of the Doyle family using the "foaf" ontology, would be written as:

SELECT ??paths
WHERE {
  ?x ??paths ?y .
  ?x foaf:family_name "Smith" .
  ?y foaf:family_name "Doyle" .
}

The operator tree would look like Fig. 3.7.

3.4 Architecture of Extended Jena

The Jena API, consisting of the SPARQL query engine, has been modified to support graph extraction queries. The query parser or translator component takes the query as input and parses it into the equivalent algebra expression. If the query contains a path variable and the path operator, the equivalent algebra consists of the traditional SPARQL part, i.e. triple patterns, and the path query part, i.e. path patterns. The query execution engine contains the component that performs traditional graph pattern matching. The execution engine has been extended to include the path pattern matching component, which uses the results generated by triple pattern matching and generates the list of valid paths as output. The architecture uses Berkeley DB, an open-source, high-performance embedded database for key-value data, to store and read the path sequence information used to extract paths. This database is used by the path pattern matching component of the execution engine. The database engine component in the architecture is responsible for handling and reading the input databases.

Figure 3.7 Query plan for the example query after integrating the path operator into the query engine.

Figure 3.8 The architecture of the Path-Extraction Enhanced Jena API.

3.5 Integrating the Path Operator

Path Operator OpFindAllPaths

To be able to perform generalized graph pattern matching using SPARQL on RDF datasets, we use the Apache Jena ARQ query engine, build upon the existing query framework, and integrate the path operator OpFindAllPaths. The Jena API contains the packages that implement the ARQ query engine; we extend the package com.hp.hpl.jena.sparql.algebra.op to integrate the path operator. The compiler uses operators to create the query plan, which describes how the query engine executes the query: how the results specified in the query are obtained and how the operations specified in the query are executed. Operators are of two types:

Physical Operators: These contain the implementation of the operations specified by the logical operators. Physical operators are methods or routines to execute operations such as aggregation, calculation, joins and data integrity checks, or to access data. Every physical operator has a cost associated with it, based on which the query optimizer chooses the most appropriate physical operator for a logical operator. A physical operator usually uses three methods: init(), getNext() and close(). The init() method initializes the operator and any required data structures. The getNext() method may be called iteratively to get the next item in the data stream on which the operator works. The close() method releases data structures that are no longer required and performs clean-up operations.

Logical Operators: These represent the relational algebraic operations that conceptually describe the action to be performed, for example select (σ) or project (π). The query plan or operator tree is the data structure consisting of logical operators, created by the compiler. The optimizer then chooses the most appropriate physical operator for each logical operator based on its cost.
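The init()/getNext()/close() iterator model described above can be sketched in plain Java. This is a toy scan operator over an in-memory list; the class and method names are invented for illustration.

```java
import java.util.*;

// Toy physical operator following the init()/getNext()/close() iterator model.
public class ScanOp {
    private final List<String> data;
    private Iterator<String> it;

    public ScanOp(List<String> data) { this.data = data; }

    public void init() { it = data.iterator(); }        // set up operator state
    public String getNext() {                           // pull one row, null at end
        return it.hasNext() ? it.next() : null;
    }
    public void close() { it = null; }                  // release resources

    public static void main(String[] args) {
        ScanOp op = new ScanOp(List.of("row1", "row2"));
        op.init();
        for (String r = op.getNext(); r != null; r = op.getNext())
            System.out.println(r);
        op.close();
    }
}
```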
A logical operator may have more than one physical operator [Opr]. In our research, we could have extended the SPARQL grammar to include path variables and path patterns as proposed by SPARQ2L [Any07], but the adoption of new SPARQL

grammar by the W3C is a lengthy process, and so we focus on the system-level implementation of the path operator. The path operator OpFindAllPaths uses the conceptual SOLVE algorithm [Tar81], implemented in [GA13] as the prefix-solve algorithm for disk-resident graphs, to implement the physical operator for the path operator functionality. The physical operator does not use the iterator model with the init(), getNext() and close() methods. Instead, it waits for the subquery results in order to obtain the set of source and destination node pairs before beginning the getAllPaths computation described below. The operator takes source and destination node pairs, and constraints, as input, and returns the list of valid paths as output. The ARQ query compiler is used to create a query plan with the path operator by extending the compilation process. The path operator is inserted at the top of the query plan tree, as explained in the previous section.

Integration of the Path Operator

Using the algebra extension method (Section 3.2.4), the path operator OpFindAllPaths has been implemented as a subclass of OpExt, which is used to extend the Jena ARQ library [Doc] by implementing the new operator functionality in the class's eval method. The eval method in class OpFindAllPaths makes a function call to the prefix-solve algorithm [GA13], which has been implemented in the getAllPaths method (a member of class QueryProcessor). During this function call, the eval method also sends as arguments the list of source and destination nodes, the set of labels in the case of a constrained path query, the metadata location, and the input data file used. Returned is the set of valid paths between the source and destination nodes. The class ExtendedModel has been implemented to provide an abstraction layer for users executing the SPARQL query.
The class ExtendedModel uses global variables: myGraph to store the preprocessed path sequence information, model to store the Model instance for the input data file, inputFileName to store the location of the input data file, and metadataLocation to store the location of the metadata created for the input data file. The class implements the method loadModel, which takes two arguments: the file location and a boolean value val. val=false indicates that the user wishes to load a traditional Model for the input file, which is sufficient if the user does not wish to perform any path queries; a ModelFactory instance is created for the data file and stored in memory, which is needed to execute the query. val=true indicates that the user wishes to perform path queries, in which case additional pre-computation is done and path sequences are created for the data file. The path sequences are generated by a call to the method newGraph, and are stored in a Berkeley DB database. In the case of path extraction queries, loadModel is performed separately because the preprocessing is not required every time a new query needs to be executed: for one data file, preprocessing is required only once, and the time taken to load the model increases with the size of the dataset. ExtendedModel also implements a method executeQuery that takes two arguments: the query string and a boolean value. The boolean value true indicates that the query is a generalized path query, in which case the system makes sure that preprocessed data is available and then executes the query. If the boolean value is false, the query string is executed just like a traditional SPARQL query. The Java code for the significant source files is given in Appendix A.

User Workflow

Fig. 3.9 demonstrates how the extended Jena API can be used to execute SPARQL queries with or without path computations. An ExtendedModel object can be created as an abstraction for the data model. It is responsible for keeping track of the data file, the in-memory data model and the preprocessed path sequence information. To load the model, the user calls loadModel and specifies, via a boolean argument, whether the application needs support for path extraction queries. Any number of queries can then be executed using executeQuery, supplying the query string and a boolean value to indicate whether the execution needs support for path queries.
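The workflow above can be condensed into a minimal facade sketch. The method names (loadModel, executeQuery) and their boolean flags follow the thesis; the bodies are hypothetical placeholders, since the real class wraps a Jena Model and a Berkeley DB store of preprocessed path sequences.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the ExtendedModel facade. Only the calling contract is
// real; the internals are placeholders for the Jena/Berkeley DB machinery.
class ExtendedModelSketch {
    private String inputFileName;        // location of the input data file
    private boolean pathSupport;         // were path sequences precomputed?
    private Map<String, Object> myGraph; // stand-in for the path-sequence store

    /** val=true triggers the one-time path-sequence preprocessing (newGraph). */
    public void loadModel(String fileName, boolean val) {
        this.inputFileName = fileName;
        this.pathSupport = val;
        if (val) this.myGraph = new HashMap<>(); // placeholder for newGraph()
    }

    /** isPathQuery=true requires that preprocessing has already happened. */
    public String executeQuery(String queryString, boolean isPathQuery) {
        if (isPathQuery && !pathSupport)
            throw new IllegalStateException("call loadModel(file, true) first");
        return isPathQuery ? "path-query results" : "plain SPARQL results";
    }

    public static void main(String[] args) {
        ExtendedModelSketch model = new ExtendedModelSketch();
        model.loadModel("data.nt", true);  // preprocess once per data file
        System.out.println(model.executeQuery("SELECT ...", true));
    }
}
```

Note the asymmetry the thesis describes: preprocessing happens once per data file in loadModel, while executeQuery may then be called any number of times.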
3.6 Path Query Syntax

The path pattern matching queries supported by the extended ARQ query engine, after the integration of the path operator, evaluate the path expressions between the given sets of source and destination node pairs. Since our research does not integrate the operator into the SPARQL grammar, we use a different notation to express path matching queries. As

Figure 3.9 The flowchart shows how the Jena API can be used to execute SPARQL queries with or without path computation, by using an object of class ExtendedModel and by specifying boolean arguments that reflect the user's choice of whether to compute paths or not.

specified above, the user can use the query string with a true boolean value when calling the executeQuery method, to ensure that the query is executed with the path operator notation. A few example queries, as listed in Chapter 2, would be written in the following format. The query

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x foaf:name "EmployeeA" .
  ?y rdf:type sec:government_watchlist .
}

would translate to

SELECT ?x ?y
WHERE {
  ?x foaf:name "EmployeeA" .
  ?y rdf:type sec:government_watchlist .
}

In this query, the path operator is appended at the top of the operator tree; bindings for ?x are the source nodes and bindings for ?y are the destination nodes for the prefix-solve algorithm to find paths between. Constrained path queries are supported with the following constraint arguments:

All Paths: If no argument is provided at the end of the query, the path operator finds path expressions between the set of source and destination node bindings.

Contains Any: If the query contains the argument #containsany ?z at the end, the path operator prunes the results and outputs only those paths that contain any of the bindings for ?z as edge labels in the paths.

Contains None: If the query contains the argument #containsnone ?z at the end, the path operator prunes the results and outputs only those paths that do not contain any of the bindings for ?z as edge labels in the paths.

Contains All: If the query contains the argument #containsall ?z at the end, the path operator prunes the results and outputs only those paths that contain all of the bindings for ?z as edge labels in the paths.

For example, the query
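The three pruning modes above amount to label-set predicates over a path's edge labels. The sketch below is a hedged illustration (the class name and the representation of ?z's bindings as a plain set of strings are ours), not the operator's actual pruning code.

```java
import java.util.*;

// The containsAny / containsNone / containsAll pruning modes, applied to a
// path represented as a list of edge labels, with ?z's bindings as a set.
class PathConstraints {
    /** True if the path uses at least one of the given labels. */
    static boolean containsAny(List<String> path, Set<String> labels) {
        for (String l : path) if (labels.contains(l)) return true;
        return false;
    }

    /** True if the path uses none of the given labels. */
    static boolean containsNone(List<String> path, Set<String> labels) {
        return !containsAny(path, labels);
    }

    /** True if the path uses every one of the given labels. */
    static boolean containsAll(List<String> path, Set<String> labels) {
        return new HashSet<>(path).containsAll(labels);
    }

    public static void main(String[] args) {
        List<String> path = List.of("interactsWith", "activates", "inhibits");
        Set<String> z = Set.of("activates", "binds");
        System.out.println(containsAny(path, z));  // true
        System.out.println(containsNone(path, z)); // false
        System.out.println(containsAll(path, z));  // false ("binds" is missing)
    }
}
```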

SELECT ??p
WHERE {
  ?x ??p ?x .
  ?z compound:name "Methionine" .
  PathFilter(containsAny(??p, ?z))
}

would be translated as

SELECT ?x
WHERE {
  ?z compound:name "Methionine" .
}
#containsany ?z

In the above case, where only one variable is listed in the SELECT clause, the path operator automatically uses the set of bindings for that single variable as the set of source as well as destination nodes. The query

SELECT ??p
WHERE {
  ?x ??p ?y .
  ?x bio:name "MTB Surface Molecule" .
  ?y rdf:type bio:cellular_response_event .
  ?z rdf:type bio:pi3k_enzyme .
  PathFilter(containsAny(??p, ?z))
}

would be written as

SELECT ?x ?y
WHERE {
  ?x bio:name "MTB Surface Molecule" .
  ?y rdf:type bio:cellular_response_event .
  ?z rdf:type bio:pi3k_enzyme .
}
#containsany ?z
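Since the constraint is appended to the query string rather than parsed by the SPARQL grammar, the engine has to recover it from the trailing text. The parser below is hypothetical and deliberately simplistic (it assumes the constraint's '#' is the last '#' in the string); the thesis does not specify how the suffix is tokenized.

```java
// Hypothetical parser for the constraint suffix appended to the query
// string, e.g. "#containsany ?z". Names and tokenization are ours.
class ConstraintSuffixParser {
    record Constraint(String kind, String variable) {}

    /** Returns the trailing constraint, or null when the query asks for all paths. */
    static Constraint parse(String query) {
        int hash = query.lastIndexOf('#');
        if (hash < 0) return null;                       // All Paths: no suffix
        String[] parts = query.substring(hash + 1).trim().split("\\s+");
        if (parts.length != 2 || !parts[1].startsWith("?")) return null;
        return new Constraint(parts[0], parts[1]);       // e.g. ("containsany", "?z")
    }

    public static void main(String[] args) {
        Constraint c = parse("SELECT ?x WHERE { ?z compound:name \"Methionine\" . } #containsany ?z");
        System.out.println(c.kind() + " " + c.variable()); // containsany ?z
    }
}
```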

3.7 Related Work

There has been some research on the generalized path querying paradigm and on providing navigational querying capability. [Sea04], [Sou], [Mat05], [Det08] and [Spa] focus on path pattern matching queries, where path patterns are regular expressions over the edge labels in the data graph. A path pattern defines the structure of the path, and such a query helps to find the pairs of nodes that are connected by the specified path pattern. [PZ11], [Fio12b], [Alk09], [GN11] and [Fio12a] focus on path extraction queries that return paths, most of which return shortest paths. These systems either use a join-based approach, which is not very expressive and is cumbersome, or use the navigational approach, which is not very efficient for large disk-resident graphs. Most of them also do not deal with finding paths between multiple sources and multiple destinations. [Fan11], [Jin10], [Atr] and [Lib13] focus on reachability queries, which ask whether a path exists between a given pair of nodes whose edges satisfy given constraints; in this case, however, no paths are returned. Tools like Neo4j [Neo] ([Jin10], [Atr], [Zho11]) use DFS, BFS, or bidirectional search algorithms to find paths, but the downside is that they are not very scalable for large disk-resident graphs. In our research we focus on the system-level integration of the prefix-solve algorithm as the path operator that returns all qualifying paths between the given sets of source and destination node pairs.

CHAPTER 4

EVALUATION

In our study we conducted experiments concerning a few important aspects of integrating a new path operator to perform generalized graph pattern matching query execution using the existing Jena ARQ query engine. We tested the usability of the queries by keeping a real-life research scenario in perspective, using a research test case provided by RENCI (Renaissance Computing Institute). We also tested query compilation times and observed that the enhanced queries take the same order of time to compile as the equivalent SPARQL queries that do not integrate a path operator. We also measured the query execution times for various datasets. The experiments were performed using the Eclipse (Luna 4.4.0) IDE on a Vaio E laptop with 4 GB RAM, an Intel Core i5-2410M processor at 2.30 GHz and the Windows 7 (64-bit) operating system. The Apache Jena version was used to extend the ARQ query engine.

4.1 Query Compilation Time

In this set of experiments we measure the time taken to compile the queries that use path variables and the OpFindAllPaths operator, and compare it to similar queries that do not use any path operator. This is the time to generate the query plan. The queries used and their datasets are listed at the end of this chapter.

Table 4.1 Compile times of queries with and without the path operator (columns: Query; Compile Time with Path Operator, ms; Compile Time without Path Operator, ms)

Figure 4.1 Graph illustrating the comparison of compile time for queries with and without path computations.

Observing the results of the experiment, we can say that the integration of the path operator does not induce an overhead on the query compilation process: the compile time of a query is not affected by the integration of the new operator in the Jena ARQ framework. The compile time is roughly similar across the queries; we surmise that this is due to similarities in the number of triple patterns involved in the queries and in the structure of the queries.

4.2 Query Execution Time

In this set of experiments we measure the time taken to execute the queries that use path variables and the OpFindAllPaths operator, and compare it to similar queries that do not use any path operator. We use the same set of queries as in the previous experiment. In this experiment we are essentially estimating the execution time of the path operator. Even though the aim of our thesis is not to optimize the execution time of the path operator, this experiment does illustrate the comparison between the execution times of queries using

the path operator and queries without any path variables or path computations. Observing the graph of the computed execution times for queries with and without the integrated path operator, we see that the execution time for the queries that do not perform path computations is roughly similar; we surmise that this is due to similarities in the number of triple patterns involved in the queries and in the structure of the queries. The difference in execution time between queries with and without the path operation roughly indicates the time taken by the path operator execution.

Table 4.2 Execution times of queries with and without the path operator (columns: Query; Execution Time with Path Operator, ms; Execution Time without Path Operator, ms)

Figure 4.2 Graph illustrating the comparison of execution time for queries with and without path computations.

4.3 Usability

In our research, we consulted RENCI (Renaissance Computing Institute) about their requirements for an ongoing project. The project is a network provisioning system with which users can provision their own network of nodes, scattered across different parts of the world and having different specifications (RAM, bandwidth, number of neighbors, memory, etc.), using queries. The queries specify the needs of the users and enable the provisioning of the desired system. Experimenting with such a use case, we want to study whether our integrated framework can be used in a real-life scenario to serve the purpose of finding path bindings. Assume the physical system is represented by a graph G(V, E), where the set of nodes or vertices is V = {V1, V2, ..., VN}, the set of edges or links between them is E = {E1, E2, ..., EM}, and each node Vi and each edge Ei has an associated capacity. Three types of basic embedding requests are as follows: Bounded path query: to find path(A, B, c), where A = V1, B = V2, and c is the capacity of
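Under the assumption that a bounded path request path(A, B, c) asks for a route whose every link meets the capacity bound c, it can be sketched as a capacity-filtered breadth-first search. The graph representation and algorithm below are illustrative only, not RENCI's system or the thesis's evaluation code.

```java
import java.util.*;

// Sketch of path(A, B, c): find one route from A to B using only links with
// capacity >= c. Representation and BFS are ours, for illustration.
class BoundedPath {
    record Edge(String to, int capacity) {}
    private final Map<String, List<Edge>> adj = new HashMap<>();

    void addLink(String a, String b, int capacity) {
        adj.computeIfAbsent(a, k -> new ArrayList<>()).add(new Edge(b, capacity));
    }

    /** Nodes along one path from a to b over links with capacity >= c, or null. */
    List<String> path(String a, String b, int c) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>(List.of(a));
        parent.put(a, a);
        while (!queue.isEmpty()) {
            String n = queue.poll();
            if (n.equals(b)) break;
            for (Edge e : adj.getOrDefault(n, List.of()))
                if (e.capacity() >= c && parent.putIfAbsent(e.to(), n) == null)
                    queue.add(e.to());
        }
        if (!parent.containsKey(b)) return null;      // no qualifying route
        LinkedList<String> out = new LinkedList<>();
        for (String n = b; ; n = parent.get(n)) {     // walk parents back to a
            out.addFirst(n);
            if (n.equals(a)) break;
        }
        return out;
    }

    public static void main(String[] args) {
        BoundedPath net = new BoundedPath();
        net.addLink("V1", "V3", 10);
        net.addLink("V3", "V2", 10);
        net.addLink("V1", "V2", 1);  // direct link, but too small for c=5
        System.out.println(net.path("V1", "V2", 5)); // [V1, V3, V2]
    }
}
```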


What is a multi-model database and why use it? What is a multi-model database and why use it? An When it comes to choosing the right technology for a new project, ongoing development or a full system upgrade, it can often be challenging to define the

More information

a paradigm for the Introduction to Semantic Web Semantic Web Angelica Lo Duca IIT-CNR Linked Open Data:

a paradigm for the Introduction to Semantic Web Semantic Web Angelica Lo Duca IIT-CNR Linked Open Data: Introduction to Semantic Web Angelica Lo Duca IIT-CNR angelica.loduca@iit.cnr.it Linked Open Data: a paradigm for the Semantic Web Course Outline Introduction to SW Give a structure to data (RDF Data Model)

More information

Day 2. RISIS Linked Data Course

Day 2. RISIS Linked Data Course Day 2 RISIS Linked Data Course Overview of the Course: Friday 9:00-9:15 Coffee 9:15-9:45 Introduction & Reflection 10:30-11:30 SPARQL Query Language 11:30-11:45 Coffee 11:45-12:30 SPARQL Hands-on 12:30-13:30

More information

XML: Extensible Markup Language

XML: Extensible Markup Language XML: Extensible Markup Language CSC 375, Fall 2015 XML is a classic political compromise: it balances the needs of man and machine by being equally unreadable to both. Matthew Might Slides slightly modified

More information

FusionDB: Conflict Management System for Small-Science Databases

FusionDB: Conflict Management System for Small-Science Databases Project Number: MYE005 FusionDB: Conflict Management System for Small-Science Databases A Major Qualifying Project submitted to the faculty of Worcester Polytechnic Institute in partial fulfillment of

More information

The Semantic Web. What is the Semantic Web?

The Semantic Web. What is the Semantic Web? The Semantic Web Alun Preece Computing Science, University of Aberdeen (from autumn 2007: School of Computer Science, Cardiff University) What is the Semantic Web, and why do we need it now? How does the

More information

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos Instituto Politécnico de Tomar Introduction to Big Data NoSQL Databases Ricardo Campos Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2016 Part of the slides used in

More information

WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG

WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES Christian de Sainte Marie ILOG Introduction We are interested in the topic of communicating policy decisions to other parties, and, more generally,

More information

Domain Specific Semantic Web Search Engine

Domain Specific Semantic Web Search Engine Domain Specific Semantic Web Search Engine KONIDENA KRUPA MANI BALA 1, MADDUKURI SUSMITHA 2, GARRE SOWMYA 3, GARIKIPATI SIRISHA 4, PUPPALA POTHU RAJU 5 1,2,3,4 B.Tech, Computer Science, Vasireddy Venkatadri

More information

New Approach to Graph Databases

New Approach to Graph Databases Paper PP05 New Approach to Graph Databases Anna Berg, Capish, Malmö, Sweden Henrik Drews, Capish, Malmö, Sweden Catharina Dahlbo, Capish, Malmö, Sweden ABSTRACT Graph databases have, during the past few

More information

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions)

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions) By the end of this course, students should CIS 1.5 Course Objectives a. Understand the concept of a program (i.e., a computer following a series of instructions) b. Understand the concept of a variable

More information

Metadata Standards and Applications. 4. Metadata Syntaxes and Containers

Metadata Standards and Applications. 4. Metadata Syntaxes and Containers Metadata Standards and Applications 4. Metadata Syntaxes and Containers Goals of Session Understand the origin of and differences between the various syntaxes used for encoding information, including HTML,

More information

THE GETTY VOCABULARIES TECHNICAL UPDATE

THE GETTY VOCABULARIES TECHNICAL UPDATE AAT TGN ULAN CONA THE GETTY VOCABULARIES TECHNICAL UPDATE International Working Group Meetings January 7-10, 2013 Joan Cobb Gregg Garcia Information Technology Services J. Paul Getty Trust International

More information

ITARC Stockholm Olle Olsson World Wide Web Consortium (W3C) Swedish Institute of Computer Science (SICS)

ITARC Stockholm Olle Olsson World Wide Web Consortium (W3C) Swedish Institute of Computer Science (SICS) 2 ITARC 2010 Stockholm 100420 Olle Olsson World Wide Web Consortium (W3C) Swedish Institute of Computer Science (SICS) 3 Contents Trends in information / data Critical factors... growing importance Needs

More information

ITARC Stockholm Olle Olsson World Wide Web Consortium (W3C) Swedish Institute of Computer Science (SICS)

ITARC Stockholm Olle Olsson World Wide Web Consortium (W3C) Swedish Institute of Computer Science (SICS) 2 ITARC 2010 Stockholm 100420 Olle Olsson World Wide Web Consortium (W3C) Swedish Institute of Computer Science (SICS) 3 Contents Trends in information / data Critical factors... growing importance Needs

More information

OLAP over Federated RDF Sources

OLAP over Federated RDF Sources OLAP over Federated RDF Sources DILSHOD IBRAGIMOV, KATJA HOSE, TORBEN BACH PEDERSEN, ESTEBAN ZIMÁNYI. Outline o Intro and Objectives o Brief Intro to Technologies o Our Approach and Progress o Future Work

More information

Introduction to Web Services & SOA

Introduction to Web Services & SOA References: Web Services, A Technical Introduction, Deitel & Deitel Building Scalable and High Performance Java Web Applications, Barish Service-Oriented Programming (SOP) SOP A programming paradigm that

More information

Chapter 13: Advanced topic 3 Web 3.0

Chapter 13: Advanced topic 3 Web 3.0 Chapter 13: Advanced topic 3 Web 3.0 Contents Web 3.0 Metadata RDF SPARQL OWL Web 3.0 Web 1.0 Website publish information, user read it Ex: Web 2.0 User create content: post information, modify, delete

More information

A Developer s Guide to the Semantic Web

A Developer s Guide to the Semantic Web A Developer s Guide to the Semantic Web von Liyang Yu 1. Auflage Springer 2011 Verlag C.H. Beck im Internet: www.beck.de ISBN 978 3 642 15969 5 schnell und portofrei erhältlich bei beck-shop.de DIE FACHBUCHHANDLUNG

More information

Graph Databases. Guilherme Fetter Damasio. University of Ontario Institute of Technology and IBM Centre for Advanced Studies IBM Corporation

Graph Databases. Guilherme Fetter Damasio. University of Ontario Institute of Technology and IBM Centre for Advanced Studies IBM Corporation Graph Databases Guilherme Fetter Damasio University of Ontario Institute of Technology and IBM Centre for Advanced Studies Outline Introduction Relational Database Graph Database Our Research 2 Introduction

More information

DBMS (FYCS) Unit - 1. A database management system stores data in such a way that it becomes easier to retrieve, manipulate, and produce information.

DBMS (FYCS) Unit - 1. A database management system stores data in such a way that it becomes easier to retrieve, manipulate, and produce information. Prof- Neeta Bonde DBMS (FYCS) Unit - 1 DBMS: - Database is a collection of related data and data is a collection of facts and figures that can be processed to produce information. Mostly data represents

More information

Welcome to INFO216: Advanced Modelling

Welcome to INFO216: Advanced Modelling Welcome to INFO216: Advanced Modelling Theme, spring 2017: Modelling and Programming the Web of Data Andreas L. Opdahl About me Background: siv.ing (1988), dr.ing (1992) from NTH/NTNU

More information

SIR C R REDDY COLLEGE OF ENGINEERING

SIR C R REDDY COLLEGE OF ENGINEERING SIR C R REDDY COLLEGE OF ENGINEERING DEPARTMENT OF INFORMATION TECHNOLOGY Course Outcomes II YEAR 1 st SEMESTER Subject: Data Structures (CSE 2.1.1) 1. Describe how arrays, records, linked structures,

More information

SPAR-QL. Mario Arrigoni Neri

SPAR-QL. Mario Arrigoni Neri SPAR-QL Mario Arrigoni Neri 1 Introduction 2 SPARQL = SPARQL Protocol and RDF Query Language SPARQL - query language to manipulate information in RDF graphs. It provides support to: extract information

More information

BUILDING THE SEMANTIC WEB

BUILDING THE SEMANTIC WEB BUILDING THE SEMANTIC WEB You might have come across the term Semantic Web Applications often, during talks about the future of Web apps. Check out what this is all about There are two aspects to the possible

More information

RESOURCES DESCRIPTION FRAMEWORK: RDF

RESOURCES DESCRIPTION FRAMEWORK: RDF 1 RESOURCES DESCRIPTION FRAMEWORK: RDF Hala Skaf-Molli Associate Professor Nantes University Hala.Skaf@univ-nantes.fr http://pagesperso.lina.univ-nantes.fr/~skaf-h Linked Data Stack (Semantic Web Cake)

More information

The Semantic Web Revisited. Nigel Shadbolt Tim Berners-Lee Wendy Hall

The Semantic Web Revisited. Nigel Shadbolt Tim Berners-Lee Wendy Hall The Semantic Web Revisited Nigel Shadbolt Tim Berners-Lee Wendy Hall Today sweb It is designed for human consumption Information retrieval is mainly supported by keyword-based search engines Some problems

More information

Flat triples approach to RDF graphs in JSON

Flat triples approach to RDF graphs in JSON Flat triples approach to RDF graphs in JSON Dominik Tomaszuk Institute of Computer Science, University of Bialystok, Poland Abstract. This paper describes a syntax that can be used to write Resource Description

More information

Semantic Web Tools. Federico Chesani 18 Febbraio 2010

Semantic Web Tools. Federico Chesani 18 Febbraio 2010 Semantic Web Tools Federico Chesani 18 Febbraio 2010 Outline A unique way for identifying concepts How to uniquely identified concepts? -> by means of a name system... SW exploits an already available

More information

From Online Community Data to RDF

From Online Community Data to RDF From Online Community Data to RDF Abstract Uldis Bojārs, John G. Breslin [uldis.bojars,john.breslin]@deri.org Digital Enterprise Research Institute National University of Ireland, Galway Galway, Ireland

More information

KawaWiki: A Semantic Wiki Based on RDF Templates

KawaWiki: A Semantic Wiki Based on RDF Templates Kawa: A Semantic Based on RDF s Kensaku Kawamoto, Yasuhiko Kitamura, and Yuri Tijerino Kwansei Gakuin University 2-1 Gakuen, Sanda-shi, Hyogo 669-1337, JAPAN {kkensaku, ykitamura}@ksc.kwansei.ac.jp, yuri@tijerino.net

More information

Semantic Web Technologies

Semantic Web Technologies 1/57 Introduction and RDF Jos de Bruijn debruijn@inf.unibz.it KRDB Research Group Free University of Bolzano, Italy 3 October 2007 2/57 Outline Organization Semantic Web Limitations of the Web Machine-processable

More information

KNOWLEDGE GRAPHS. Lecture 4: Introduction to SPARQL. TU Dresden, 6th Nov Markus Krötzsch Knowledge-Based Systems

KNOWLEDGE GRAPHS. Lecture 4: Introduction to SPARQL. TU Dresden, 6th Nov Markus Krötzsch Knowledge-Based Systems KNOWLEDGE GRAPHS Lecture 4: Introduction to SPARQL Markus Krötzsch Knowledge-Based Systems TU Dresden, 6th Nov 2018 Review We can use reification to encode complex structures in RDF graphs: Film Actor

More information

SPARQL เอกสารหล ก ใน มคอ.3

SPARQL เอกสารหล ก ใน มคอ.3 SPARQL SLIDES REFERENCE: SEMANTIC WEB PRIMER BOOK เอกสารหล ก ใน มคอ.3 Why an RDF Query Language? Different XML Representations XML at a lower level of abstraction than RDF There are various ways of syntactically

More information

[MS-PICSL]: Internet Explorer PICS Label Distribution and Syntax Standards Support Document

[MS-PICSL]: Internet Explorer PICS Label Distribution and Syntax Standards Support Document [MS-PICSL]: Internet Explorer PICS Label Distribution and Syntax Standards Support Document Intellectual Property Rights Notice for Open Specifications Documentation Technical Documentation. Microsoft

More information