Comparing path-based and vertically-partitioned RDF databases


Preetha Lakshmi & Chris Mueller
CSCI 8715
11/4/2007

Abstract

Given the increasing prevalence of RDF data formats for storing and sharing data on the Semantic Web, efficient storage mechanisms for RDF data are becoming increasingly important. We survey existing storage solutions for RDF data in an RDBMS. Two recent and novel storage concepts open the door for significantly better querying efficiency. The first, proposed by Matono et al. (2005), models RDF data as a graph, then stores materialized path expressions for efficient querying. The second, proposed by Abadi et al. (2007), stores RDF triples in a vertically-partitioned column-oriented database, in which each RDF property is stored in a single table. Our objective is to compare these two storage methods to find their relative strengths and weaknesses on a wide range of possible queries.

Related Work

The simplest way to store RDF data is in a triple store, essentially one large table with three columns for subject, predicate, and object. Variations on the triple store have improved efficiency and reduced the number of self-joins needed when issuing complex queries.

Data Normalization

A normalized triple store attempts to improve the efficiency of the triple store. The basic table schema consists of a Statements table, which stores RDF triples in three columns, together with a Literals table and a Resources table. In RDF, literals refer to literal values, such as strings or integers, and Resources refer to URIs or a. The Statements table contains references to items in the Literals and Resources tables, reducing disk space usage. A variation on this approach, the denormalized triple store, attempts to limit the number of joins that would occur across the Statements, Resources, and Literals tables. Instead of always storing a reference to the Literals and Resources tables, the Statements table holds the literal or resource itself so long as it is smaller than a certain limit, e.g. < 255 characters. Jena1 made use of normalized triple stores, and Jena2 makes use of denormalized triple stores (Wilkinson 2003). Oracle also uses normalized tables (Alexander n.d.). Data normalization or denormalization can be a property of any RDF storage scheme.

Dynamic Table Schema

Another method for storing RDF data is to recreate the RDF schema in a dynamic table schema. With this approach, classes and properties in RDF are mirrored in tables. In a book/author example, separate tables would be created for books, authors, and titles. Relationships among the tables express the RDF triples. One benefit of this approach is that queries can be made against the RDF schema itself (Matono 2005). A similar dynamic table schema is used by Sesame. Research suggests this method can perform with reasonable efficiency, especially when properties of the underlying DBMS are exploited effectively, e.g. using the object-relational features of Postgres to support native subclassing (Broekstra 2002).
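To make the self-join cost of a plain triple store concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `triples` and the three sample triples are assumptions chosen for illustration; they are not the data set used in this paper. Each additional triple pattern in a query adds one more self-join on the same table.

```python
import sqlite3

# In-memory database holding a single three-column triple table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

# Illustrative triples only (assumed, not taken from the paper's data).
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("r1", "paints", "r2"),
        ("r2", "title", "Guernica"),
        ("r1", "last", "Picasso"),
    ],
)

# "Title of anything painted by anyone": two triple patterns,
# so the triple table must be joined with itself once.
rows = conn.execute(
    """
    SELECT t.object
    FROM triples AS p
    JOIN triples AS t ON t.subject = p.object
    WHERE p.predicate = 'paints' AND t.predicate = 'title'
    """
).fetchall()
titles = [row[0] for row in rows]
print(titles)
```

A query with k triple patterns would need k - 1 such self-joins, which is the cost the schemes surveyed here try to reduce.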

There are several problems with the dynamic table schema. It can be inefficient, especially when many self-joins are required during query execution. RDF data cannot be stored without knowing the RDF schema needed to create the table schema. Also, if the RDF schema changes, the table schema must be recreated (Beckett 2003, Matono 2005).

Property Tables

In order to reduce the number of self-joins needed for queries, some applications have implemented property tables. In a property table, subjects with similar properties are stored together in denormalized, flat tables (Abadi 2007). Because all the data for a subject are stored in the same table, no joins are needed to retrieve them. Property tables can carry significant overhead when certain properties do not apply to certain subjects, because the number of NULL values in the table increases and storage space is wasted. Postgres handles this case comparatively well because it represents a NULL value with a single bit. Another drawback is that if queries require joins across many different property tables, or if the property tables were poorly configured, performance can actually be worse than with a traditional triple store. Jena2 implements property tables (Wilkinson 2003).

The property class table is a variation on the property table. Instead of storing all the values in a single table, triples are split up based on the availability of data so as to avoid storing NULL values, which saves space. The selection of column names must be made very carefully. This makes the implementation more dependent on the kind of data being stored, and it might therefore require changes based on the input data. The drawback, again, is that if a query refers to properties from more than one table, a join and a union might be required, and performance can be degraded. Property class tables are implemented in Jena2 (Wilkinson 2003).
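The trade-off described above can be sketched with a small property table in Python's sqlite3 module. The table name `artist_props` and the `sculpts` column are hypothetical names introduced for this illustration, and the two rows are assumed sample data. All properties of a subject come back from one row without a join, but every property that does not apply to a subject leaves a NULL cell behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One wide, denormalized table: one row per subject, one column per property.
# Table and column names are assumptions made for this sketch.
conn.execute(
    """
    CREATE TABLE artist_props (
        subject TEXT PRIMARY KEY,
        first   TEXT,
        last    TEXT,
        paints  TEXT,
        sculpts TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO artist_props VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "Pablo", "Picasso", "r2", None),  # painter: 'sculpts' does not apply
        ("r4", "August", "Rodin", None, "r5"),   # sculptor: 'paints' does not apply
    ],
)

# Several properties of one subject are read from a single row, with no join.
name = conn.execute(
    "SELECT first, last FROM artist_props WHERE subject = 'r4'"
).fetchone()

# Count the cells wasted on properties that do not apply to a subject.
# In SQLite, (x IS NULL) evaluates to 0 or 1, so the sum counts NULL cells.
null_cells = conn.execute(
    "SELECT SUM((paints IS NULL) + (sculpts IS NULL)) FROM artist_props"
).fetchone()[0]
print(name, null_cells)
```

Note that a single `paints` column cannot hold a multivalued property (one painter with several paintings) without further tables, which is one reason such tables need careful configuration.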

Network Model

Oracle differs from other RDF storage schemes in that it stores the RDF graph directly as part of the Oracle Spatial network data model (Wang 2003). Each triple is stored as a separate object. There are three main components to the table schema. RDF_VALUES holds the values of all parts of the triples. RDF_NODES stores subjects and objects. RDF_LINKS stores all the link information and properties as well as reification information. Oracle supports bags, which are unordered groups of multiple properties, e.g. Book A was written by X, Y, Z; sequences, which are similar to bags but with order imposed; and alts, which indicate alternative websites for a source, etc. This information is stored in the form of a graph, which facilitates making inferences by traversing the graph. Execution of RDF queries in Oracle is at the semantic level instead of the structural and syntactic level (Alexander n.d.).

Path-based Storage

Matono et al. (2005) propose storing RDF graph traversals in a path-based storage (PBS). Path types between nodes and the root nodes are stored in a path table. Instances of given path types are stored in a resource table, which stores a triple's object value as well as a path_id that indicates the object's location in the graph with respect to a root node. Several schema-level tables are also introduced, which store information about the RDF class hierarchy. See figure A for the full table schema.
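The two core tables of the path-based scheme can be sketched as follows with Python's sqlite3 module. The table and column names follow the SQL used in this paper (`path`, `rdf_resource`, `pathid`, `pathexp`, `resourcename`); the specific assignment of path expressions to path ids below is an assumption made for illustration, since the full schema figure is not reproduced here. The point is only that a lookup by path expression needs a single join, however long the path is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE path (pathid INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE rdf_resource (resourcename TEXT, pathid INTEGER);
    """
)

# Assumed mapping of path ids to path expressions (illustrative only).
# The empty expression marks a root node.
conn.executemany(
    "INSERT INTO path VALUES (?, ?)",
    [(1, ""), (4, "#paints"), (5, "#title<#paints")],
)
conn.executemany(
    "INSERT INTO rdf_resource VALUES (?, ?)",
    [("r1", 1), ("r2", 4), ("Guernica", 5)],
)

# The two-step path "title of something that is painted" is materialized
# as one path type, so one join between path and rdf_resource answers it.
rows = conn.execute(
    """
    SELECT r.resourcename
    FROM path AS p
    JOIN rdf_resource AS r ON p.pathid = r.pathid
    WHERE p.pathexp = ?
    """,
    ("#title<#paints",),
).fetchall()
titles = [row[0] for row in rows]
print(titles)
```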

Figure A: Path-based storage scheme

Vertical Partitioning

More recently, vertical partitioning (VP) has been proposed as an alternative to the schemes mentioned above (Abadi 2007). Here RDF values are stored in tables with two columns. Each table stores exactly one predicate, and rows in the table contain the subjects and objects that use that predicate. Multivalued properties can be captured easily by creating another row, and NULL values are avoided altogether. Since these tables are sorted and indexed on a dictionary-encoded subject, they can be joined efficiently. Property inferences can be made easily by joining the necessary tables. If inferences are known to be frequent operations, they can be computed and stored in advance. VP can be done in relational databases such as Postgres, but column-oriented databases, such as C-Store, offer significant advantages in terms of storage and query response time for these vertically partitioned tables. Abadi et al. (2007) have shown a performance increase of two orders of magnitude with VP in C-Store when compared with existing property-table schemes in other databases.

Our Contribution

The authors of PBS and VP have each shown their work to be experimentally better than standard storage models such as Jena's property tables. We therefore examine which factors are most important when comparing PBS and VP with each other across a wide range of query types. We have observed that, in general, PBS performs better when answering queries related to the RDF schema. VP is usually more efficient for retrieving RDF instance data, even when multiple joins are needed. We propose a modification to PBS that stores additional information about path expressions in the resource table: a root_node column, which indicates the root node associated with the given path and resource. This enhanced path-based storage, or EPBS, would contain a new rdf_resource_en table as follows:

rdf_resource_en

resourcename     pathid   root_node
'r1'             1        'r1'
'r2'             4        'r1'
'r3'             4        'r1'
'r4'             1        'r4'
'r5'             6        'r4'
'Picasso'        2        'r1'
'Pablo'          3        'r1'
'August'         2        'r4'
'Rodin'          3        'r4'
'Guernica'       5        'r1'
'Les Dem...'     5        'r1'
'The Thinker'    7        'r4'

Problem Statement

Given: A set of RDF triples; the vertical partitioning storage model proposed by Abadi et al. (2007); and the path-based storage model proposed by Matono et al. (2005).

To find: Query plans for the various categories of queries under these two storage schemes. (Query plans indicate the number of joins and the I/O cost.)

Objective: To determine query types that perform comparatively better or worse in the path-based approach or in vertical partitioning.

Constraint: To find the optimum storage method irrespective of the category of queries.

Assumptions

There are no cycles in the RDF schema (this assumption was also made by the authors of the path-based storage model). Also, inserts are performed far less frequently than select operations, and will not be considered. The RDF data (figure B) used will follow that in Matono et al. (2005).
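As a rough illustration of the first two inputs in the problem statement, the sketch below takes a set of triples and stores it under vertical partitioning using Python's sqlite3 module. The sample triples (including the pairing of r2 with the title 'Guernica' and the class name 'Painter') are assumptions made for this example, and SQLite is a row store, so this shows only the table layout, not the performance of a column store such as C-Store.

```python
import sqlite3
from collections import defaultdict

# Illustrative triples (assumed sample data, not the paper's data set).
triples = [
    ("r1", "paints", "r2"),
    ("r1", "paints", "r3"),       # a multivalued property is just another row
    ("r2", "title", "Guernica"),
    ("r1", "type", "Painter"),
]

# Group (subject, object) pairs by predicate: one group per future table.
by_predicate = defaultdict(list)
for subject, predicate, obj in triples:
    by_predicate[predicate].append((subject, obj))

conn = sqlite3.connect(":memory:")
for predicate, pairs in by_predicate.items():
    # Table names cannot be bound as parameters, so validate before formatting.
    if not predicate.isidentifier():
        raise ValueError(f"unsupported predicate name: {predicate!r}")
    conn.execute(f'CREATE TABLE "{predicate}" (subject TEXT, object TEXT)')
    conn.execute(f'CREATE INDEX "idx_{predicate}_subject" ON "{predicate}" (subject)')
    conn.executemany(f'INSERT INTO "{predicate}" VALUES (?, ?)', pairs)

# Titles of anything painted by anyone: one join between two narrow tables.
titles = [
    row[0]
    for row in conn.execute(
        "SELECT t.object FROM title AS t JOIN paints AS p ON p.object = t.subject"
    )
]

# The set of stored properties is the set of tables in the catalog.
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(titles, tables)
```

No table contains a NULL, because a subject that lacks a property simply has no row in that property's table.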

Figure B: RDF Schema, showing RDF class information as well as instance data

Proposed Approach

Our approach is to use both experimental and analytical evaluation on the two RDF storage schemes. We executed the benchmark queries in Oracle and C-Store. Oracle was used to implement the path-based storage, and C-Store was used for vertical partitioning. The sample data and schema are based on the schema used by Matono et al. (2005), shown in figure B.

SQL statements for queries and joins

1. Find the title of anything painted by anyone.

Path-based approach:
    SELECT r.resourcename
    FROM path p, rdf_resource r
    WHERE p.pathid = r.pathid
      AND p.pathexp = '#title<#paints';

Vertical partitioning:
    SELECT t.object
    FROM title t, paints p
    WHERE p.object = t.subject;

2. Find the descendants of a given class, e.g. find all types of artists.

Path-based approach:
    SELECT a.classname
    FROM class a
    WHERE a.pre > (SELECT b.pre FROM class b WHERE b.classname = 'artist')
      AND a.post < (SELECT c.post FROM class c WHERE c.classname = 'artist');

Vertical partitioning: This cannot be done because the class hierarchy information is not stored.

3. List all the properties stored in this database.

Path-based approach:
    SELECT propertyname FROM property;

Vertical partitioning: This requires a consistent naming pattern for property tables, e.g. prefixing all property tables with property:.
    SELECT tablename FROM system_schema WHERE tablename LIKE 'property:%';

4. Select all information about the node R4.

Path-based approach: This is not possible in the path-based schema without resorting to many self-joins on the triple table. Using the enhanced path-based schema, the following query can be issued:
    SELECT r.resourcename, p.pathexp
    FROM rdf_resource_en r, path p
    WHERE r.pathid = p.pathid
      AND r.root_node = 'r4';

Vertical partitioning: This query is similarly complex in the vertically partitioned scheme because we do not know in advance which properties could be applied to the node R4. Therefore we must query all possible property tables. If RDF schema information is available, only the requisite tables would need to be queried.

5. Select all possible property values of sculptor.

Path-based approach:

    SELECT propertyname FROM property WHERE domain = 'sculptor';

Vertical partitioning: This is not possible because the vertically partitioned model does not store schema-level information.

6. Give me all the titles of work by artists.

Path-based approach:
    SELECT resourcename
    FROM rdf_resource r, path p
    WHERE r.pathid = p.pathid
      AND p.pathexp LIKE '%title%';

Vertical partitioning:
    SELECT object FROM title;

7. Find the names of all sculptors.

Path-based approach: This is not possible in the path-based approach without resorting to multiple self-joins on the triple table.

Vertical partitioning:
    SELECT f.object, l.object
    FROM first f, last l, type t
    WHERE t.subject = f.subject
      AND f.subject = l.subject
      AND t.object = 'Sculptor';

8. List all the objects of type artifact.

Path-based approach: This query would degenerate to querying the triple table because the pathid does not distinguish between a sculptor and a painter.

Vertical partitioning:
    SELECT subject FROM type WHERE object = 'Artifact';

9. Diameter query for node R4. Select all nodes in the graph within one edge-length of R4.

Using the enhanced path-based schema:

    SELECT r.resourcename, p.pathexp
    FROM rdf_resource_en r, path p
    WHERE r.pathid = p.pathid
      AND r.root_node = 'r4'
      AND p.pathexp NOT LIKE '%<%';

10. Connectivity query: determine if any two arbitrary nodes are connected.

In the artist tables example, it is possible to perform this query using only one join in EPBS. In the more generic case, where root nodes are not well-known, it is only possible to answer a connectivity query using a recursive join. If the graph is well-known in advance and the connections are materialized in the tables, the precise number of joins can be determined before query execution. Here we suggest a query for determining if 'Picasso' is connected to 'Guernica':
    SELECT 'Connection Exists'
    FROM rdf_resource_en r, rdf_resource_en r2
    WHERE r.root_node = r2.root_node
      AND r.resourcename = 'Picasso'
      AND r2.resourcename = 'Guernica';

If all the paths connected to the main node are materialized and stored, all types of connection and diameter queries can be answered with one join. The main node is the node that has '' stored in the pathexp field of the path table. In vertical partitioning, this query can be answered only by joining across all tables. Given N tables in the schema, this would require N^N joins.

Validation

We executed the path-based model for query 1 using Oracle's query plan generator. The result is as follows:

PLAN_TABLE_OUTPUT
        SELECT STATEMENT                     (15)   00:00:01
*  1    HASH JOIN                            (15)   00:00:01
*  2    TABLE ACCESS FULL   PATH             (0)    00:00:01
   3    TABLE ACCESS FULL   RDF_RESOURCE     (0)    00:00:01

Similar query plans will be generated for all eight query types, in both Oracle (for path-based storage) and C-Store (for vertical partitioning). In certain cases, such as the path-based query for query 7, the result degenerates into many self-joins on the triple table. In such cases, we will not run a query plan for the query, as the result is obvious.
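The one-join connectivity check for EPBS can be tried out with Python's sqlite3 module. The rows below are copied from the rdf_resource_en table given earlier (the truncated title is kept exactly as it appears there); the helper name `connected` is introduced only for this sketch. Note that under this scheme two resources count as connected when they share a root node, which is a simplifying reading of connectivity rather than a general graph reachability test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rdf_resource_en (resourcename TEXT, pathid INTEGER, root_node TEXT)"
)
# Rows as listed in the rdf_resource_en table of this paper.
conn.executemany(
    "INSERT INTO rdf_resource_en VALUES (?, ?, ?)",
    [
        ("r1", 1, "r1"), ("r2", 4, "r1"), ("r3", 4, "r1"),
        ("r4", 1, "r4"), ("r5", 6, "r4"),
        ("Picasso", 2, "r1"), ("Pablo", 3, "r1"),
        ("August", 2, "r4"), ("Rodin", 3, "r4"),
        ("Guernica", 5, "r1"), ("Les Dem...", 5, "r1"),
        ("The Thinker", 7, "r4"),
    ],
)


def connected(db, first, second):
    """Return True if both resources hang off the same root node (one self-join)."""
    row = db.execute(
        """
        SELECT 1
        FROM rdf_resource_en AS r
        JOIN rdf_resource_en AS r2 ON r.root_node = r2.root_node
        WHERE r.resourcename = ? AND r2.resourcename = ?
        LIMIT 1
        """,
        (first, second),
    ).fetchone()
    return row is not None


picasso_guernica = connected(conn, "Picasso", "Guernica")
picasso_thinker = connected(conn, "Picasso", "The Thinker")
print(picasso_guernica, picasso_thinker)
```

String comparison is case-sensitive here, so the resource names must be spelled exactly as stored.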

Conclusions and Future Work

In general, the path-based storage is more effective for retrieval of RDF schema information, including queries about class inheritance or relationships between properties and classes. Queries 2, 3, and 5 are expected to perform better under the path-based storage. The vertically-partitioned approach, especially when implemented in a column-oriented database such as C-Store, tends to perform better on queries regarding instance data. Queries 6, 7, and 8 would perform better for vertically partitioned databases. In query 1, both approaches would require a single join; without further analysis of disk I/O costs it is not possible to determine which approach would be better for this type of query.

Some query types are very expensive and inefficient in both storage schemes. Query 4 requires many joins in the path-based approach. In vertical partitioning, this query requires knowledge not inherently stored in the table structure, and we would need a higher-level interface to translate schema-level information into the appropriate SQL joins; otherwise we resort to joins across all tables. If we add level information or an additional identifier to the path table, we can answer more queries about instance-level information, such as finding the artist who created a given painting.

References

[1] D. J. Abadi, A. Marcus, S. Madden, K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 07, September 23-28.
[2] N. Alexander, X. Lopez, S. Ravada, S. Stephens, J. Wang. RDF Data Model in Oracle. w3d_rdf_data_model.pdf. Oracle Corporation. Last visited September.
[3] D. Beckett. SWAD Europe Deliverable 10.1: Scalability and Storage: Survey of Free Software / Open Source RDF Storage Systems. W3C Semantic Web Advanced Development for Europe (SWAD-Europe). July 31. rdf_scalable_storage_report/. Last visited September.
[4] D. Beckett, J. Grant. SWAD Europe Deliverable 10.2: Mapping Semantic Web Data with RDBMSes. W3C Semantic Web Advanced Development for Europe (SWAD-Europe). January 31. Last visited September.
[5] T. Berners-Lee, J. Hendler, O. Lassila. The Semantic Web. Scientific American. May 17.
[6] J. Broekstra, A. Kampman, F. van Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. I. Horrocks and J. Hendler (Eds.): ISWC 2002, LNCS 2342.
[7] Jena: A Semantic Web Framework for Java. Last accessed September.
[8] A. Matono, T. Amagasa, M. Yoshikawa, S. Uemura. A Path-based Relational RDF Database. 16th Australasian Database Conference, January 31 - February 3. Conferences in Research and Practice in Information Technology, Vol. 39.
[9] openrdf.org, home of Sesame. Last accessed September.
[10] J. C. Wang. Oracle Spatial Network Data Model. Oracle Corporation Technical White Paper. December. g_network_model_twp.pdf. Last visited September.
[11] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. Hewlett-Packard Technical Report, HPL. December 16. Last visited September.
