DB2 NoSQL Graph Store Mario Briggs mario.briggs@in.ibm.com December 13, 2012
Agenda
Introduction
- Some Trends: NoSQL, Data Normalization Evolution
- Hybrid Data
- Comparing Relational, XML and RDF
RDF Introduction
- What is RDF
- Use cases for RDF
- How RDF is different from other NoSQL stores
- Why we built RDF into DB2
- Benefits of RDF storage in DB2
DB2 RDF Features
- Evolution
- Differentiators
Guidelines and Summary
Trend: NoSQL
NoSQL = "Not only SQL"
NoSQL denotes a class of database systems that depart from traditional RDBMSs in one or more ways:
- Data format / data model
- Query language, APIs
- Data consistency
- etc.
Goals: performance, scalability, simplicity, and schema flexibility for specific use cases and access patterns
Not a generic data store
NoSQL Data Formats & API
Anything that isn't relational:
- Key-value pairs (e.g., HBase, Cassandra)
- JSON (JavaScript Object Notation)
- XML (Extensible Markup Language)
- RDF (Resource Description Framework)
- etc.
Most NoSQL systems have:
- no standardized query language
- proprietary query APIs
RDF and XML stores do support standardized query languages: XPath for XML, SPARQL for RDF, etc.
Trend: Data Normalization Evolution
Two significant trends, both driven by the Web, both enabling new applications of data
Relational Tables
- 3rd Normal Form
- A variant on row-based stores is column-based stores
(1) De-normalized or Not-normalized
- Intact data: LOBs, XML, JSON, documents, etc.
(2) Highly Normalized
- RDF (Resource Description Framework)
- Triples and ontologies
See "Data Normalization Reconsidered":
http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/
http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/
Hybrid Data
XML
- Complete business records
- Good for representing business records that are shared, for schema flexibility, for versioning
- Query language: XPath and XQuery
- Proposals at the W3C to incorporate JSON support into XQuery
Relational
- Versatile, works for many scenarios
- Typically normalized to third normal form to avoid update anomalies and save storage
- Sometimes then de-normalized for improved understanding or performance
- Query language: SQL
RDF (Resource Description Framework) triples, Linked Data, Graph Data
- Good for data about things, for sharing data definitions, for relationships, for inferencing, for schema flexibility
- Part of the movement from the Web of Documents to the Web of Data
- Highly normalized
- Query language: SPARQL
Hybrid query and manipulation languages:
- Relational and XML: standardized integration of XQuery and SQL (SQL/XML)
- RDF (triples): no hybrid language for integration with relational
Comparing Relational, XML and RDF
Relational
- Tables
- Flat, highly structured
- Multiple rows in multiple tables represent a business record
- Flexible normalization
- Fixed schema
- SQL (ANSI/ISO)
XML
- Trees
- Hierarchical data
- Nodes in trees represent business records
- Denormalized
- No or flexible schema
- XPath/XQuery (W3C)
RDF
- Graphs
- Linked data
- Triples represent business records and their properties via URIs
- Extreme normalization
- Highly flexible schema
- SPARQL (W3C)
What is RDF?
A method to represent information as triples: (subject, predicate, object)
Each triple describes the relationship between two things, e.g.: (IBM, is-a, Company)
A set of triples defines a graph
Relations are part of the data, not part of the database structure
RDF Use Cases
Three major use cases for RDF, mainly because RDF allows complex queries across data with a variable schema.
1. Data integration. Each data source has its own data model, and each model's schema evolves differently, with different or overlapping entities and properties.
2. Unstructured data access. Metadata generated by extractors for videos/text/images has different entities and relations (depending on the extractor).
3. Collaboratively developed repositories of knowledge. E.g., Wikipedia/DBpedia and Freebase have entities and properties that evolve as users add entities to the system.
More on RDF
Technically, a labeled directed graph where each edge represents a triple.
[Figure: graph with nodes IBM, ABC, XYZ, DB2, WebSphere, Company, Supplier and Software, connected by edges labeled is, sells, uses, has supplier and is subsidiary of]
SUBJECT   PREDICATE          OBJECT
IBM       is                 Company
IBM       has supplier       ABC
ABC       is                 Company
IBM       sells              DB2
IBM       sells              WebSphere
ABC       uses               DB2
ABC       is subsidiary of   XYZ
XYZ       is                 Company
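A minimal Python sketch of the same idea: the triples from the table become edges of a labeled directed graph. The `build_graph` helper and the adjacency-map layout are illustrative assumptions, not part of DB2; the "is" predicate is an assumed reading of the edge labels in the figure.

```python
from collections import defaultdict

# Triples taken from the example table. The "is" predicate is an assumed
# reading of the graph's edge labels.
TRIPLES = [
    ("IBM", "is", "Company"),
    ("IBM", "has supplier", "ABC"),
    ("ABC", "is", "Company"),
    ("IBM", "sells", "DB2"),
    ("IBM", "sells", "WebSphere"),
    ("ABC", "uses", "DB2"),
    ("ABC", "is subsidiary of", "XYZ"),
    ("XYZ", "is", "Company"),
]


def build_graph(triples):
    """Return an adjacency map: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(TRIPLES)
for predicate, obj in graph["IBM"]:
    print("IBM", "--" + predicate + "-->", obj)
```

Because the predicate is stored with each edge, adding a new kind of relationship only adds data; no structure has to change.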
SPARQL: SPARQL Protocol and RDF Query Language
A query language to find sub-graph patterns
Example: "Find all companies that sell a product that one of their suppliers uses"
    SELECT ?comp ?product ?supplier
    WHERE {
      ?comp <isa> <Company> .
      ?comp <sells> ?product .
      ?comp <hassupplier> ?supplier .
      ?supplier <uses> ?product
    }
[Figure: the query pattern (?comp is a Company, sells ?product, has supplier ?supplier, who uses ?product) matched against the example graph of IBM, ABC, XYZ, DB2 and WebSphere]
Result: ?comp = IBM, ?product = DB2, ?supplier = ABC
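To make "finding sub-graph patterns" concrete, here is a small Python sketch that evaluates the four triple patterns of the example query against the example triples by naive backtracking. It only illustrates the matching semantics; it is not how DB2 or any SPARQL engine is implemented, and the `match` function is a hypothetical helper.

```python
# Example data, using the predicate names that appear in the query.
TRIPLES = [
    ("IBM", "isa", "Company"),
    ("IBM", "hassupplier", "ABC"),
    ("ABC", "isa", "Company"),
    ("IBM", "sells", "DB2"),
    ("IBM", "sells", "WebSphere"),
    ("ABC", "uses", "DB2"),
    ("ABC", "issubsidiaryof", "XYZ"),
    ("XYZ", "isa", "Company"),
]

# The WHERE clause as a list of triple patterns; terms starting with "?" are variables.
PATTERNS = [
    ("?comp", "isa", "Company"),
    ("?comp", "sells", "?product"),
    ("?comp", "hassupplier", "?supplier"),
    ("?supplier", "uses", "?product"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns (naive backtracking)."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, triples, candidate)


results = list(match(PATTERNS, TRIPLES))
print(results)
```

The shared variables (`?comp`, `?supplier`, `?product`) are what turn four independent lookups into a join over the graph.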
RDF compared to other NoSQL stores
NoSQL key-value stores (such as HBase, Cassandra) store sets of values associated with a key. For example:
    John_Smith type Person
    John_Smith hasreport Jim_Hunt
    Jim_Hunt hasreport John_Doe
    John_Doe hascontactwith Tom_Smith
    Tom_Smith worksfor IBM
In HBase and similar stores, this data can be represented as key-value entries.
KV stores can store properties for a node in a graph
But there is no JOIN functionality, which is crucial for RDF queries
- Can't ask who among John Smith's reports has contact with someone who works at IBM
- No ability to traverse paths in a graph
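The following Python sketch illustrates the point under simple assumptions: a dictionary stands in for a key-value store, and the question about John Smith's reports has to be answered by application code that issues one lookup per hop. The `get`, `reports_of` and `reports_with_contact_at` helpers are hypothetical, and treating "reports" as transitive is an assumption about the intended question.

```python
# A dictionary standing in for a key-value store: key -> {property: [values]}.
KV = {
    "John_Smith": {"type": ["Person"], "hasreport": ["Jim_Hunt"]},
    "Jim_Hunt": {"hasreport": ["John_Doe"]},
    "John_Doe": {"hascontactwith": ["Tom_Smith"]},
    "Tom_Smith": {"worksfor": ["IBM"]},
}


def get(key, prop):
    """The only operation the store offers: fetch one property of one key."""
    return KV.get(key, {}).get(prop, [])


def reports_of(manager):
    """Follow hasreport edges transitively, one store lookup per visited node."""
    seen, stack = [], list(get(manager, "hasreport"))
    while stack:
        person = stack.pop()
        if person not in seen:
            seen.append(person)
            stack.extend(get(person, "hasreport"))
    return seen


def reports_with_contact_at(manager, employer):
    """The join the store cannot do: reports -> contacts -> employer."""
    return [
        report
        for report in reports_of(manager)
        if any(employer in get(contact, "worksfor")
               for contact in get(report, "hascontactwith"))
    ]


print(reports_with_contact_at("John_Smith", "IBM"))
```

All of the traversal and join logic lives in the client here, which is the work a SPARQL query would hand to the store instead.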
Why we built RDF in DB2
Internal SWG projects using open-source triple stores faced problems with transactions, concurrency and isolation
Key requirements:
- Transactional support. Eventual consistency is not sufficient in most cases.
- Concurrent access. This is where the open-source systems that our internal projects had used were weak.
- Security and access control: (a) graph-level access control, (b) specialized predicates determining access.
- Ride on top of the relational system's existing enterprise capabilities instead of reinventing the wheel: ACID, security, backup/recovery, compression, load balancing and parallel execution.
Traditional Approach for Mapping RDF in an RDBMS
RDF data properties: 1000s of entities and predicates; variable and sparse.
Standard way RDF is modeled in relational: a table with 3 columns
Problem: too many self-joins, even when accessing different predicates of the same node.
    John_Smith type ?p
    John_Smith hasreport ?z
    Jim_Hunt hastitle ?v
Requires joining three instances of the triple table (whereas in a normal relational model, this is a single-row fetch). No good use of RDBMS indexes.
* Most SPARQL queries are star queries of this kind.
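A small Python sketch using the standard-library `sqlite3` module shows what the three-column mapping costs. SQLite stands in for the RDBMS, the table and column names (`triples`, `subj`, `pred`, `obj`) are hypothetical, and the `hastitle` value is invented for illustration. For simplicity the sketch reads three predicates of a single subject, which is the star-query shape mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The "standard" mapping: one table with three columns.
conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("John_Smith", "type", "Person"),
        ("John_Smith", "hasreport", "Jim_Hunt"),
        ("John_Smith", "hastitle", "Manager"),  # invented value for illustration
        ("Jim_Hunt", "hasreport", "John_Doe"),
    ],
)

# Reading three predicates of one node needs three instances of the same table.
STAR_QUERY = """
SELECT t1.obj, t2.obj, t3.obj
FROM triples AS t1
JOIN triples AS t2 ON t2.subj = t1.subj
JOIN triples AS t3 ON t3.subj = t1.subj
WHERE t1.subj = ?
  AND t1.pred = 'type'
  AND t2.pred = 'hasreport'
  AND t3.pred = 'hastitle'
"""

rows = conn.execute(STAR_QUERY, ("John_Smith",)).fetchall()
print(rows)
```

Every additional predicate in the star adds another alias of the same table to the query.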
DB2 Approach for Mapping RDF
All predicates about a subject / object are lined up in a single row (or a minimal number of rows, to handle variability)
Benefits:
- Lookup by subject/object is now via a standard, efficient RDBMS index.
- Single-row fetch for accessing different properties of a node (no joins required).
Handling variable predicates and sparsity:
- Hash the predicate to determine its column.
- Use multiple hash functions to reduce collisions.
- Spill to a new row if it still collides.
- Use predicate correlations if sample data is available. E.g., age and social security number co-occur as predicates of Person, and headquarters and revenue co-occur as predicates of Company, but age and revenue never occur together in any entity.
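The hashing scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions, not DB2's actual algorithm: the column count, the two salted SHA-256 hash functions, and the `store_subject` / `lookup` helpers are all hypothetical, and multi-valued predicates are left out.

```python
import hashlib

NUM_COLUMNS = 4  # hypothetical number of predicate/object column pairs per row


def column_candidates(predicate, num_columns=NUM_COLUMNS):
    """Two salted hash functions, each mapping a predicate to a column index."""
    columns = []
    for salt in ("h1:", "h2:"):
        digest = hashlib.sha256((salt + predicate).encode("utf-8")).digest()
        columns.append(int.from_bytes(digest[:4], "big") % num_columns)
    return columns


def store_subject(pairs, num_columns=NUM_COLUMNS):
    """Place (predicate, object) pairs of one subject into fixed-width rows."""
    rows = [[None] * num_columns]
    for predicate, obj in pairs:
        placed = False
        for row in rows:
            # Try each hash function's column in turn within this row.
            for col in column_candidates(predicate, num_columns):
                if row[col] is None:
                    row[col] = (predicate, obj)
                    placed = True
                    break
            if placed:
                break
        if not placed:
            # Every candidate column is taken in every row: spill to a new row.
            spill = [None] * num_columns
            spill[column_candidates(predicate, num_columns)[0]] = (predicate, obj)
            rows.append(spill)
    return rows


def lookup(rows, predicate, num_columns=NUM_COLUMNS):
    """Only the predicate's candidate columns need to be inspected."""
    columns = dict.fromkeys(column_candidates(predicate, num_columns))
    return [
        row[col][1]
        for row in rows
        for col in columns
        if row[col] is not None and row[col][0] == predicate
    ]


pairs = [("is a", "Company"), ("has supplier", "ABC"), ("sells", "DB2")]
rows = store_subject(pairs)
print(rows)
print(lookup(rows, "sells"))
```

A reader of the row knows which columns to check for a given predicate without scanning the others, which is what keeps a star-shaped access to a single-row fetch in the common case.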
What does a DB2 RDF Store look like at the backend
[Figure: the example graph of IBM, ABC, XYZ, DB2, WebSphere, Company, Supplier and Software]
Direct Primary (columns pred1/obj1 through pred6/obj6; only populated columns shown)
Subject  Graph  pred1  obj1     pred2  obj2      pred3         obj3
IBM             Is A   Company  Sells  REF#1     Has Supplier  ABC
ABC             Uses   DB2      Is A   Supplier
Direct Secondary
Graph  List ID  Value
       REF#1    DB2
       REF#1    WebSphere
Reverse Primary (columns pred1/sub1 through pred6/sub6; only populated columns shown)
Object   Graph  pred1  sub1   pred2  sub2
DB2             sells  IBM    uses   ABC
Company         Is A   REF#2
Reverse Secondary
Graph  List ID  Value
       REF#2    IBM
       REF#2    ABC
DB2 RDF features across Releases
Released in DB2 10.1
- Supported SPARQL 1.0 and some SPARQL 1.1 features
- Supported FGAC with RDF/SPARQL
In DB2 10.1 FP2
- SPARQL 1.1 (minus Property Paths, Negation)
- SPARQL 1.1 Update
- SPARQL 1.1 Graph Store HTTP Protocol
- Support for querying versioned RDF graphs
- A number of performance enhancements:
  - SPARQL-2-SQL cache
  - Single recursive SQL for DESCRIBE queries
  - Streaming bulk loaders
DB2 RDF support from all Programming Languages
In FP2, SPARQL queries, updates and Graph Store operations are all supported over HTTP out of the box
Available from any programming language
Integrated with Apache Fuseki
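As a sketch of what "available from any programming language" can look like, the Python snippet below builds a SPARQL 1.1 Protocol query request using only the standard library. The endpoint URL is a hypothetical local address, not a documented DB2 or Fuseki default for any particular installation, and the request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL; substitute the address of your own SPARQL endpoint.
ENDPOINT = "http://localhost:3030/ds/query"

QUERY = "SELECT ?comp ?product WHERE { ?comp <sells> ?product }"


def build_sparql_request(endpoint, query):
    """Build an HTTP GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the standard SPARQL JSON results media type.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
# To actually run the query, pass the request to urllib.request.urlopen(request)
# and parse the JSON response body.
```

Any language with an HTTP client can issue the same request, which is the point of exposing the store through a standard protocol.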
DB2 RDF Security and Access Control
Access control for RDF exploits DB2's fine-grained access control (FGAC) facility.
Granularity of control is a set of triples that are in the same graph
Source      Graph  PID  PCPID  Col 1  Col 2    Col 3   Col 4
John_Smith  g1     1    2      type   Patient  hasssn  0123-456-89
Jim_Hunt    g2     2    2      type   Patient  hasssn  0245-361-99
John_Doe    g3     3    3      type   Patient
Goal: let patients see their own data, and let physicians see their patients' data.
- Segregate information for each patient into different graphs.
- Provide system predicates to the DB2 RDF store so each predicate gets a dedicated column which can be used for FGAC.
- Use DB2 to configure rules that specify access to the row by role and identity of SESSION USER.
RDF in DB2: How Users consume RDF
Developing customized SPARQL endpoints
- Use the Jena Java APIs in a web app to talk to DB2.
- Add rdfstore.jar and its dependent jar files, which ship with all DB2 clients, to the application classpath.
Need an out-of-the-box SPARQL endpoint
- Download Fuseki and install it.
- Add db2rdfstore.jar to the classpath.
- Make entries for DB2 in the configuration file.
Data Characteristics and Guidelines
Intact Data (Not Normalized)
Characteristics:
- Identifiers are usually values, e.g., SSN, ISBN; global identifiers such as URLs are usually generated via REST / Web APIs
- Schemas can be globally or locally defined
- Query, transformation and schema languages exist or are emerging
Usage guideline: use intact data when it
- matches the typical unit of retrieval and manipulation, e.g., data exchange, audit and logging use cases
- is the unit of integrity and versioning, e.g., a business record
RDF (Highly Normalized)
Characteristics:
- Global identifiers are used throughout to facilitate integration: URIs; Linked Data URLs
- Ontologies are typically globally defined
- Query, transformation and schema languages exist and new ones are emerging
Usage guideline: use RDF when it
- matches the typical unit of retrieval and manipulation, e.g., integration and inferencing use cases
Note: RDF is usually unsuitable for managing records that need coordinated integrity or need to be versioned. RDF usually represents the latest version only.
Use Cases: Normalized versus Non-normalized Storage
Consider RDF for linking data across heterogeneous data sources and for inferencing
Use case properties suitable for non-normalized data representation, for example, XML:
1. Data access is "object-centric" (all or most pieces of a business record are accessed together)
2. Intact business records are exchanged via web services or SOA
3. Versioning is required: updates are replaced by inserts of immutable versions
4. Schema evolution
5. Auditing and compliance of business records are critical
Use case properties suitable for normalized or semi-denormalized data representation:
1. Data access is set-oriented or column-oriented, for example for analytics
2. Original business records do not need to be reassembled
3. Only the latest state of each business record needs to be retained
4. Schema is mature, stable, unlikely to evolve
5. Audit/compliance requirements are short-term, weak, or absent
DB2 RDF Summary
Improved Performance
- Optimized mechanism to store RDF triples in DB2
- Exploits DB2 capabilities including ACID, compression, load balancing, parallel execution and scalability
Easier Development
- Accessible from any programming language via HTTP endpoints
- Support for SPARQL 1.1 standards (Query, Update, Graph Store)
- Support for popular RDF Java APIs like Jena
Easier Administration
- Exploits DB2 advanced security like FGAC, DB2 backup and recovery, and standard data management practices