DB2 NoSQL Graph Store

DB2 NoSQL Graph Store Mario Briggs mario.briggs@in.ibm.com December 13, 2012

Agenda
- Introduction
  - Some trends: NoSQL, data normalization evolution
  - Hybrid data
  - Comparing relational, XML and RDF
- RDF introduction
  - What is RDF
  - Use cases for RDF
  - How RDF is different from other NoSQL stores
- Why we built RDF into DB2
  - Benefits of RDF storage in DB2
- DB2 RDF features
  - Evolution
  - Differentiators
- Guidelines and summary

Trend: NoSQL
- NoSQL = "Not only SQL"
- NoSQL denotes a class of database systems that depart from traditional RDBMSs in one or more ways:
  - Data format / data model
  - Query language, APIs
  - Data consistency, etc.
- Goals: performance, scalability, simplicity, and schema flexibility for specific use cases and access patterns
- Not a generic data store

NoSQL Data Formats & APIs
- Anything that isn't relational:
  - Key-value pairs (e.g., HBase, Cassandra)
  - JSON (JavaScript Object Notation)
  - XML (Extensible Markup Language)
  - RDF (Resource Description Framework)
  - etc.
- Most NoSQL systems have:
  - no standardized query language
  - proprietary query APIs
- XML and RDF do support standard query languages: XPath for XML, SPARQL for RDF, etc.

Trend: Data Normalization Evolution
Two significant trends, both driven by the Web and both enabling new applications of data. The starting point is relational tables in 3rd normal form (column-based stores are a variant of row-based stores).
(1) De-normalized or not-normalized intact data: LOBs, XML, JSON, documents, etc.
(2) Highly normalized data: RDF (Resource Description Framework) triples and ontologies
See "Data Normalization Reconsidered":
http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/
http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/

Hybrid Data
XML: complete business records
- Good for representing business records that are shared, for schema flexibility, for versioning
- Query language: XPath and XQuery
- Proposals at the W3C to incorporate JSON support into XQuery
Relational: third normal form
- Versatile, works for many scenarios
- Typically normalized to 3rd normal form to avoid update anomalies and save storage
- Sometimes then de-normalized for improved understanding or performance
- Query language: SQL
RDF (Resource Description Framework): triples, Linked Data, graph data
- Good for data about things, for sharing data definitions, for relationships, for inferencing, for schema flexibility
- Part of the movement from the Web of Documents to the Web of Data
- Highly normalized
- Query language: SPARQL
Hybrid query and manipulation languages
- Relational and XML: standardized integration of XQuery and SQL (SQL/XML)
- RDF (triples): no hybrid language for integration with relational

Comparing Relational, XML and RDF
                | Relational | XML | RDF
Structure       | Tables | Trees | Graphs
Data shape      | Flat, highly structured | Hierarchical data | Linked data
Business record | Multiple rows in multiple tables represent a business record | Nodes in trees represent business records | Triples represent business records and their properties via URIs
Normalization   | Flexible normalization | Denormalized | Extreme normalization
Schema          | Fixed schema | No or flexible schema | Highly flexible
Query language  | SQL (ANSI/ISO) | XPath/XQuery (W3C) | SPARQL (W3C)

What is RDF?
- A method to represent information as triples: (subject, predicate, object)
- Each triple describes the relationship between two things, e.g. (IBM, is-a, Company)
- A set of triples defines a graph
- Relations are part of the data, not part of the database structure
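As a minimal sketch of the idea (not DB2's storage format), a graph can be modeled in Python as a set of 3-tuples. Only the triple (IBM, is-a, Company) comes from the slide; the other triples and the helper name `outgoing_edges` are illustrative assumptions.

```python
# Each triple is (subject, predicate, object); a set of triples is a graph.
triples = {
    ("IBM", "is-a", "Company"),
    ("IBM", "sells", "DB2"),      # illustrative triple
    ("ABC", "uses", "DB2"),       # illustrative triple
}


def outgoing_edges(graph, subject):
    """Return the labeled edges leaving `subject` as sorted (predicate, object) pairs."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# A new kind of relationship needs no schema change: it is just another triple.
triples.add(("IBM", "has-supplier", "ABC"))

print(outgoing_edges(triples, "IBM"))
```

The point of the sketch is that "has-supplier" was never declared anywhere: the relation lives in the data, which is what the last bullet on the slide means.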

RDF Use Cases
There are three major use cases for RDF, mainly because RDF allows complex queries across data with a variable schema.
1. Data integration. Each data source has its own data model, and each model's schema evolves differently, with different or the same entities and properties.
2. Unstructured data access. Metadata generated by extractors for videos, text and images has different entities and relations (depending on the extractor).
3. Collaboratively developed repositories of knowledge. E.g. Wikipedia/DBpedia and Freebase have entities and properties that evolve as users add entities to the system.

More on RDF
Technically, an RDF data set is a labeled directed graph where each edge represents a triple.
[Figure: example graph with nodes IBM, ABC, XYZ, WebSphere, DB2, Company, Supplier and Software, and edges "is", "sells", "uses", "has supplier" and "is subsidiary of"]
SUBJECT | PREDICATE        | OBJECT
IBM     | is a             | Company
IBM     | has supplier     | ABC
ABC     | is a             | Company
IBM     | sells            | DB2
IBM     | sells            | WebSphere
ABC     | uses             | DB2
ABC     | is subsidiary of | XYZ
XYZ     | is a             | Company

SPARQL: SPARQL Protocol and RDF Query Language
A query language to find sub-graph patterns.
Example: "Find all companies that sell a product used by one of their suppliers"
    SELECT ?comp ?product ?supplier
    WHERE {
      ?comp <isa> <Company> .
      ?comp <sells> ?product .
      ?comp <hassupplier> ?supplier .
      ?supplier <uses> ?product .
    }
[Figure: the query pattern (a company that has a supplier and sells a product the supplier uses) matched against the example graph]
Result: ?comp = IBM, ?product = DB2, ?supplier = ABC
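To make "finding sub-graph patterns" concrete, here is a toy matcher for the four triple patterns of the example query, evaluated over the triples of the example graph. It is a sketch under stated assumptions: a naive nested-loop join in plain Python, with hypothetical helper names `match` and `query`, and in no way how a SPARQL engine or DB2 evaluates queries.

```python
# Example graph from the earlier slide; predicates are spelled as in the query.
graph = [
    ("IBM", "isa", "Company"),
    ("IBM", "hassupplier", "ABC"),
    ("ABC", "isa", "Company"),
    ("IBM", "sells", "DB2"),
    ("IBM", "sells", "WebSphere"),
    ("ABC", "uses", "DB2"),
    ("ABC", "issubsidiaryof", "XYZ"),
    ("XYZ", "isa", "Company"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on a mismatch."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(graph, patterns):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                new = match(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings


patterns = [
    ("?comp", "isa", "Company"),
    ("?comp", "sells", "?product"),
    ("?comp", "hassupplier", "?supplier"),
    ("?supplier", "uses", "?product"),
]
print(query(graph, patterns))
```

Each additional pattern is another join against the full set of triples, which is why the storage layout discussed later in the deck matters for performance.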

RDF compared to other NoSQL stores
NoSQL key-value stores (such as HBase, Cassandra) store sets of values associated with a key. For example:
    John_Smith type Person
    John_Smith hasreport Jim_Hunt
    Jim_Hunt hasreport John_Doe
    John_Doe hascontactwith Tom_Smith
    Tom_Smith worksfor IBM
- Data like this can be represented in HBase etc.: KV stores can store the properties of a node in a graph
- But they have no JOIN functionality, which is crucial for RDF queries: you can't ask who among John Smith's reports has contact with someone who works at IBM
- No ability to traverse paths in a graph
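The question on this slide needs both a path traversal (following hasreport repeatedly) and a join (hascontactwith, then worksfor). A rough Python sketch over the five sample triples shows the work a graph query performs; the function names are hypothetical and the breadth-first traversal is just one possible strategy.

```python
from collections import deque

triples = [
    ("John_Smith", "type", "Person"),
    ("John_Smith", "hasreport", "Jim_Hunt"),
    ("Jim_Hunt", "hasreport", "John_Doe"),
    ("John_Doe", "hascontactwith", "Tom_Smith"),
    ("Tom_Smith", "worksfor", "IBM"),
]


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def reachable(graph, start, predicate):
    """Nodes reachable from `start` by following `predicate` one or more times (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in objects(graph, node, predicate):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reports_in_contact_with(graph, manager, employer):
    """Direct or indirect reports of `manager` who have a contact working for `employer`."""
    hits = set()
    for report in reachable(graph, manager, "hasreport"):
        for contact in objects(graph, report, "hascontactwith"):
            if employer in objects(graph, contact, "worksfor"):
                hits.add(report)
    return sorted(hits)


print(reports_in_contact_with(triples, "John_Smith", "IBM"))
```

With a plain key-value API, each call to `objects` would be a separate client-side lookup, so the application itself ends up implementing the join and the traversal.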

Why we built RDF in DB2
Internal SWG projects using open-source triple stores faced problems with transactions, concurrency and isolation.
Key requirements:
- Transactional support. Eventual consistency is not sufficient in most cases.
- Concurrent access. This is where the open-source systems that our internal projects had used were weak.
- Security and access control: (a) graph-level access control, (b) specialized predicates determining access.
- Ride on top of the relational system's existing enterprise capabilities instead of reinventing the wheel: ACID, security, backup/recovery, compression, load balancing and parallel execution.

Traditional Approach for Mapping RDF in an RDBMS
- RDF data properties: 1000s of entities and predicates; variable and sparse.
- The standard way RDF is modeled in relational: a table with 3 columns.
- Problem: too many self-joins, even when accessing different predicates of the same node.
      John_Smith type ?p
      John_Smith hasreport ?z
      Jim_Hunt hastitle ?v
  This requires 3 JOINs (whereas in a normal relational model it is a single-row fetch). No good use of RDBMS indexes.
* Most SPARQL queries exhibit this notion of being star queries.
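The self-join problem can be reproduced with any relational engine. The sketch below uses Python's built-in sqlite3 module as a stand-in for an RDBMS (an assumption made only so the example runs anywhere); the table name `triples` and the value `Manager` are illustrative. Reading three properties of one node touches the three-column table three times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("John_Smith", "type", "Person"),
        ("John_Smith", "hasreport", "Jim_Hunt"),
        ("John_Smith", "hastitle", "Manager"),   # illustrative value
        ("Jim_Hunt", "type", "Person"),
    ],
)

# A star query: three predicates of the same subject.
# Every predicate needs its own scan of the triple table (aliases t1, t2, t3).
star_query = """
SELECT t1.object, t2.object, t3.object
FROM triples AS t1
JOIN triples AS t2 ON t2.subject = t1.subject
JOIN triples AS t3 ON t3.subject = t1.subject
WHERE t1.subject = 'John_Smith'
  AND t1.predicate = 'type'
  AND t2.predicate = 'hasreport'
  AND t3.predicate = 'hastitle'
"""
rows = conn.execute(star_query).fetchall()
print(rows)
```

In a conventional relational design the same information would sit in one row of a person table, so the query would be a single indexed lookup.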

DB2 Approach for Mapping RDF
All predicates about a subject/object are lined up in a single row (or a minimal number of rows, to handle variability).
Benefits:
- Lookup by subject/object is now via a standard, efficient RDBMS index.
- Single-row fetch for accessing different properties of a node (no joins required).
Handling variable predicates and sparsity:
- Hash the predicate to determine the column. Use multiple hash functions to reduce collisions. Spill to a new row if it still collides.
- Exploit predicate correlations if sample data is available. E.g., age and social security number co-occur as predicates of Person, and headquarters and revenue co-occur as predicates of Company, but age and revenue never occur together in any entity.
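The hashing scheme can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the column count, the use of SHA-256 with a seed as the "multiple hash functions", and the names `column_candidates`, `store` and `lookup`. It shows the idea from the slide (hash to a column, try a second hash, spill to a new row), not DB2's actual implementation.

```python
import hashlib

NUM_COLUMNS = 4   # hypothetical number of (predicate, object) column pairs per row
NUM_HASHES = 2    # hypothetical number of hash functions


def column_candidates(predicate):
    """Candidate column indexes for a predicate, one per hash function, without duplicates."""
    candidates = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{predicate}".encode()).digest()
        candidates.append(int.from_bytes(digest[:4], "big") % NUM_COLUMNS)
    return list(dict.fromkeys(candidates))


def store(rows, subject, predicate, obj):
    """Put (predicate, obj) in a free candidate column of the subject's rows, else spill."""
    subject_rows = rows.setdefault(subject, [])
    candidates = column_candidates(predicate)
    for row in subject_rows:
        for col in candidates:
            if col not in row:
                row[col] = (predicate, obj)
                return
    # All candidate columns are taken in every existing row: spill to a new row.
    subject_rows.append({candidates[0]: (predicate, obj)})


def lookup(rows, subject, predicate):
    """Fetch all objects for (subject, predicate) by probing only the candidate columns."""
    return [
        row[col][1]
        for row in rows.get(subject, [])
        for col in column_candidates(predicate)
        if col in row and row[col][0] == predicate
    ]


rows = {}
for triple in [("IBM", "isa", "Company"), ("IBM", "sells", "DB2"),
               ("IBM", "hassupplier", "ABC"), ("IBM", "sells", "WebSphere")]:
    store(rows, *triple)
print(sorted(lookup(rows, "IBM", "sells")))
```

A lookup only ever probes the few columns a predicate can hash to, within the rows of one subject, which is what makes the single-row (or few-row) fetch possible. The sketch stores multi-valued predicates in separate cells rather than through a secondary table.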

What does a DB2 RDF store look like at the back end?
[Figure: the example graph from the earlier slides, shown next to the four tables below]
Each primary table has the columns Subject (or Object), Graph, and pred1/obj1 through pred6/obj6 (pred1/sub1 through pred6/sub6 in the reverse table); only the populated columns are shown here.

Direct Primary
Subject | pred1, obj1   | pred2, obj2    | pred3, obj3
IBM     | Is A, Company | Sells, REF#1   | Has Supplier, ABC
ABC     | Uses, DB2     | Is A, Supplier |

Direct Secondary
List ID | Value
REF#1   | DB2
REF#1   | WebSphere

Reverse Primary
Object  | pred1, sub1 | pred2, sub2
DB2     | sells, IBM  | uses, ABC
Company | Is A, REF#2 |

Reverse Secondary
List ID | Value
REF#2   | IBM
REF#2   | ABC

DB2 RDF features across releases
Released in DB2 10.1:
- Supported SPARQL 1.0 and some SPARQL 1.1 features
- Supported FGAC with RDF/SPARQL
In DB2 10.1 FP2:
- SPARQL 1.1 (minus Property Paths, Negation)
- SPARQL 1.1 Update
- SPARQL 1.1 Graph Store HTTP Protocol
- Support for querying versioned RDF graphs
- A number of performance enhancements:
  - SPARQL-to-SQL cache
  - Single recursive SQL for DESCRIBE queries
  - Streaming bulk loaders

DB2 RDF support from all programming languages
- In FP2, SPARQL queries, updates and Graph Store operations are all supported out of the box over HTTP
- Available from any programming language
- Integrated with Apache Fuseki
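Because the endpoint speaks the standard SPARQL protocol over HTTP, a client needs nothing beyond an HTTP library. The following Python sketch uses only the standard library; the endpoint URL (host, port and dataset name) is a hypothetical placeholder, and `build_request` and `run_select` are illustrative helper names rather than part of any DB2 or Fuseki API.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL: adjust host, port and dataset name to your installation.
ENDPOINT = "http://localhost:3030/db2rdf/query"


def build_request(endpoint, sparql):
    """Build a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": sparql})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint, sparql):
    """Send a SELECT query and return one {variable: value} dict per result row."""
    with urllib.request.urlopen(build_request(endpoint, sparql)) as response:
        results = json.load(response)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


# Example call (needs a running endpoint, so it is left commented out):
# for row in run_select(ENDPOINT, "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"):
#     print(row)
```

The response layout used in `run_select` follows the W3C SPARQL JSON results format, so the same client should work against any compliant endpoint.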

DB2 RDF Security and Access Control
- Access control for RDF exploits DB2's fine-grained access control (FGAC) facility.
- The granularity of control is a set of triples that are in the same graph.

Source     | Graph | PID | PCPID | Col 1 | Col 2   | Col 3  | Col 4
John_Smith | g1    | 1   | 2     | type  | Patient | hasssn | 0123-456-89
Jim_Hunt   | g2    | 2   | 2     | type  | Patient | hasssn | 0245-361-99
John_Doe   | g3    | 3   | 3     | type  | Patient |        |

Goal: let patients see their own data, and let physicians see their patients' data.
- Segregate the information for each patient into different graphs.
- Provide system predicates to the DB2 RDF store so that each such predicate gets a dedicated column, which can be used for FGAC.
- Use DB2 to configure rules that specify access to the row by role and by the identity of SESSION_USER.

RDF in DB2: How users consume RDF
Developing customized SPARQL endpoints:
- Use the JENA Java APIs in a web app to talk to DB2.
- Add rdfstore.jar and the dependent jar files, which ship with all DB2 clients, to the application classpath.
Need an out-of-the-box SPARQL endpoint:
- Download and install Fuseki.
- Add db2rdfstore.jar to the classpath.
- Make entries for DB2 in the configuration file.

Data Characteristics and Guidelines
Intact data (not normalized)
- Characteristics:
  - Identifiers are usually values, e.g., SSN, ISBN; global identifiers such as URLs are usually generated via REST / Web APIs
  - Schemas can be globally or locally defined
  - Query, transformation and schema languages exist or are emerging
- Usage guideline: use intact data when it
  - matches the typical unit of retrieval and manipulation, e.g., data exchange, audit and logging use cases
  - is the unit of integrity and versioning, e.g., a business record
RDF (highly normalized)
- Characteristics:
  - Global identifiers are used throughout to facilitate integration: URIs; Linked Data URLs
  - Ontologies are typically globally defined
  - Query, transformation and schema languages exist and new ones are emerging
- Usage guideline: use RDF when it
  - matches the typical unit of retrieval and manipulation, e.g., integration and inferencing use cases
- Note: RDF is usually unsuitable for managing records that need coordinated integrity or need to be versioned. RDF usually represents the latest version only.

Use Cases: Normalized versus Non-normalized Storage
Consider RDF for linking data across heterogeneous data sources and for inferencing.
Use case properties suitable for non-normalized data representation, for example, XML:
1. Data access is "object-centric" (all or most pieces of a business record are accessed together)
2. Intact business records are exchanged via web services or SOA
3. Versioning is required: updates are replaced by inserts of immutable versions
4. Schema evolution
5. Auditing and compliance of business records are critical
Use case properties suitable for normalized or semi-denormalized data representation:
1. Data access is set-oriented or column-oriented, for example for analytics
2. Original business records do not need to be reassembled
3. Only the latest state of each business record needs to be retained
4. Schema is mature, stable, unlikely to evolve
5. Audit/compliance requirements are short-term, weak, or absent

DB2 RDF Summary
Improved performance
- Optimized mechanism to store RDF triples in DB2
- Exploits DB2 capabilities including ACID, compression, load balancing, parallel execution and scalability
Easier development
- Accessible from any programming language via HTTP endpoints
- Support for SPARQL 1.1 standards (Query, Update, Graph Store)
- Support for popular RDF Java APIs like JENA
Easier administration
- Exploits DB2 advanced security like FGAC, DB2 backup and recovery, and standard data management practices