CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Graph databases Neo4j syntax and examples Document databases

CISC 7610 Lecture 4 Approaches to multimedia databases Topics: Graph databases Neo4j syntax and examples Document databases

NoSQL architectures: different tradeoffs for different workloads Already seen: Generation 1 Hadoop for batch processing on commodity hardware Key-value stores for distributed non-transactional processing This lecture: Generation 2 Document databases for better fit with object-oriented code This lecture also: Generation 3 Graph databases for modeling relationships between things Column stores for efficient analytics

Graph databases Most databases store information about things: key-value stores, document databases, RDBMS Graph databases put the relationships between things on equal footing with the things themselves Examples: social networks, medical models, energy networks, access-control systems, etc. Can be modeling in RDBMSs using foreign keys and self-joins Generally hits performance issues when working with very large graphs SQL lacks an expressive syntax to work with graph data Key-value stores and document databases lack joins, would treat a graph as one document For multimedia, store metadata in graph, data in key-value store

Example graph Graphs have Vertices (AKA nodes ) Edges (AKA relationships ) Both can have properties

Example graph Relational schema for movie data

Converting relational to graph-based Each row in a entity table becomes a node Each entity table becomes a label on nodes Columns on those tables become node properties Foreign keys become relationships to the corresponding nodes in the other table Join tables become relationships, columns on those tables become relationship properties

Relational vs graph schema Relational Graph-based

Graph databases Issues with graphs in relational model SQL lacks the syntax to easily perform graph traversal Especially traversals where the depth is unknown or unbounded E.g., Kevin Bacon game, Erdos number Performance degrades quickly as we traverse the graph Each level of traversal adds significantly to query response time.

Example graph SQL query to find actors who co-starred with Keanu Reeves Relational schema for movie data

Issues with other database models for graph data Finding co-stars requires 5-way join Add another 3 joins for each level deeper of query No syntax to allow arbitrary or unknown depth No syntax to expand the whole graph (unknown depth) Even with indexing, each join requires an index lookup for each actor and movie Fast, but memory-inefficient solution: Load the tables in their entirety into map structures in memory For key-value stores and document databases, graph must be traversed in application code

Example graph database: Neo4j Property graph model, nodes and edges both have properties Neo4j is the most popular graph database Written in Java Easily embedded in any Java application or run as a standalone server Supports billions of (graph) nodes, ACID compliant transactions, and multiversion consistency. Implements declarative graph query language Cypher Query graphs in a way somewhat similar to SQL Used to quickly analyze the Panama Papers leaks

Neo4j uses Cypher graph query language Declarative query language (like SQL) ASCII-Art notation for nodes and edges Results are graphs as well But can be displayed as tables

Cypher syntax: nodes/entities Nodes appear in parentheses: (a), (b) a and b are variables that can be referred to later in the query Can access variables properties, e.g. a.name Can request nodes with a specific label using colon: a:person Other properties can be specified in braces: (a:person {name: Mike }) https://neo4j.com/developer/cypher-query-language/

Cypher syntax: edges/relationships Arrows for relationships --> With variables and properties in square brackets: -[r:likes]-> Can include path length: -[r:likes*..4]-> https://neo4j.com/developer/cypher-query-language/

Cypher graph creation: CREATE Create a Person with name property of You : CREATE (you:person {name:"you"}) RETURN you https://neo4j.com/developer/cypher-query-language/

Cypher graph querying: MATCH Query using MATCH These are all equivalent for the one-node graph we just created MATCH (n) RETURN n; MATCH (n:person) RETURN n; MATCH (n:person {name: You }) RETURN n; https://neo4j.com/developer/cypher-query-language/

Cypher graph creation: MATCH and CREATE Add a new relationship between the existing node and a new node MATCH (you:person {name:"you"}) CREATE (you)-[like:like]->(neo:database {name:"neo4j" }) RETURN you,like,neo https://neo4j.com/developer/cypher-query-language/

Cypher graph creation: FOREACH to loop over things Add a new relationship between the existing node and several new nodes MATCH (you:person {name:"you"}) FOREACH (name in ["Johan","Rajesh","Anna","Julia","Andrew"] CREATE (you)-[:friend]->(:person {name:name})) https://neo4j.com/developer/cypher-query-language/

Cypher graph creation Create new relationship between two existing entities MATCH (neo:database {name:"neo4j"}) MATCH (anna:person {name:"anna"}) CREATE (anna)-[:friend]->(:person:expert {name:"amanda"})-[:worked_with]->(neo) https://neo4j.com/developer/cypher-query-language/

Cypher graph querying Find someone in your network who can help you learn Neo4j MATCH (you {name:"you"}) MATCH (expert)-[:worked_with]->(db:database {name:"neo4j"}) MATCH path = shortestpath( (you)-[:friend*..5]- (expert) ) RETURN db,expert,path https://neo4j.com/developer/cypher-query-language/

Example graph database: Neo4j Create example graph CREATE (TheMatrix:Movie {title:'the Matrix', released:1999, tagline:'welcome to the Real World'}) CREATE (JohnWick:Movie {title:'john Wick', released:2014, tagline:'silliest Keanu movie ever'}) CREATE (Keanu:Person {name:'keanu Reeves', born:1964}) CREATE (AndyW:Person {name:'andy Wachowski', born:1967}) CREATE (Keanu)-[:ACTED_IN {roles:['neo']}]->(thematrix), (Keanu)-[:ACTED_IN {roles:['john Wick']}]->(JohnWick), (AndyW)-[:DIRECTED]->(TheMatrix)

Example graph database: Neo4j Retrieve info on one node MATCH (keanu:person {name:"keanu Reeves"}) RETURN keanu; +----------------------------------------+ keanu +----------------------------------------+ Node[1]{name:"Keanu Reeves",born:1964} +----------------------------------------+

Bigger graph database in Neo4j Find all co-stars of Keanu MATCH (kenau:person {name:"keanu Reeves"})- [:ACTED_IN] (movie)<-[:acted_in]-(costar) RETURN costar.name; +----------------------+ costar.name +----------------------+ "Jack Nicholson" "Diane Keaton" "Dina Meyer" "Ice-T" "Takeshi Kitano"

Bigger graph database in Neo4j Find all nodes within 2 hops of Keanu MATCH (kenau:person {name:"keanu Reeves"})-[*1..2]- (related) RETURN distinct related; +------------------------------------------------------ related +------------------------------------------------------ Node[0]{title:"The Matrix",released:1999, tagline: } Node[7]{name:"Joel Silver",born:1952} Node[5]{name:"Andy Wachowski",born:1967} Node[6]{name:"Lana Wachowski",born:1965}

Bigger graph database in Neo4j Results are graphs too

Bigger graph database in Neo4j Demo

Graph database internals: Index-free adjacency Graph processing can be performed on databases irrespective of their internal storage format It is a logical model, not a specific implementation Efficient real-time graph processing requires moving through graph without index lookups Referred to as index-free adjacency In RDBMS, indexes allow logical key values to be translated to a physical addresses Typically, three or four logical IO operations are required to traverse a B-Tree index Plus another lookup to retrieve the value Can be cached, but usually some disk IO required In a native graph database utilizing index-free adjacency, each node knows the physical location of all adjacent nodes So no need to use indexes to efficiently navigate the graph

Graph compute engines: Add graph interface on other models Implements efficient graph processing algorithms Exposes graph APIs Doesn t necessarily store data in index free adjacency graph Usually designed for batch processing of most or all of a graph Significant examples: Apache Giraph: graph processing on Hadoop using MapReduce GraphX: graph processing part in Spark (part of Berkeley Data Analytic Stack) Titan: graph processing on Big Data storage engines like Hbase and Cassandra

Strengths Graph databases Strengths and weaknesses Joins are precomputed Flexible schema Fast and scalable Weaknesses Harder to execute queries not embodied in relationships

Document databases Non-relational database that stores data as structured documents Usually XML or JSON formats Doesn t imply anything specific beyond the document storage model Could implement ACID transactions, etc Most provide modest transactional support Try to remove object-relational impedance mismatch Easy to incorporate media in documents

Document databases JSON-based document databases flourished 2008 because Address the conflict between object-oriented programming and the relational database model Self-describing document formats could be interrogated independently of the program that had created them Aligned well with the dominant web-based programming paradigms

JSON databases Not a single specification, just store data in JSON format, but usually Basic unit of storage is the Document a row in an RDBMS One or more key-value pairs May also contain nested documents and arrays Arrays may also contain documents, creating complex hierarchical structure A collection or bucket is a set of documents sharing some common purpose a relational table Documents in a collection don t have to be the same type Could implement 3 rd normal form schema But usually model data in a smaller number of collections With nested documents representing master-detail relationships

JSON format

Example JSON structure {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }}

Example JSON database Relational schema

Example JSON database Relational schema Example JSON database Document embedding Mirrors OO design

Example JSON database Could instead organize by linking to list of actor documents Document linking

Example JSON database Could instead organize by linking to list of actor documents Another option, closer to 3 rd normal form Less natural for document DB because of lack of joins Document linking

Example document DB: MongoDB Query with JavaScript

Document database summary Like a key-value store, but with self-documenting values Like a traditional RDBMS, but more scalable, less mis-match with object-oriented programming Many types of databases are adding support for JSON, so a JSON document database might soon be feature of other databases instead of a distinct type of database

Strengths Document databases Strengths and weaknesses Self-documenting schema Easier for non-programmers to query Can offer availability instead of strict consistency Weaknesses Typically weak transactional support Joins implemented in application code

Column databases RDBMs is tuned for OLTP: OnLine Transaction Processing Data in relations are organized by row (tuple) All data in a row are stored together Data warehouses used for analytics present a different workload: OLAP: OnLine Analytic Processing Aggregate information over many records to provide insight into trends Organizing relations by column has several advantages

Data warehousing In the 1970s, OLTP happened in real time, reports were generated in batches overnight In the 1980s and 90s, RDBMs started to be used for generating reports in real-time in parallel to the OLTP systems Known as data warehouses Star schema de-normalizes data somewhat to provide real-time responses

Star Schema For RDBMS data warehouse Large central fact table Measurements or metrics of each event Smaller dimension tables Attributes to describe facts Still CPU and I/O intensive to use

Instead: Column Databases Store relations by column

Column database advantages Two main advantages Acceleration of aggregation queries Because data are stored together Better data compression Because similar data are stored together

Column database disadvantages Slower to insert RDBMS requires one IO to insert a row Column database could require one per column Inserts are generally batched across a number of rows

C-store is a column store with a row-based cache Column store is bulk-loaded periodically Delta-store is updated in real-time Queries combine results from both Company Vertica commercialized it

Data are stored as Projections Columns that are accessed together can be stored together Workload-dependent Can be created manually Or on the fly by the query optimizer Or in bulk based on historical workloads Like indexes in RDBMS

Column store summary Designed for data warehousing and analytics Re-organize data for compression and aggregation Can be added on as feature to other DB systems Significant component in some in-memory DBs For analytics applications (SAP HANA) For multimedia, store metadata in column store, media in key-value store

Strengths Column store Strengths and weaknesses Fast aggregations Efficient compression Weaknesses Vanilla model expensive to insert one row

Summary NoSQL architectures utilize different tradeoffs for different workloads Document databases for better fit with objectoriented code Graph databases for modeling relationships between things Column stores for efficient analytics