CISC 7610 Lecture 4 Approaches to multimedia databases Topics: Graph databases Neo4j syntax and examples Document databases
NoSQL architectures: different tradeoffs for different workloads Already seen: Generation 1 Hadoop for batch processing on commodity hardware Key-value stores for distributed non-transactional processing This lecture: Generation 2 Document databases for better fit with object-oriented code This lecture also: Generation 3 Graph databases for modeling relationships between things Column stores for efficient analytics
Graph databases Most databases store information about things: key-value stores, document databases, RDBMS Graph databases put the relationships between things on equal footing with the things themselves Examples: social networks, medical models, energy networks, access-control systems, etc. Can be modeling in RDBMSs using foreign keys and self-joins Generally hits performance issues when working with very large graphs SQL lacks an expressive syntax to work with graph data Key-value stores and document databases lack joins, would treat a graph as one document For multimedia, store metadata in graph, data in key-value store
Example graph Graphs have Vertices (AKA nodes ) Edges (AKA relationships ) Both can have properties
Example graph Relational schema for movie data
Converting relational to graph-based Each row in a entity table becomes a node Each entity table becomes a label on nodes Columns on those tables become node properties Foreign keys become relationships to the corresponding nodes in the other table Join tables become relationships, columns on those tables become relationship properties
Relational vs graph schema Relational Graph-based
Relational vs graph schema Relational Graph-based
Graph databases Issues with graphs in relational model SQL lacks the syntax to easily perform graph traversal Especially traversals where the depth is unknown or unbounded E.g., Kevin Bacon game, Erdos number Performance degrades quickly as we traverse the graph Each level of traversal adds significantly to query response time.
Example graph SQL query to find actors who co-starred with Keanu Reeves Relational schema for movie data
Example graph SQL query to find actors who co-starred with Keanu Reeves Relational schema for movie data
Issues with other database models for graph data Finding co-stars requires 5-way join Add another 3 joins for each level deeper of query No syntax to allow arbitrary or unknown depth No syntax to expand the whole graph (unknown depth) Even with indexing, each join requires an index lookup for each actor and movie Fast, but memory-inefficient solution: Load the tables in their entirety into map structures in memory For key-value stores and document databases, graph must be traversed in application code
Example graph database: Neo4j Property graph model, nodes and edges both have properties Neo4j is the most popular graph database Written in Java Easily embedded in any Java application or run as a standalone server Supports billions of (graph) nodes, ACID compliant transactions, and multiversion consistency. Implements declarative graph query language Cypher Query graphs in a way somewhat similar to SQL Used to quickly analyze the Panama Papers leaks
Neo4j uses Cypher graph query language Declarative query language (like SQL) ASCII-Art notation for nodes and edges Results are graphs as well But can be displayed as tables
Cypher syntax: nodes/entities Nodes appear in parentheses: (a), (b) a and b are variables that can be referred to later in the query Can access variables properties, e.g. a.name Can request nodes with a specific label using colon: a:person Other properties can be specified in braces: (a:person {name: Mike }) https://neo4j.com/developer/cypher-query-language/
Cypher syntax: edges/relationships Arrows for relationships --> With variables and properties in square brackets: -[r:likes]-> Can include path length: -[r:likes*..4]-> https://neo4j.com/developer/cypher-query-language/
Cypher graph creation: CREATE Create a Person with name property of You : CREATE (you:person {name:"you"}) RETURN you https://neo4j.com/developer/cypher-query-language/
Cypher graph querying: MATCH Query using MATCH These are all equivalent for the one-node graph we just created MATCH (n) RETURN n; MATCH (n:person) RETURN n; MATCH (n:person {name: You }) RETURN n; https://neo4j.com/developer/cypher-query-language/
Cypher graph creation: MATCH and CREATE Add a new relationship between the existing node and a new node MATCH (you:person {name:"you"}) CREATE (you)-[like:like]->(neo:database {name:"neo4j" }) RETURN you,like,neo https://neo4j.com/developer/cypher-query-language/
Cypher graph creation: FOREACH to loop over things Add a new relationship between the existing node and several new nodes MATCH (you:person {name:"you"}) FOREACH (name in ["Johan","Rajesh","Anna","Julia","Andrew"] CREATE (you)-[:friend]->(:person {name:name})) https://neo4j.com/developer/cypher-query-language/
Cypher graph creation Create new relationship between two existing entities MATCH (neo:database {name:"neo4j"}) MATCH (anna:person {name:"anna"}) CREATE (anna)-[:friend]->(:person:expert {name:"amanda"})-[:worked_with]->(neo) https://neo4j.com/developer/cypher-query-language/
Cypher graph querying Find someone in your network who can help you learn Neo4j MATCH (you {name:"you"}) MATCH (expert)-[:worked_with]->(db:database {name:"neo4j"}) MATCH path = shortestpath( (you)-[:friend*..5]- (expert) ) RETURN db,expert,path https://neo4j.com/developer/cypher-query-language/
Example graph database: Neo4j Create example graph CREATE (TheMatrix:Movie {title:'the Matrix', released:1999, tagline:'welcome to the Real World'}) CREATE (JohnWick:Movie {title:'john Wick', released:2014, tagline:'silliest Keanu movie ever'}) CREATE (Keanu:Person {name:'keanu Reeves', born:1964}) CREATE (AndyW:Person {name:'andy Wachowski', born:1967}) CREATE (Keanu)-[:ACTED_IN {roles:['neo']}]->(thematrix), (Keanu)-[:ACTED_IN {roles:['john Wick']}]->(JohnWick), (AndyW)-[:DIRECTED]->(TheMatrix)
Example graph database: Neo4j Retrieve info on one node MATCH (keanu:person {name:"keanu Reeves"}) RETURN keanu; +----------------------------------------+ keanu +----------------------------------------+ Node[1]{name:"Keanu Reeves",born:1964} +----------------------------------------+
Bigger graph database in Neo4j Find all co-stars of Keanu MATCH (kenau:person {name:"keanu Reeves"})- [:ACTED_IN] (movie)<-[:acted_in]-(costar) RETURN costar.name; +----------------------+ costar.name +----------------------+ "Jack Nicholson" "Diane Keaton" "Dina Meyer" "Ice-T" "Takeshi Kitano"
Bigger graph database in Neo4j Find all nodes within 2 hops of Keanu MATCH (kenau:person {name:"keanu Reeves"})-[*1..2]- (related) RETURN distinct related; +------------------------------------------------------ related +------------------------------------------------------ Node[0]{title:"The Matrix",released:1999, tagline: } Node[7]{name:"Joel Silver",born:1952} Node[5]{name:"Andy Wachowski",born:1967} Node[6]{name:"Lana Wachowski",born:1965}
Bigger graph database in Neo4j Results are graphs too
Bigger graph database in Neo4j Demo
Graph database internals: Index-free adjacency Graph processing can be performed on databases irrespective of their internal storage format It is a logical model, not a specific implementation Efficient real-time graph processing requires moving through graph without index lookups Referred to as index-free adjacency In RDBMS, indexes allow logical key values to be translated to a physical addresses Typically, three or four logical IO operations are required to traverse a B-Tree index Plus another lookup to retrieve the value Can be cached, but usually some disk IO required In a native graph database utilizing index-free adjacency, each node knows the physical location of all adjacent nodes So no need to use indexes to efficiently navigate the graph
Graph compute engines: Add graph interface on other models Implements efficient graph processing algorithms Exposes graph APIs Doesn t necessarily store data in index free adjacency graph Usually designed for batch processing of most or all of a graph Significant examples: Apache Giraph: graph processing on Hadoop using MapReduce GraphX: graph processing part in Spark (part of Berkeley Data Analytic Stack) Titan: graph processing on Big Data storage engines like Hbase and Cassandra
Strengths Graph databases Strengths and weaknesses Joins are precomputed Flexible schema Fast and scalable Weaknesses Harder to execute queries not embodied in relationships
Document databases Non-relational database that stores data as structured documents Usually XML or JSON formats Doesn t imply anything specific beyond the document storage model Could implement ACID transactions, etc Most provide modest transactional support Try to remove object-relational impedance mismatch Easy to incorporate media in documents
Document databases JSON-based document databases flourished 2008 because Address the conflict between object-oriented programming and the relational database model Self-describing document formats could be interrogated independently of the program that had created them Aligned well with the dominant web-based programming paradigms
JSON databases Not a single specification, just store data in JSON format, but usually Basic unit of storage is the Document a row in an RDBMS One or more key-value pairs May also contain nested documents and arrays Arrays may also contain documents, creating complex hierarchical structure A collection or bucket is a set of documents sharing some common purpose a relational table Documents in a collection don t have to be the same type Could implement 3 rd normal form schema But usually model data in a smaller number of collections With nested documents representing master-detail relationships
JSON format
Example JSON structure {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }}
Example JSON database Relational schema
Example JSON database Relational schema Example JSON database Document embedding Mirrors OO design
Example JSON database Could instead organize by linking to list of actor documents Document linking
Example JSON database Could instead organize by linking to list of actor documents Another option, closer to 3 rd normal form Less natural for document DB because of lack of joins Document linking
Example document DB: MongoDB Query with JavaScript
Example document DB: MongoDB Query with JavaScript
Example document DB: MongoDB Query with JavaScript
Document database summary Like a key-value store, but with self-documenting values Like a traditional RDBMS, but more scalable, less mis-match with object-oriented programming Many types of databases are adding support for JSON, so a JSON document database might soon be feature of other databases instead of a distinct type of database
Strengths Document databases Strengths and weaknesses Self-documenting schema Easier for non-programmers to query Can offer availability instead of strict consistency Weaknesses Typically weak transactional support Joins implemented in application code
Column databases RDBMs is tuned for OLTP: OnLine Transaction Processing Data in relations are organized by row (tuple) All data in a row are stored together Data warehouses used for analytics present a different workload: OLAP: OnLine Analytic Processing Aggregate information over many records to provide insight into trends Organizing relations by column has several advantages
Data warehousing In the 1970s, OLTP happened in real time, reports were generated in batches overnight In the 1980s and 90s, RDBMs started to be used for generating reports in real-time in parallel to the OLTP systems Known as data warehouses Star schema de-normalizes data somewhat to provide real-time responses
Star Schema For RDBMS data warehouse Large central fact table Measurements or metrics of each event Smaller dimension tables Attributes to describe facts Still CPU and I/O intensive to use
Instead: Column Databases Store relations by column
Column database advantages Two main advantages Acceleration of aggregation queries Because data are stored together Better data compression Because similar data are stored together
Column database disadvantages Slower to insert RDBMS requires one IO to insert a row Column database could require one per column Inserts are generally batched across a number of rows
C-store is a column store with a row-based cache Column store is bulk-loaded periodically Delta-store is updated in real-time Queries combine results from both Company Vertica commercialized it
Data are stored as Projections Columns that are accessed together can be stored together Workload-dependent Can be created manually Or on the fly by the query optimizer Or in bulk based on historical workloads Like indexes in RDBMS
Column store summary Designed for data warehousing and analytics Re-organize data for compression and aggregation Can be added on as feature to other DB systems Significant component in some in-memory DBs For analytics applications (SAP HANA) For multimedia, store metadata in column store, media in key-value store
Strengths Column store Strengths and weaknesses Fast aggregations Efficient compression Weaknesses Vanilla model expensive to insert one row
Summary NoSQL architectures utilize different tradeoffs for different workloads Document databases for better fit with objectoriented code Graph databases for modeling relationships between things Column stores for efficient analytics