COSC 6397 Big Data Analytics
NoSQL databases
Edgar Gabriel
Spring 2017

Relational databases
- Long-lasting industry standard for storing data persistently
- Key points: concurrency control, transactions, standard interfaces to access data
- Act as an integration point between different applications
Problems with relational databases
- Mismatch between the relational data structures and the in-memory data structures of applications
  - Everything is mapped to tables
  - All entries within a table must have the same schema
    - Generic fields with very different meanings for different entries are often used to overcome this limitation
  - No way to include more complex structures, e.g. nested records, lists, etc.
- Relational databases were not designed (with very few exceptions) to run efficiently on clusters

ACID
- Relational databases often provide properties summarized as ACID
  - Atomicity: once a transaction is started, it is either completed or undone (rollback)
  - Consistency: guarantees that a transaction never leaves the database in a half-finished state
  - Isolation: keeps transactions separated from each other until they're finished
  - Durability: guarantees that the database keeps track of pending changes in such a way that the server can recover from an abnormal termination
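The atomicity property above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` function, and the balances are hypothetical examples that do not come from the slides; the point is only that either both updates of a transaction take effect or neither does.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one atomic transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


# Hypothetical example data, not from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)         # both updates are committed together
rejected = transfer(conn, 1, 2, 500)  # both updates are rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

After the second call the balances are unchanged from the state left by the first call, because the failed transaction was undone as a whole.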
RDBMS data management
- Typical RDBMS representation of a purchasing system
- A transaction is an update to (multiple) tables as a single, atomic operation

Customer
ID  Name    Firstname
1   Martin  John

Order
ID  CustomerId  ShippingAddr  BillingAddress
99  1           77            28

Address
Id  City     Street        Number
77  Houston  Calhoun Blvd  4800

Aggregate Model
- Aggregate Model representation

//in orders
{
  Id : 99,
  Customer : {
    Id : 1,
    FirstName : John,
    LastName : Martin
  },
  BillingAddress : {
    Id : 77,
    City : Houston,
    Street : Calhoun Blvd,
    Number : 4800
  }
}
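As a rough sketch of what handling the order aggregate as a single unit might look like in application code, the example below writes the aggregate as a nested Python dict and stores it as one serialized document. The `orders` dict standing in for the data store is an assumption made for illustration and does not model any particular database.

```python
import json

# The order aggregate from the slide, written as a nested Python dict.
order = {
    "Id": 99,
    "Customer": {"Id": 1, "FirstName": "John", "LastName": "Martin"},
    "BillingAddress": {
        "Id": 77,
        "City": "Houston",
        "Street": "Calhoun Blvd",
        "Number": 4800,
    },
}

# Hypothetical aggregate store: one serialized document per order id.
orders = {}
orders[order["Id"]] = json.dumps(order)

# A single lookup returns the whole aggregate; no joins across tables.
loaded = json.loads(orders[99])
city = loaded["BillingAddress"]["City"]
```

Customer and address data travel with the order, which is what makes the aggregate a natural unit for storage and distribution.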
Aggregate Data Model
- An aggregate is a collection of data managed as a single unit
  - Aggregates form the boundaries for ACID operations with the database
- Drawing boundaries on how much information to include in a single aggregate is domain and problem specific
- Aggregate-oriented databases work best when most data interaction is done with the same aggregate
- Atomic updates are typically supported only within a single aggregate
- Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra-aggregate relationships

NoSQL databases
- Loosely defined term covering various classes of non-relational data storage systems
- Typically don't rely (exclusively) on SQL
- Open source projects
- Usually do not require a fixed table schema
- Designed to run on clustered environments
- Relax one or more of the ACID properties
Sharding
- Distributes data across multiple servers
- Each server acts as the single source for a subset of the data
- Aggregate data models allow an entire aggregate to be stored on a single server
- Scales well for both reads and writes
- Data distribution
  - Automatic: e.g. hash functions, lexicographic order, etc.
  - User defined

Availability
- Traditionally thought of as the server/process being available "five 9s" (99.999 %) of the time
- However, in a system with many nodes, at almost any point in time there's a good chance that a node is down or that there is a network disruption among the nodes
- Want a system that is resilient in the face of network disruption
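The automatic, hash-based data distribution listed under Sharding could be sketched as follows. The names `shard_for` and `ShardedStore` are hypothetical, each shard is just an in-memory dict standing in for a server, and real systems typically use more elaborate schemes so that adding a server does not move most keys.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store in which each dict plays the role of one server."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, aggregate):
        # The whole aggregate is stored on exactly one shard.
        self.shards[shard_for(key, len(self.shards))][key] = aggregate

    def get(self, key):
        # The same hash routes the read to the shard that holds the key.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
store.put("order:99", {"Id": 99, "Customer": "Martin"})
found = store.get("order:99")
holders = [i for i, shard in enumerate(store.shards) if "order:99" in shard]
```

Because the shard index depends only on the key, reads and writes for one aggregate always go to the same server, while different aggregates spread across all servers.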
Availability
- Master-slave replication
  - All writes go to the master
  - All reads are performed against the replicated slave databases
  - Critical reads may be incorrect, as writes may not yet have been propagated to the slaves
  - Large data sets can pose problems, as the master needs to duplicate data to the slaves

Availability
- Peer-to-peer replication
  - Hold a copy of the data on multiple servers
  - Nodes coordinate synchronization of the data internally
  - Removes the single point of failure of master-slave replication
  - Improves write-load performance
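The stale-read problem of master-slave replication can be made concrete with a small simulation. The `Master` and `Slave` classes and the explicit `replicate` step are assumptions made for illustration; in a real system propagation happens asynchronously over the network.

```python
class Slave:
    """Read-only replica that serves whatever it has received so far."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master:
    """Accepts all writes and ships them to the slaves later."""

    def __init__(self, slaves):
        self.data = {}
        self.pending = []  # writes not yet propagated to the slaves
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Propagate all pending writes to every slave, in write order.
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


slave = Slave()
master = Master([slave])

master.write("stock", 5)
stale = slave.read("stock")  # the write has not reached the slave yet
master.replicate()
fresh = slave.read("stock")  # now the slave sees the write
```

The time between `write` and `replicate` is the window in which a read against the slave returns outdated data.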
Consistency
- Consistency model determines rules for visibility and apparent order of updates
- Write-write conflict: concurrent update of the same entry
  - Optimistic approach: detects and reports a conflict, but does not prevent it
    - Conditional update: test the value to be modified before the update to see whether it has changed
    - Perform and log both updates and report a conflict (e.g. as done by revision control software)
  - Pessimistic approach: prevents conflicts
    - E.g. using locks to serialize access to an entry
- Read-write conflict: accessing an element that was modified by another client
  - Not trivial with replication

Consistency
- Sequential consistency: the result of any execution is the same as if the operations of all clients were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
  - Often comes at significant cost in the case of sharding and replication
- Replication consistency: ensure that data has the same value across all replicas
  - Inconsistency window: length of time an inconsistency is present
  - Also referred to as eventual consistency: all replicas will eventually be updated, but there might be an inconsistency window
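The conditional update used by the optimistic approach to write-write conflicts might look like the sketch below. The `Store` class and the seat-booking data are hypothetical; the sketch is single-threaded, whereas a real store would have to make the test and the write one atomic step.

```python
class Store:
    """Toy single-node store offering a conditional update."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key)

    def conditional_update(self, key, expected, new_value):
        """Write `new_value` only if the entry still holds `expected`.

        Returns True on success and False if a conflict was detected.
        """
        if self._data.get(key) != expected:
            return False  # someone else changed the entry in the meantime
        self._data[key] = new_value
        return True


store = Store()
store.conditional_update("seat_12A", None, "free")

# Two clients read the same value before either of them writes.
seen_by_a = store.read("seat_12A")
seen_by_b = store.read("seat_12A")

a_ok = store.conditional_update("seat_12A", seen_by_a, "alice")
b_ok = store.conditional_update("seat_12A", seen_by_b, "bob")  # conflict
final = store.read("seat_12A")
```

The second client is told that its update failed and can re-read the entry and decide how to proceed, which is the "detect and report" behaviour described above.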
Consistency
- Session consistency: within a single user session, your own writes are immediately visible
  - Sticky session: ensure that both reads and writes are handled by a single server
  - Version stamps: ensure that every interaction with the data store includes the latest version stamp seen by that session
- Consistency within a single aggregate is typically ensured by NoSQL databases
  - Often the driving force in determining what belongs in a single aggregate

NoSQL Databases
- Key-value stores: everything is stored as a key-value pair
  - The value is considered a blob without internal structure
  - Lookup and retrieval are based purely on the key
  - Examples:
    - Memcached
    - Redis
    - Riak
    - Project Voldemort
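One way to picture session consistency via version stamps is the sketch below. The `Replica` and `Session` classes, the single global version counter, and the cart data are all assumptions made for illustration: a session remembers the newest version stamp it has seen and only accepts answers from replicas that are at least that recent.

```python
class Replica:
    """One copy of the data plus the version stamp of its latest update."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, version, key, value):
        self.data[key] = value
        self.version = version


class Session:
    """Client session that carries the latest version stamp it has seen."""

    def __init__(self):
        self.last_seen = 0

    def write(self, primary, key, value):
        version = primary.version + 1
        primary.apply(version, key, value)
        self.last_seen = version

    def read(self, replicas, key):
        # Use the first replica that has caught up with this session.
        for replica in replicas:
            if replica.version >= self.last_seen:
                self.last_seen = replica.version
                return replica.data.get(key)
        raise RuntimeError("no replica has caught up with this session yet")


primary, lagging = Replica(), Replica()
writer, other = Session(), Session()
writer.write(primary, "cart", ["book"])

own_read = writer.read([lagging, primary], "cart")   # skips the lagging replica
other_read = other.read([lagging, primary], "cart")  # may still be stale
```

The writing session always sees its own write, while a different session may still observe the older state until the lagging replica is updated.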
NoSQL Databases
- Document databases:
  - The data aggregate (value) has a structure (e.g. text) and can be used for query operations
  - The boundary between key-value stores and document databases is not always clear-cut
  - Examples:
    - MongoDB
    - CouchDB
    - OrientDB
    - Terrastore

NoSQL Databases
- Column-family stores:
  - Optimize scenarios where only a subset of the entries in a table is required for a query/analysis
  - Store data based on columns, not rows
  - Assume that data is read significantly more often than written
  - Examples:
    - HBase
    - Cassandra
    - Hypertable
    - Amazon SimpleDB
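A very simplified picture of a column-family store is sketched below: data is grouped by column family first and by row key second, so a query that needs only one family never touches the others. The `ColumnFamilyStore` class, the family names, and the amounts are hypothetical and do not reflect the storage format of any of the listed systems.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy store laid out as family -> row key -> {column: value}."""

    def __init__(self):
        self._families = defaultdict(dict)

    def put(self, row_key, family, columns):
        # Add or overwrite columns of one row within one column family.
        self._families[family].setdefault(row_key, {}).update(columns)

    def scan_family(self, family):
        """Return all rows of a single column family."""
        return dict(self._families[family])


store = ColumnFamilyStore()
store.put("cust1", "profile", {"name": "Martin", "city": "Houston"})
store.put("cust1", "orders", {"order_99": 40})
store.put("cust2", "orders", {"order_100": 25})

# An analysis over order amounts reads only the "orders" family.
order_rows = store.scan_family("orders")
total = sum(sum(columns.values()) for columns in order_rows.values())
```

The profile data is never read by this query, which is the scenario such a layout is meant to optimize.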
NoSQL Databases
- Graph databases:
  - Organize data into nodes and edges of a graph
  - Allow capturing complex relations between data
  - Support querying along (selected) edges of the graph
  - Not well suited for sharding!
  - Examples:
    - Neo4j
    - FlockDB
    - HyperGraphDB
    - InfiniteGraph

Summary
- Increasing number of highly popular NoSQL databases
- Do not necessarily replace RDBMS, but serve a very specific purpose for a targeted application scenario
- Lots of literature available on the topic