Introduction to NoSQL Databases
Roman Kern (KTI, TU Graz)
Dbase2, 2017-10-16

Introduction
Why NoSQL?

The birth of NoSQL
- Term appeared in its current sense in 2009
- Usually read as "Not only SQL"
- Common properties (pros)
  - Non-relational
  - Schema-less (schema-free)
  - Good scalability
- Potential downsides (cons)
  - Limited query abilities
  - Not standardised (evolving technology)

Motivations for starting NoSQL
1. Growth of data
   - User-generated
   - Machine-generated, e.g. log files, sensors
   - Higher degree of connectedness
2. Need for flexibility instead of a rigid schema
   - For semi-structured data (schema-free / schema-less)
3. No separation of data management and data processing

Data Management vs Data Processing
- Classic CRUD operations are no longer sufficient for advanced data analytics
  - Need to combine both functionalities
- Paradigm shift: bring the code to the data
  - i.e. the locality of the data is taken into consideration for the data processing
- Example applications:
  - Online transaction processing (OLTP) -> relational databases
  - Online analytical processing (OLAP) -> data warehousing
  - High performance, scalability -> NoSQL

Scalability
- Scale up (scale vertically) vs scale out (scale horizontally)
  - Scale up: add more hardware to a single machine
  - Scale out: add more machines
- Degree of sharing
  - Shared memory (single machine, single storage)
  - Shared disk (multiple machines, single storage)
  - Shared nothing (multiple machines, multiple storage)

Replication
- In a distributed system, data is replicated between nodes
  - thus data is stored multiple times
- Types of replication
  1. Synchronous (eager)
     - All data is replicated to all nodes before the operation completes
     - Complex, even impossible in some configurations
  2. Asynchronous (lazy)
     - The operation completes before all nodes have written the data
     - Potentially inconsistent
- Write access options
  1. A single node accepts writes (master/slave, primary copy)
  2. All nodes accept writes (update anywhere)

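The difference between eager and lazy replication can be illustrated with a minimal sketch, assuming a primary-copy setup with plain in-memory dictionaries as storage. The names `Node`, `Primary`, `write_eager`, `write_lazy` and `flush` are hypothetical and not taken from any real system.

```python
from collections import deque


class Node:
    """A storage node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class Primary(Node):
    """Primary copy: the only node that accepts writes."""

    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas
        self.pending = deque()  # writes not yet shipped to the replicas

    def write_eager(self, key, value):
        # Synchronous: every replica is updated before the call returns.
        self.data[key] = value
        for replica in self.replicas:
            replica.data[key] = value

    def write_lazy(self, key, value):
        # Asynchronous: return after the local write, replicate later.
        self.data[key] = value
        self.pending.append((key, value))

    def flush(self):
        # Ship queued writes to the replicas in their original order.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Node()
primary = Primary([replica])

primary.write_eager("a", 1)
eager_read = replica.data.get("a")   # replica already has the value

primary.write_lazy("b", 2)
stale_read = replica.data.get("b")   # replica has not seen the write yet
primary.flush()
fresh_read = replica.data.get("b")   # replica has caught up
```

Between `write_lazy` and `flush` a read from the replica misses the new value, which is the "potentially inconsistent" window named on the slide.
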
Sharding
- In a distributed system, each node may be responsible for a different part of the full data
  - Data is still replicated for redundancy
- Also known as: partitioning, fragmentation
- Advantage: improved efficiency (each node handles only part of the data and needs fewer resources)
- Types of sharding:
  1. Hash-based
     - A hash of the key determines the partition
     - No data locality
  2. Range-based
     - Assigns key ranges (binning) to partitions
     - Rebalancing needed
  3. Entity-group
     - All data touched by a single transaction is assigned to a single partition
     - Partitions cannot easily change

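Hash-based and range-based sharding can be compared with a small sketch. The cluster size `NUM_SHARDS`, the range boundaries in `RANGE_BOUNDS` and both function names are assumptions chosen for illustration only.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def hash_shard(key: str) -> int:
    """Hash-based: the hash of the key picks the shard (no data locality)."""
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Hypothetical split points: shard 0 holds keys < "g", shard 1 keys < "n",
# shard 2 keys < "t", and shard 3 everything else.
RANGE_BOUNDS = ["g", "n", "t"]


def range_shard(key: str) -> int:
    """Range-based: neighbouring keys land on the same shard."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


range_a = range_shard("apple")
range_b = range_shard("apricot")
range_z = range_shard("zebra")
hash_a = hash_shard("apple")
```

With range-based sharding, "apple" and "apricot" share a shard, so a scan over a key range touches few nodes, but skewed key distributions require moving the boundaries (rebalancing). With hash-based sharding the load spreads evenly, at the cost of locality.
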
ACID vs BASE
- ACID
  - Atomicity
  - Consistency
  - Isolation
  - Durability
- BASE
  - Basically Available
  - Soft state
  - Eventually consistent
- Trade-offs for improved performance
  - Some database systems prefer performance over durability
  - Redundancy for improved performance (no normalisation)

CAP theorem
- Not possible to achieve all three properties at the same time:
  - Consistency: reads are guaranteed to incorporate all previous writes (all nodes see the same data at the same time)
  - Availability: every query returns an answer instead of an error (failures do not prevent the remaining system from operating)
  - Partition tolerance: the system keeps running even if a part of the system is not reachable (e.g. due to network failure, message loss)
- Implications of CAP
  - One needs to find a trade-off between the properties, e.g. choose availability over consistency (as consistency is a major bottleneck for scalability)

Classification scheme of NoSQL systems
1. According to the data model
   - Key-value
   - Tabular (wide column)
   - Document
   - Graph
   - Specialised, e.g. time series, triples, objects, XML, files, ...
2. According to the CAP trade-off
   - Available & partition tolerant (AP)
   - Consistent & partition tolerant (CP)
   - Not partition tolerant (CA)
3. According to the replication & sharding types
   - Lazy vs eager
   - Hash-based vs range-based vs entity-group

Systems
What types of NoSQL systems are out there?

Distributed File System
- Data model: folders & files (plus metadata, e.g. time of creation, ...)
- Interface: file system operations
- Variations
  - Network file system: (often) single storage
  - Cluster file systems: (multiple) storage
  - Distributed file systems: multiple, independent storage
- Examples: NFS, GPFS, HDFS

Key/Value Store
- Data model: key -> value
  - where the value is an opaque (binary) blob
  - similar to hash tables
- Interface: CRUD operations
- Properties
  - Excellent scalability
  - May support redundant storage
- Examples: Amazon Dynamo (AP, lazy, hash-based), Redis (CP, lazy, hash-based), Riak (AP, lazy, hash-based), Memcached (CP), ...

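The key/value data model and its CRUD interface can be sketched in a few lines. This is an in-memory stand-in, not the API of any of the systems listed; the class `KeyValueStore` and its method names are hypothetical.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key/value store; values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def create(self, key: str, value: bytes) -> None:
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def read(self, key: str) -> bytes:
        return self._data[key]

    def update(self, key: str, value: bytes) -> None:
        if key not in self._data:
            raise KeyError(f"no such key: {key}")
        self._data[key] = value

    def delete(self, key: str) -> None:
        del self._data[key]


store = KeyValueStore()
# The store never looks inside the value; the client serialises it.
store.create("user:1", json.dumps({"name": "Ann"}).encode("utf-8"))
store.update("user:1", json.dumps({"name": "Anna"}).encode("utf-8"))
user = json.loads(store.read("user:1"))
store.delete("user:1")
```

Because the value is opaque, changing one field means reading, deserialising, modifying and rewriting the whole blob on the client side.
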
Tabular / Wide Column
- Data model: (row key, column, timestamp) -> value
  - where the value is an opaque (binary) blob
- Interface: CRUD operations, scan operations
- Properties
  - Allow vertical and horizontal partitioning
    - Adjacent rows are stored close to each other
    - Certain columns are stored close to each other, e.g. via column families
  - Each cell might have multiple versions (timestamps)
- Examples: Cassandra (AP, lazy, hash-based), Google BigTable (CP, eager, range-based), HBase (CP, eager, range-based), Parquet, ...

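The (row key, column, timestamp) -> value model with versioned cells and row scans can be sketched as follows. This is a simplified in-memory illustration, not how Cassandra, BigTable or HBase are implemented; `WideColumnTable` and its methods are hypothetical names.

```python
import bisect


class WideColumnTable:
    """Cells are addressed by (row key, column) and keep several versions."""

    def __init__(self):
        # (row_key, column) -> list of (timestamp, value), sorted by timestamp
        self.cells = {}

    def put(self, row_key, column, timestamp, value):
        versions = self.cells.setdefault((row_key, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row_key, column, as_of=None):
        """Return the newest value, or the newest one not after `as_of`."""
        versions = self.cells.get((row_key, column), [])
        if as_of is not None:
            versions = [v for v in versions if v[0] <= as_of]
        return versions[-1][1] if versions else None

    def scan(self, start_row, end_row):
        """Yield the latest cells of all rows in [start_row, end_row)."""
        for row_key, column in sorted(self.cells):
            if start_row <= row_key < end_row:
                yield row_key, column, self.get(row_key, column)


table = WideColumnTable()
table.put("user1", "info:name", 1, b"Ann")
table.put("user1", "info:name", 2, b"Anna")
table.put("user2", "info:name", 1, b"Bob")
table.put("zoe", "info:name", 1, b"Zoe")

latest = table.get("user1", "info:name")
older = table.get("user1", "info:name", as_of=1)
users = list(table.scan("user", "user~"))
```

The `info:` prefix mimics a column family; a real system would store the columns of one family physically close together.
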
Example of Cassandra Query Language

Document Storage
- Data model: (collection, key) -> value
  - where the value is understood by the system
- Interface: CRUD operations, specialised queries (e.g. JavaScript)
- Properties
  - Documents are schema-free, i.e. no need for schema migrations
  - Documents may also be versioned
  - Documents are often JSON
- Examples: CouchDB (AP, lazy), MongoDB (CP, lazy/eager, range-based), Amazon SimpleDB (AP), Cloudant, RethinkDB (lazy/eager, range-based), ...

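In contrast to a key/value store, a document store understands the value and can address parts of it. A minimal sketch, assuming JSON-like documents held as Python dictionaries; `DocumentStore`, `get_field` and `find` are hypothetical names, not the API of any listed system.

```python
class DocumentStore:
    """(collection, key) -> document, where the store can look inside."""

    def __init__(self):
        self.collections = {}  # collection name -> {key: document}

    def put(self, collection, key, document):
        self.collections.setdefault(collection, {})[key] = document

    def get_field(self, collection, key, path):
        """Return one nested field, addressed by a dotted path."""
        value = self.collections[collection][key]
        for part in path.split("."):
            value = value[part]
        return value

    def find(self, collection, **criteria):
        """Return the keys of documents whose top-level fields match."""
        docs = self.collections.get(collection, {})
        return [
            key
            for key, doc in docs.items()
            if all(doc.get(field) == wanted for field, wanted in criteria.items())
        ]


db = DocumentStore()
# Documents in one collection need not share a schema.
db.put("users", "u1", {"name": "Ann", "address": {"city": "Graz"}})
db.put("users", "u2", {"name": "Bob", "tags": ["admin"]})

city = db.get_field("users", "u1", "address.city")
matches = db.find("users", name="Bob")
```
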
Key/Value Store vs Document Storage vs Tabular Storage
- Key/value store, if requirements are simple
- Document store, if parts of the value need to be accessed
- Document store, if documents are independent units
- Tabular store, if multiple entries (e.g. rows) are updated at the same time
- Tabular store, if only certain columns need to be retrieved
- Things to watch out for
  - The maximum size of a value depends on the actual implementation
  - Avoid joins for optimal performance

Consistency vs Availability vs Partition Tolerance
- See also: http://blog.nahurst.com/visual-guide-to-nosql-systems

Graph Storage
- Data model: G = (V, E)
  - where each vertex or edge may have additional properties
- Interface: graph traversals, specialised queries & insert/update methods
- Properties
  - Optimised for graph traversal, i.e. no joins needed
  - Types of edges can be specified by the user
- Examples: Neo4j (CA), OrientDB (CA), TitanDB, Giraph, InfiniteGraph (CA), ...

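The property-graph model G = (V, E) and a traversal along user-defined edge types can be sketched as below. Adjacency lists replace the joins a relational system would need. `PropertyGraph` and its methods are hypothetical names and do not correspond to the API of any listed system.

```python
from collections import deque


class PropertyGraph:
    """Vertices and typed edges, both carrying arbitrary properties."""

    def __init__(self):
        self.vertices = {}   # vertex id -> properties
        self.adjacency = {}  # vertex id -> list of (edge type, target, properties)

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties
        self.adjacency.setdefault(vertex_id, [])

    def add_edge(self, source, edge_type, target, **properties):
        self.adjacency.setdefault(source, []).append((edge_type, target, properties))

    def traverse(self, start, edge_type, max_depth):
        """Breadth-first traversal along one edge type; returns {vertex: depth}."""
        depth = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if depth[current] == max_depth:
                continue  # do not expand beyond the requested depth
            for etype, target, _ in self.adjacency.get(current, []):
                if etype == edge_type and target not in depth:
                    depth[target] = depth[current] + 1
                    queue.append(target)
        return depth


graph = PropertyGraph()
for name in ("alice", "bob", "carol", "pizza"):
    graph.add_vertex(name)
graph.add_edge("alice", "knows", "bob", since=2015)
graph.add_edge("bob", "knows", "carol", since=2016)
graph.add_edge("alice", "likes", "pizza")

friends_of_friends = graph.traverse("alice", "knows", max_depth=2)
direct_friends = graph.traverse("alice", "knows", max_depth=1)
```
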
Search Storage
- Data model: documents, metadata
  - often stored as a vector space model
- Interface: specialised query languages
- Properties
  - Documents may consist of multiple fields (facets)
    - Fields may be structured as well, e.g. date, integer, string
  - Fine control over the indexing process, i.e. how each field is indexed
- Examples: Solr, Elasticsearch, ...

Object Oriented Storage
- Data model: classes, objects, relations
- Interface: CRUD, traversal methods
- Properties
  - Known model from OO programming
  - Often strong coupling between DB system and programming language
- Examples: db4o (CA), Versant (CA), Objectivity (CA), ...

XML Databases
- Data model: XML, RDF (triples)
- Interface: CRUD, query languages (XQuery, SPARQL, ...)
- Properties
  - RDF-based systems are often called triple stores
  - Often used in combination with semantic technologies
- Examples: BaseX, MarkLogic (CA), AllegroGraph (CA), BigData, ...

Timeseries Databases
- Data model: (timestamp) -> value
- Interface: CRUD, specialised query languages
- Variations
  - Type of value is the same for all entries, typically simple, e.g. floating point number
  - Complex value type, e.g. JSON
- Properties
  - Optimised for time series data, i.e. small storage requirements
  - Query for time ranges
  - Operations on time series
- Examples: InfluxDB, KairosDB, ...

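Time-range queries and simple operations on a series can be sketched with a sorted list and binary search. This assumes the simple variation with one floating point value per timestamp; `TimeSeries` and its methods are hypothetical names, not the API of InfluxDB or KairosDB.

```python
import bisect


class TimeSeries:
    """(timestamp) -> value, kept sorted by timestamp."""

    def __init__(self):
        self.timestamps = []  # always sorted ascending
        self.values = []      # values[i] belongs to timestamps[i]

    def insert(self, timestamp, value):
        # Binary search keeps the series sorted even for late arrivals.
        index = bisect.bisect_right(self.timestamps, timestamp)
        self.timestamps.insert(index, timestamp)
        self.values.insert(index, value)

    def range(self, start, end):
        """Return all points with start <= timestamp < end."""
        low = bisect.bisect_left(self.timestamps, start)
        high = bisect.bisect_left(self.timestamps, end)
        return list(zip(self.timestamps[low:high], self.values[low:high]))

    def mean(self, start, end):
        """Average value in the time range, or None if it is empty."""
        points = self.range(start, end)
        if not points:
            return None
        return sum(value for _, value in points) / len(points)


series = TimeSeries()
series.insert(10, 1.0)
series.insert(30, 3.0)
series.insert(20, 2.0)  # arrives out of order

window = series.range(10, 30)
average = series.mean(10, 31)
```
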
In-Memory Databases
- Data model: (key) -> value
  - but not limited to this model
- Interface: CRUD, specialised query languages
- Properties
  - Data is stored in RAM
  - Often distributed over multiple machines ("RAM is the new disk")
  - In its purest form, does not satisfy the durability criterion
- Examples: Hazelcast, Redis, SAP HANA, ...

API & Data Formats
- NoSQL systems often use RESTful APIs
  - Direct match with data model and CRUD operations (create, read, update, delete map onto HTTP POST, GET, PUT, DELETE)
- Serialisation of objects
  - Many techniques used, e.g. Apache Avro, Protocol Buffers, ...

Features
- Not all NoSQL systems support transactions
  - Instead, they support atomic single operations
  - Therefore not all operations are supported
- Not all NoSQL systems support security features
  - e.g. access control

Cloud Database Solutions
- Storage on the internet (cloud)
- DBaaS: Database as a Service
- Not limited to NoSQL; traditional SQL databases are available as well
- Multi-tenancy as an important feature (separation of multiple clients)
  - Private OS: all separate (e.g. Amazon RDS)
  - Private process: same machine (e.g. Compose)
  - Private schema: same database (e.g. Google Datastore)
  - Shared schema: same tables (most SaaS apps)

Current State
- Current state of data storage systems
  - Depending on the actual requirements, select a suitable storage solution
  - Or select multiple solutions, one for each sub-system -> polyglot persistence

Future Outlook: NewSQL
- Attempt to achieve both consistency and availability for distributed systems
- Examples: Google Spanner, CockroachDB
  - CockroachDB builds on the Raft consensus algorithm (https://github.com/cockroachdb/cockroach)
  - Google Spanner relies on specialised hardware (GPS receivers and atomic clocks for time synchronisation)

The End
- Next: Graph Databases
- Credits: Scalable Data Management: NoSQL Data Stores in Research and Practice, http://icde2016.fi/tutorials.php