Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 NoSQL Databases

Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour, Julian Busch
2016-2018 DBS

NoSQL Database Systems: Outline
- History
- Concepts
  - ACID
  - BASE
  - CAP
- Data Models
  - Key-Value Stores
  - Document Databases
  - Wide Column Stores
  - Graph Databases

History
- 1960s: IBM developed the hierarchical database model
  - Tree-like structure
  - Data stored as records connected by links
  - Supports only one-to-one and one-to-many relationships
- Mid-1980s: Rise of the relational database model
  - Data stored in a collection of tables (rows and columns)
  - Strict relational schema
  - SQL became the standard language (based on relational algebra)
  - Problem: impedance mismatch!

History: Impedance Mismatch
Object of type Supply:
  Supplier: LNR: L1, Lname: Meier, Status: 20, Sitz: Wetter
  Project:  PNR: P2, Pname: Pleite, Ort: Bonn
  Pieces:   TNR: T6, Tname: Schraube, Farbe: rot, Gewicht: 03, Menge: 700
Relational tables of the LTP schema:
  LNR | Lname | Status | Sitz
  TNR | Tname | Farbe | Gewicht
  PNR | Pname | Ort
  LNR | PNR | TNR | Menge
(German identifiers: Sitz = location, Ort = city, Farbe = color, rot = red, Gewicht = weight, Menge = quantity, Schraube = screw.)
Given the LTP schema from Datenbanksysteme I (Database Systems I) and an object of type Supply: how can the data bundled in the Supply object be incorporated into the DB?

History: Impedance Mismatch
The parts of the Supply object are inserted into separate tables:
  LNR | Lname | Status | Sitz
  L1  | Meier | 20     | Wetter

  TNR | Tname    | Farbe | Gewicht
  T6  | Schraube | rot   | 03

  PNR | Pname  | Ort
  P2  | Pleite | Bonn

  LNR | PNR | TNR | Menge
  (still empty)

  INSERT INTO L VALUES (Supply.getSupplier().getLNR(),...);
  INSERT INTO P VALUES (Supply.getProject().getPNR(),...);
  ...

History: Impedance Mismatch
Finally, the relationship itself is inserted into the LTP table:
  LNR | PNR | TNR | Menge
  L1  | P2  | T6  | 700

  INSERT INTO LTP VALUES (...);
Consequences:
- Object-oriented encapsulation vs. storing data distributed among several tables
- Lots of data type maintenance by the programmer
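The mismatch can be made concrete with a small sketch. It assumes an in-memory SQLite database, a table named T for the pieces, and Python dataclasses standing in for the Supply object; the class names, the placement of Menge on the Supply itself, and the helper names create_schema and store_supply are illustrative assumptions, not part of the lecture.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Supplier:
    lnr: str
    lname: str
    status: int
    sitz: str


@dataclass
class Project:
    pnr: str
    pname: str
    ort: str


@dataclass
class Piece:
    tnr: str
    tname: str
    farbe: str
    gewicht: str


@dataclass
class Supply:
    supplier: Supplier
    project: Project
    piece: Piece
    menge: int  # assumption: the quantity is an attribute of the Supply itself


def create_schema(conn):
    # Four tables of the LTP schema (table name T is an assumption).
    conn.executescript("""
        CREATE TABLE L (LNR TEXT PRIMARY KEY, Lname TEXT, Status INTEGER, Sitz TEXT);
        CREATE TABLE T (TNR TEXT PRIMARY KEY, Tname TEXT, Farbe TEXT, Gewicht TEXT);
        CREATE TABLE P (PNR TEXT PRIMARY KEY, Pname TEXT, Ort TEXT);
        CREATE TABLE LTP (LNR TEXT, PNR TEXT, TNR TEXT, Menge INTEGER);
    """)


def store_supply(conn, s):
    # One object in the program becomes four INSERT statements in the database.
    sup, pro, pie = s.supplier, s.project, s.piece
    with conn:
        conn.execute("INSERT INTO L VALUES (?, ?, ?, ?)",
                     (sup.lnr, sup.lname, sup.status, sup.sitz))
        conn.execute("INSERT INTO T VALUES (?, ?, ?, ?)",
                     (pie.tnr, pie.tname, pie.farbe, pie.gewicht))
        conn.execute("INSERT INTO P VALUES (?, ?, ?)",
                     (pro.pnr, pro.pname, pro.ort))
        conn.execute("INSERT INTO LTP VALUES (?, ?, ?, ?)",
                     (sup.lnr, pro.pnr, pie.tnr, s.menge))
```

The programmer has to take the object apart by hand and keep four statements and their column types in sync with the class definitions, which is the maintenance burden the slide refers to.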

History
- Mid-1990s: Trend of the object-relational database model
  - Data stored as objects (including data and methods)
  - Avoidance of object-relational mapping
  - Programmer-friendly
  - But relational databases still prevailed in the 1990s
- Mid-2000s: Rise of Web 2.0
  - Lots of user-generated data through web applications
  - Storage systems had to be scaled up

History: Approaches to scaling up storage systems
Two options for meeting the growing storage demand:
- Vertical scaling
  - Enlarge a single machine
  - Limited in space
  - Expensive
- Horizontal scaling
  - Use many commodity machines and form computer clusters or grids
  - Requires cluster maintenance

History
- Mid-2000s: Birth of the NoSQL movement
  - Problem of computer clusters: relational databases do not scale well horizontally
  - Big players like Google or Amazon developed their own storage systems: NoSQL ("Not only SQL") databases were born
- Today: Age of NoSQL
  - Several different NoSQL systems available (>225)

Characteristics of NoSQL Databases
There is no single definition, but NoSQL databases share some characteristics:
- Horizontal scalability (cluster-friendliness)
- Non-relational
- Distributed
- Schema-less
- Open-source (at least most of the systems)

About the Concepts behind NoSQL Databases: ACID
The holy grail of RDBMSs:
- Atomicity: Transactions happen entirely or not at all. If a transaction fails (even partly), the state of the database is unchanged.
- Consistency: Any transaction brings the database from one valid state to another and does not break any of the predefined rules (such as constraints).
- Isolation: Concurrent execution of transactions results in the system state that would be obtained if the transactions were executed serially.
- Durability: Once a transaction has been committed, it remains so.
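Atomicity and consistency can be observed in a few lines. The sketch below assumes an in-memory SQLite database with a hypothetical account table whose CHECK constraint forbids negative balances; the function names are illustrative. A transfer that would violate the constraint fails as a whole, so the first UPDATE is rolled back as well.

```python
import sqlite3


def make_bank():
    # Hypothetical schema: the CHECK constraint is the "predefined rule".
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))
```

Crediting the receiver first is deliberate: it shows that an already executed statement is undone when a later one in the same transaction fails.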

About the Concepts behind NoSQL Databases: BASE
An artificial concept for NoSQL databases:
- Basically Available: The system is generally available, but some data might not be available at any given time (e.g. due to node failures).
- Soft State: The system's state changes over time. Stale data may expire if not refreshed.
- Eventual Consistency: The system is not consistent at every moment. Updates are propagated through the system over time, so replicas converge once there has been enough time without new updates.
BASE sits at the opposite end from ACID on the consistency-availability spectrum.

About the Concepts behind NoSQL Databases: Levels of Consistency
[Figure: hierarchy of consistency levels: Eventual Consistency; Monotonic Read Consistency (M.R.C.) and Read-Your-Own-Writes (R.Y.O.W.); M.R.C. + R.Y.O.W.; Immediate Consistency; Strong Consistency; Transactions]

About the Concepts behind NoSQL Databases: Levels of Consistency
- Eventual Consistency: Write operations are not spread across all servers/partitions immediately.
- Monotonic Read Consistency: A client who has read an object once will never read an older version of this object.
- Read Your Own Writes: A client who has written an object will never read an older version of this object.
- Immediate Consistency: Updates are propagated immediately, but not atomically.

About the Concepts behind NoSQL Databases: Levels of Consistency
- Strong Consistency: Updates are propagated immediately, plus support for atomic operations on single data entities (usually on master nodes).
- Transactions: Full support of the ACID transaction model.
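The two client-centric levels, Monotonic Read Consistency and Read Your Own Writes, can be sketched with version numbers. This is a toy model under stated assumptions: each replica holds one object with an integer version, and a hypothetical Session object remembers the highest version the client has read or written and refuses answers from replicas that lag behind it. The class names are illustrative and not taken from any particular system.

```python
class StaleRead(Exception):
    """Raised when a replica is older than what this client has already seen."""


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Replicas only move forward; older updates are ignored.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    def __init__(self):
        self.last_seen = 0  # highest version read or written by this client

    def write(self, replica, value):
        version = replica.version + 1
        replica.apply(version, value)
        # Read Your Own Writes: remember the version we just produced.
        self.last_seen = max(self.last_seen, version)
        return version

    def read(self, replica):
        if replica.version < self.last_seen:
            raise StaleRead("replica lags behind this session")
        # Monotonic reads: never accept anything older than this from now on.
        self.last_seen = replica.version
        return replica.value
```

Under plain eventual consistency the stale replica would simply answer with the old value; the session check is what upgrades the guarantee for this one client.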

About the Concepts behind NoSQL Databases
[Figure: data sharding vs. data replication of documents]
The two types of consistency:
- Logical consistency: Data is consistent within itself (data integrity).
- Replication consistency: Data is consistent across multiple replicas (on multiple machines).

About the Concepts behind NoSQL Databases: Brewer's CAP Theorem
The three properties: Consistency, Availability, Partition Tolerance.
Any networked shared-data system can have at most two of the three desired properties!

About the Concepts behind NoSQL Databases: DB systems allowed by the CAP Theorem
- CP-Systems: Fully consistent and partitioned systems renounce availability. Only consistent nodes are available.
- AP-Systems: Fully available and partitioned systems renounce consistency. All nodes answer queries all the time, even if the answers are inconsistent.
- AC-Systems: Fully available and consistent systems renounce partitioning. Only possible if the system is not distributed.

Big Picture: CAP Theorem
- C (Consistency): All clients always have the same view of the data.
- A (Availability): Each client can always read and write.
- P (Partition tolerance): The system works well despite physical network partitions.

Big Picture: CAP Theorem
- C (ACID side): All clients always have the same view of the data.
- A (BASE side): Each client can always read and write.
- P: The system works well despite physical network partitions.
Resulting classes of systems:
- AC-Systems: RDBMSs (MySQL, Postgres, ...)
- CP-Systems
- AP-Systems

NoSQL Data Models
The four main NoSQL data models:
- Key/Value Stores
- Document Stores
- Wide Column Stores
- Graph Databases

NoSQL Data Models: Key/Value Stores
- Simplest form of database system
- Store key/value pairs and retrieve values by key
- Values can be of arbitrary format
[Figure: example key/value pairs with numeric labels 10213, 10334, 51, 10023]

NoSQL Data Models: Key/Value Stores
- Consistency models range from eventual consistency to serializability
- Some systems support ordering of keys, which enables efficient queries such as range queries
- Some systems keep data in memory, some use disks
- The available systems are very heterogeneous

NoSQL Data Models: Key/Value Stores - Redis
- In-memory data structure store with built-in replication, transactions and different levels of on-disk persistence
- Support of complex types like lists, sets, hashes, ...
- Support of many atomic operations
Example session:
  >> SET val 1
  >> GET val                  => 1
  >> INCR val                 => 2
  >> LPUSH my_list a          (=> a)
  >> LPUSH my_list b          (=> b, a)
  >> RPUSH my_list c          (=> b, a, c)
  >> LRANGE my_list 0 1       => b, a
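To make the semantics of these commands explicit, here is a toy in-memory model written in plain Python. It is not a Redis client and does not talk to a server; the class MiniRedis and its method names are hypothetical and only mirror the six commands used in the session above, including the fact that LRANGE treats its stop index as inclusive.

```python
class MiniRedis:
    """Toy model of a few Redis commands; illustration only."""

    def __init__(self):
        self._strings = {}
        self._lists = {}

    def set(self, key, value):
        self._strings[key] = str(value)  # string values are stored as text

    def get(self, key):
        return self._strings.get(key)

    def incr(self, key):
        # Interpret the stored string as an integer and add one.
        value = int(self._strings.get(key, "0")) + 1
        self._strings[key] = str(value)
        return value

    def lpush(self, key, item):
        self._lists.setdefault(key, []).insert(0, item)  # push at the head
        return len(self._lists[key])

    def rpush(self, key, item):
        self._lists.setdefault(key, []).append(item)  # push at the tail
        return len(self._lists[key])

    def lrange(self, key, start, stop):
        items = self._lists.get(key, [])
        n = len(items)
        if start < 0:
            start = max(n + start, 0)
        if stop < 0:
            stop = n + stop
            if stop < 0:
                return []
        return items[start:stop + 1]  # stop index is inclusive
```

Replaying the session against this model gives the same results as on the slide: the list ends up as b, a, c and the range 0 to 1 returns b, a.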

NoSQL Data Models: Key/Value Stores - The Redis cluster model
- Data is automatically sharded across nodes
- Some degree of availability, achieved by a master-slave architecture (but the cluster stops in the event of larger failures)
- Easily extendable

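NoSQL Data Models: Key/Value Stores - The Redis cluster model (resharding)
Hash slots are redistributed when nodes are added or removed:
  Initial cluster:         A: 0-5000   B: 5001-10000   C: 10001-14522
  After adding node D:     A: 0-4000   B: 4001-8000    C: 8001-12000   D: 12001-14522
  After removing a node:   A: 0-7500   B: 7501-14522
Note: the slot numbers are those of the slide's illustration; an actual Redis Cluster has 16384 hash slots (0-16383).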

NoSQL Data Models: Key/Value Stores - The Redis cluster model (failover)
Master nodes only:
  A: 0-5000   B: 5001-10000   C: 10001-14522
  If master B fails, hash slots 5001-10000 cannot be used anymore.
Master nodes with slave nodes holding replicated hash slots:
  Master A / Slave A: 0-5000   Master B / Slave B: 5001-10000   Master C / Slave C: 10001-14522
  If master B fails, slave node B is promoted to be the new master and hash slots 5001-10000 remain available.
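The slot mechanism can be sketched as follows. Redis Cluster maps a key to one of 16384 hash slots using a CRC16 of the key; the sketch uses Python's binascii.crc_hqx as a stand-in for that checksum, which is an assumption rather than a verified match with Redis. The even split in assign_slots and the helper names are also illustrative; a real cluster moves individual slots between nodes instead of recomputing all ranges.

```python
import binascii

NUM_SLOTS = 16384  # number of hash slots in a Redis Cluster


def key_slot(key: bytes, num_slots: int = NUM_SLOTS) -> int:
    # 16-bit CRC of the key, reduced to a slot number (stand-in checksum).
    return binascii.crc_hqx(key, 0) % num_slots


def assign_slots(nodes, num_slots=NUM_SLOTS):
    """Split the slot space into contiguous, nearly equal ranges per node."""
    base, extra = divmod(num_slots, len(nodes))
    ranges, start = {}, 0
    for i, node in enumerate(nodes):
        size = base + (1 if i < extra else 0)
        ranges[node] = (start, start + size - 1)
        start += size
    return ranges


def node_for_key(key: bytes, ranges, num_slots=NUM_SLOTS):
    slot = key_slot(key, num_slots)
    for node, (lo, hi) in ranges.items():
        if lo <= slot <= hi:
            return node
    raise KeyError(f"no node serves slot {slot}")


def fail_master(ranges, failed, replica_of):
    """Return the slot map after a master fails.

    replica_of maps a master name to its replica name. If the failed master
    has a replica, the replica takes over its slots; otherwise they are lost.
    """
    remaining = dict(ranges)
    slots = remaining.pop(failed)
    replica = replica_of.get(failed)
    if replica is not None:
        remaining[replica] = slots
    return remaining
```

Because keys are mapped to slots and only slots are mapped to nodes, adding or removing a node changes the slot-to-node table but never the key-to-slot function.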

Big Picture: CAP Theorem, with Key/Value Stores added
- AC-Systems: RDBMSs (MySQL, Postgres, ...)
- CP-Systems: Redis
- AP-Systems: Dynamo

NoSQL Data Models: Document Stores
- Store documents in the form of XML or JSON
- Semi-structured data records that do not have a homogeneous structure
- Fields can hold more than one value (arrays)
- Documents include internal structure, or metadata
- The data structure enables efficient use of indexes

NoSQL Data Models: Document Stores
Given the following text:
  Max Mustermann
  Musterstraße 12
  D-12345 Musterstadt
Stored as a document:
  <contact>
    <first_name>Max</first_name>
    <last_name>Mustermann</last_name>
    <street>Musterstraße 12</street>
    <city>Musterstadt</city>
    <zip>12345</zip>
    <country>D</country>
  </contact>
Example query: find all <contact>s where <zip> is 12345.
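The query over the internal structure of the document can be tried with Python's standard xml.etree.ElementTree module. The sketch assumes the contacts are wrapped in a hypothetical <contacts> root element, and the second contact is invented purely to have a non-matching record.

```python
import xml.etree.ElementTree as ET

# The <contacts> wrapper and the second contact are assumptions for the example.
CONTACTS_XML = """
<contacts>
  <contact>
    <first_name>Max</first_name>
    <last_name>Mustermann</last_name>
    <street>Musterstraße 12</street>
    <city>Musterstadt</city>
    <zip>12345</zip>
    <country>D</country>
  </contact>
  <contact>
    <first_name>Erika</first_name>
    <last_name>Musterfrau</last_name>
    <street>Beispielweg 3</street>
    <city>Beispielstadt</city>
    <zip>54321</zip>
    <country>D</country>
  </contact>
</contacts>
"""


def contacts_with_zip(xml_text, zip_code):
    """Return all <contact> elements whose <zip> child equals zip_code."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree's limited XPath supports the [child='text'] predicate.
    return root.findall(f"./contact[zip='{zip_code}']")
```

A document store answers this kind of query without a fixed table schema; an index on the zip field is what makes it efficient on large collections.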

NoSQL Data Models: Document Stores - MongoDB
- Data is stored as documents in a binary representation (BSON)
- Similarly structured documents are bundled in collections
- Provides its own ad-hoc query language
- Supports ACID transactions at the document level

NoSQL Data Models: Document Stores - MongoDB
Data management:
- Automatic data sharding
- Automatic re-balancing
- Multiple sharding policies:
  - Hash-based sharding: Documents are distributed according to an MD5 hash, which gives a uniform distribution
  - Range-based sharding: Documents with shard key values close to one another are likely to be co-located on the same shard, which works well for range queries
  - Location-based sharding: Documents are partitioned with respect to a user-specified configuration that associates shard key ranges with specific shards and hardware
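The difference between the first two policies can be sketched in a few lines. This is a simplified illustration and not MongoDB's actual implementation: hash_shard reduces an MD5 digest of the shard key modulo a fixed number of shards, and range_shard looks the key up in a sorted list of hypothetical split points.

```python
import bisect
import hashlib


def hash_shard(shard_key: str, num_shards: int) -> int:
    """Hash-based policy: an MD5 digest spreads keys evenly over shards."""
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(shard_key, split_points) -> int:
    """Range-based policy: neighboring keys land on the same shard.

    split_points is a sorted list; shard i holds the keys from
    split_points[i - 1] (inclusive) up to split_points[i] (exclusive).
    """
    return bisect.bisect_right(split_points, shard_key)
```

With split points 100 and 200, the keys 101 to 105 all map to the same shard under range_shard, so a range query touches one shard, while hash_shard would typically scatter them and has to ask every shard.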

NoSQL Data Models: Document Stores - MongoDB
Consistency & availability:
- Default: strong consistency (but configurable)
- Increased availability through replication
- Replica sets consist of one primary and multiple secondary members
- MongoDB applies writes on the primary and then records the operations in the primary's oplog
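The role of the oplog can be illustrated with a toy model. It assumes that secondaries catch up by replaying the primary's recorded operations in order; the classes Primary and Secondary and the single "set" operation are hypothetical simplifications and do not reflect MongoDB's real oplog format or replication protocol.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.oplog = []  # ordered record of applied operations

    def write(self, key, value):
        # Apply the write first, then record it in the oplog.
        self.data[key] = value
        self.oplog.append(("set", key, value))


class Secondary:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of oplog entries already replayed

    def sync(self, primary):
        # Replay only the entries this secondary has not seen yet.
        for op, key, value in primary.oplog[self.applied:]:
            if op == "set":
                self.data[key] = value
        self.applied = len(primary.oplog)
```

Until sync runs, a read from the secondary may return older data than the primary, which is why reading from the primary gives the stronger consistency.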

Big Picture: CAP Theorem, with Document Stores added
- AC-Systems: RDBMSs (MySQL, Postgres, ...)
- CP-Systems: Redis, MongoDB
- AP-Systems: Dynamo, CouchDB

NoSQL Data Models: Wide Column Stores
- Rows are identified by keys
- Rows can have different numbers of columns (up to millions)
- The order of rows depends on the key values (locality is important!)
- Multiple rows can be grouped into families (or tablets)
- Multiple families can be grouped into a key space

NoSQL Data Models: Wide Column Stores
Structure:
  Key Space
    Column Family
      Row Key: Column (Name, Value), Column (Name, Value), Column (Name, Value), ...
      Row Key: Column (Name, Value), Column (Name, Value), ...
    Column Family
      ...

NoSQL Data Models: Wide Column Stores
Example:
  Key Space: Edibles
    Column Family: Fruit
      Apple:  color = green, weight = 95, variety = Granny Smith
      Cherry: color = red
      Lemon:  color = yellow, weight = 50, origin = Egypt, flavor = sour
    Column Family: Vegetable
      Carrot: 2015-08-11 = 65, 2015-08-12 = 50, 2015-09-21 = 87
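The Edibles example maps naturally onto nested dictionaries. The sketch below is a toy model with hypothetical names (ColumnFamily, build_edibles); it shows that every row may carry its own set of columns and that keeping row keys sorted makes key-range scans cheap, which is the locality point made earlier.

```python
class ColumnFamily:
    """Toy column family: row key -> {column name -> value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # Missing rows or columns simply yield None; there is no fixed schema.
        return self.rows.get(row_key, {}).get(column)

    def scan(self, start, end):
        """Yield (row_key, columns) for start <= row_key < end in key order."""
        for key in sorted(self.rows):
            if start <= key < end:
                yield key, self.rows[key]


def build_edibles():
    fruit, vegetable = ColumnFamily(), ColumnFamily()
    fruit.put("Apple", "color", "green")
    fruit.put("Apple", "weight", 95)
    fruit.put("Apple", "variety", "Granny Smith")
    fruit.put("Cherry", "color", "red")
    fruit.put("Lemon", "color", "yellow")
    fruit.put("Lemon", "weight", 50)
    fruit.put("Lemon", "origin", "Egypt")
    fruit.put("Lemon", "flavor", "sour")
    # Column names can themselves carry data, here dates.
    vegetable.put("Carrot", "2015-08-11", 65)
    vegetable.put("Carrot", "2015-08-12", 50)
    vegetable.put("Carrot", "2015-09-21", 87)
    return {"Fruit": fruit, "Vegetable": vegetable}  # the key space "Edibles"
```

The Carrot row illustrates a common wide-column pattern: the column names are timestamps, so one row can grow to a very large number of columns.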

NoSQL Data Models: Wide Column Stores - Cassandra
- Developed by Facebook, Apache project since 2009
- Cluster architecture:
  - P2P system (nodes ordered as rings)
  - Each node plays the same role (decentralized)
  - Each node accepts read/write operations
- User access through nodes via the Cassandra Query Language (CQL)

NoSQL Data Models: Wide Column Stores - Consistency
- Tunable data consistency (choosable per operation)
- Read repair: if stale data is read, Cassandra issues a read repair, i.e. it finds the most up-to-date data and updates the stale data
- Generally: eventually consistent
- Main focus on availability!
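Tunable consistency and read repair can be sketched with timestamped replicas. This is a toy model, not Cassandra's implementation: the caller chooses which replicas a write or read contacts, the newest timestamp wins, and stale replicas among those contacted are repaired. The general rule it illustrates is that with N replicas, W replicas per write and R replicas per read, every read sees the latest write whenever R + W > N, because the two sets must overlap.

```python
class Replica:
    def __init__(self):
        self.timestamp = 0
        self.value = None

    def store(self, timestamp, value):
        # Last write wins: only newer timestamps overwrite.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value


def write(replicas, targets, timestamp, value):
    """Write to the replicas at the given indexes (len(targets) plays the role of W)."""
    for i in targets:
        replicas[i].store(timestamp, value)


def read(replicas, targets):
    """Read from the replicas at the given indexes (len(targets) plays the role of R)."""
    contacted = [replicas[i] for i in targets]
    newest = max(contacted, key=lambda r: r.timestamp)
    for r in contacted:
        # Read repair: bring stale replicas among those contacted up to date.
        r.store(newest.timestamp, newest.value)
    return newest.value
```

Choosing small R and W favors availability and latency at the cost of possibly stale reads; choosing them so that they overlap trades some availability for consistency, per operation.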

Big Picture: CAP Theorem, with Wide Column Stores added
- AC-Systems: RDBMSs (MySQL, Postgres, ...)
- CP-Systems: Redis, MongoDB, HBase
- AP-Systems: Dynamo, CouchDB, Cassandra

NoSQL Data Models: Graph Databases
- Use graphs to store and represent relationships between entities
- Composed of nodes and edges
- Each node and each edge can contain properties (property graphs)
[Figure: example graph with the nodes Alice, Bob, Carol and Dave and edges labeled "lent money to" and "knows"]

NoSQL Data Models: Graph Databases
Example: Alice is a friend of Bob and vice versa. They both love the movie Titanic.
Step 1, nodes with properties:
  (name = Alice)   (name = Bob)   (title = Titanic)

NoSQL Data Models: Graph Databases
Example: Alice is a friend of Bob and vice versa. They both love the movie Titanic.
Step 2, nodes with labels:
  Person (name = Alice)   Person (name = Bob)   Movie (title = Titanic)

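NoSQL Data Models: Graph Databases
Example: Alice is a friend of Bob and vice versa. They both love the movie Titanic.
Step 3, relationships between the nodes:
  Person (name = Alice) - is a friend of -> Person (name = Bob)
  Person (name = Bob)   - is a friend of -> Person (name = Alice)
  Person (name = Alice) - loves -> Movie (title = Titanic)
  Person (name = Bob)   - loves -> Movie (title = Titanic)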
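The same property graph can be written down as plain Python data to show what a graph database stores and how a traversal works. The node identifiers, the relationship type names and the helper functions below are illustrative assumptions; a graph database would offer a query language for this instead of hand-written loops.

```python
def build_graph():
    # Nodes carry a label and properties; edges carry a relationship type.
    nodes = {
        "alice": {"label": "Person", "name": "Alice"},
        "bob": {"label": "Person", "name": "Bob"},
        "titanic": {"label": "Movie", "title": "Titanic"},
    }
    edges = [
        ("alice", "IS_FRIEND_OF", "bob"),
        ("bob", "IS_FRIEND_OF", "alice"),
        ("alice", "LOVES", "titanic"),
        ("bob", "LOVES", "titanic"),
    ]
    return nodes, edges


def neighbors(edges, node_id, edge_type):
    """Follow outgoing edges of one type from a node."""
    return [dst for src, typ, dst in edges if src == node_id and typ == edge_type]


def loved_by_all(edges, person_ids):
    """Return the targets that every given person has a LOVES edge to."""
    loved = [set(neighbors(edges, p, "LOVES")) for p in person_ids]
    return set.intersection(*loved) if loved else set()
```

The friendship is stored as two directed edges, matching the "and vice versa" in the example text.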

NoSQL Data Models: Graph Databases - Neo4j
- Master-slave replication (no partitioning!)
- Consistency: eventual consistency (tunable to immediate consistency)
- Support of ACID transactions
- Cypher query language
- Schema-optional

Big Picture: CAP Theorem, with Graph Databases added
- AC-Systems: RDBMSs (MySQL, Postgres, ...), Neo4j (graph database)
- CP-Systems: Redis (key/value store), MongoDB (document store), HBase (wide column store)
- AP-Systems: Dynamo (key/value store), CouchDB (document store), Cassandra (wide column store)