NoSQL Databases. Concept, Types & Use-cases.


Hello World
- Alon Spiegel (alon@brillix.co.il)
- Mamram graduate
- Programmer since 1995
- DBA since 1997
- Co-founder and CEO since 2007 (Brillix)
- VP Products since 2014 (Brillix/DBAces)

Let's talk about
- What is NoSQL
- NoSQL architecture: distributed database architecture principles
- NoSQL flavors: exploring the leaders
- On-premise vs managed service
- Community vs Enterprise
- Adoption plan and choosing the right one

THE OBJECT/RELATIONAL MAPPING PROBLEM

Object-relational Mapping (ORM)
- Mapping is good for CRUD
- Complex querying becomes very inefficient
- Mapping result sets (RS) to objects is tedious

Thinking out of the box
- 1998: Carlo Strozzi releases an RDBMS without an SQL API
- 2009: the NoSQL movement emerges; Strozzi refers to it as NoREL
- Eric Evans (a Rackspace employee) reintroduces the term at an event that discusses open-source distributed databases

FORMATION

The Jurassic era
- 1960s: Edgar Frank Codd (IBM), relational model theory
- 1974: first RDBMS ever, System R (IBM)
- 1979: Oracle v2 for PDP-11
- 1983: DB2 for MVS mainframe (IBM)
- 1984: Sybase is founded, but first released in 1986
- 1988: MS branches MSSQL 6 & 6.5

RDBMS: built to last
- Best relationship description
- Full normalization
- Powerful queries (joins + indexes)
- Rich language (SQL, PL/SQL, T-SQL)

RDBMS: built to last (continued)
- Easy to use and integrate
- Rich toolset
- Many vendors
- Many drivers
- Fully transactional

Expected limited concurrency
- 1980s: terminal-based computers (mainframes, AS/400, PDP-11, VAX)
- 1990s: client-server applications (DOS, Windows)

2000s: a need emerges
- Web-based applications and Web 2.0
- Social media (Facebook, Twitter)
- Huge data sets
- Unstructured data

So, what is NoSQL
- Not Only SQL
- Modern web-scale database
- Key principles: non-relational, distributed, open-source, horizontally scalable
- Often more characteristics: schema-free, easy replication support, simple API (CRUD+), eventually consistent (not ACID), huge amount of data
- Source: http://nosql-database.org/

The NoSQL movement's roots
- Google Bigtable
- Amazon Dynamo
- CAP theorem

ARCHITECTURE

General layout
- Records are distributed among chunks in shards
- Shards are usually installed one per host
- Shards usually host a mix of primary chunks and replica chunks
- Some implementations separate primary chunks from replica chunks by shard

General layout (diagram)
- Application
- Shards
- Primary/replica chunks
- Primary/replica records (KV / documents / records / links & endpoints)

General layout: scaling (out)
- Scaling out is the process of adding a shard to, or removing a shard from, the cluster
- The process is complete after a (long) rebalance operation that redistributes the data
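The layout above can be sketched in a few lines. This is only an illustration under stated assumptions: the fixed chunk count, the MD5-based mapping and the round-robin assignment are choices made for the example, not the scheme of any particular product.

```python
import hashlib

NUM_CHUNKS = 16  # assumed fixed number of chunks for this sketch


def chunk_of(key: str) -> int:
    """Map a record key to a chunk using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CHUNKS


def assign_chunks(shards: list[str]) -> dict[int, str]:
    """Spread primary chunks round-robin over the shards (a naive rebalance)."""
    return {chunk: shards[chunk % len(shards)] for chunk in range(NUM_CHUNKS)}


def shard_of(key: str, layout: dict[int, str]) -> str:
    """Find the shard that hosts the primary chunk of a key."""
    return layout[chunk_of(key)]


layout = assign_chunks(["shard-a", "shard-b"])
bigger = assign_chunks(["shard-a", "shard-b", "shard-c"])  # scale out by one shard

# Chunks whose owner changes must be copied, which is why a rebalance is long.
moved = [c for c in range(NUM_CHUNKS) if layout[c] != bigger[c]]
print(f"{len(moved)} of {NUM_CHUNKS} chunks move during the rebalance")
```

Records never change chunk here; only the chunk-to-shard assignment changes, so scaling out moves whole chunks rather than rehashing every record.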

ARCHITECTURE BY EXAMPLE

[Diagram: cluster layout in which each box is a host; labels include Config (x3) and Arbiter (x4)]

[Diagram: MongoDB replication]

[Diagram: MongoDB failover]

[Diagram: MongoDB DR]

ARCHITECTURE BY EXAMPLE

[Diagram: Couchbase sharding]

[Diagram: Couchbase replication]

[Diagram: Couchbase failover; shards are failed over as part of a rebalance]

[Diagram: Couchbase DR]

CAP THEOREM
- Consistency
- Availability
- Partition tolerance

CAP: theory, proof, theorem
- 2000: Eric Brewer, a Berkeley professor and Inktomi co-founder, states the conjecture
- 2002: Gilbert and Lynch prove it for asynchronous and partially synchronous network models

Brewer's (CAP) Theorem
- Consistency: no logical data corruption
- Availability: service always available
- Partition tolerance: protect against data discrepancies when network islands form

CAP Theorem in practice
In real life, the network will eventually fail. The database then has to choose:
- Inconsistent (AP): return whatever each data node holds (possibly outdated or not yet propagated)
- Unavailable (CP): don't return the failing node's data
- MongoDB is AP; it lets you tune the trade-off with two settings: write concern and read preference
- Couchbase is CP; automatic failover (AP)
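The two choices can be made concrete with a small simulation. It is a sketch, not a model of MongoDB or Couchbase: the `Replica` class, the `mode` argument and the rule "CP refuses to answer unless every replica is reachable" are simplifying assumptions made for the example.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True  # False simulates a network partition


class Unavailable(Exception):
    """Raised when no answer can be given under the chosen mode."""


def write(replicas, key, value):
    """Apply the write only to replicas that are currently reachable."""
    for replica in replicas:
        if replica.reachable:
            replica.data[key] = value


def read(replicas, key, mode):
    """AP: answer from any reachable replica. CP: refuse during a partition."""
    reachable = [r for r in replicas if r.reachable]
    if not reachable:
        raise Unavailable("no replica reachable")
    if mode == "CP" and len(reachable) < len(replicas):
        raise Unavailable("partition detected; refusing a possibly stale read")
    return reachable[0].data.get(key)


a, b = Replica("a"), Replica("b")
cluster = [a, b]
write(cluster, "x", 1)                  # both replicas store x=1
b.reachable = False                     # partition: b is cut off
write(cluster, "x", 2)                  # only a receives the update
a.reachable, b.reachable = False, True  # now only the stale replica answers

stale = read(cluster, "x", mode="AP")   # AP answers with b's outdated value
```

In AP mode the read succeeds but returns the value from before the partition; the same read in CP mode raises `Unavailable` instead.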

So why NoSQL
- Workload diversity: flexible design and data modeling
- Scalability: resource elasticity
- Performance: avoids monolithic limitations
- Durability and continuous availability: no single point of failure, no matter what

So why NoSQL (continued)
- Manageability: minimal operational complexity and specialized personnel
- Cost: cluster of commodity servers with cheap or free licensing
- Strong community: an important factor in choosing a NoSQL platform

RDBMS vs NoSQL comparison
- Types
  RDBMS: one type (SQL database) with minor variations
  NoSQL: many different types, including key-value stores, document databases, wide column stores and graph databases
- Development history
  RDBMS: developed in the 1970s to deal with the first wave of data storage applications
  NoSQL: developed in the 2000s to deal with limitations of SQL databases, particularly concerning scale, replication and unstructured data storage
- Schemas
  RDBMS: structure and data types are fixed in advance
  NoSQL: typically dynamic; records can add new information on the fly, unlike SQL tables
- Scaling
  RDBMS: vertically
  NoSQL: horizontally
- Data manipulation
  RDBMS: SQL, mostly ANSI and very similar among all the vendors
  NoSQL: through object-oriented APIs
- Examples
  RDBMS: Oracle, MSSQL, MySQL, PostgreSQL
  NoSQL: MongoDB, Cassandra, Aerospike, Couchbase, DynamoDB

EXPLORING NOSQL FLAVORS

4 flavors of NoSQL databases
- Key-value store (KVS)
- Document database
- Wide column store (column families)
- Graph database

Key-value store
- Designed for scaling to huge amounts of data and handling massive loads
- A hash table of unique keys that point to a data store
- A very easy model to implement
- Only keys are indexed
- No complex data model

Key-value store: Aerospike
- Hybrid memory architecture: key always in memory, value always on disk
- UDF: user-defined functions
- Optimized for SSD storage
- Strong consistency
- Fast as a cache, reliable as a database

Aerospike: popular uses
- AdTech
- Travel caching
- Online gaming

Key-value cache: Redis
- Key-value cache and store
- Referred to as a data structure server: strings, hashes, lists, sets, sorted sets, bitmaps and HyperLogLogs
- Persistence through interval checkpoints to disk or log appends
- Very good and fast as long as the dataset fits in the machine's RAM

Redis: popular uses
- Main feeds
- Queues
- Caching
- Follow

Document databases
- Support a complex, multi-value, nested data model
- Non-PK indexes updated online or in batch
- Document-level operations
- Recognize special data formats such as spatial
- Update a partial or an entire document

Document databases (continued)
- Organized similarly to a key-value store
- The value is a semi-structured document: JSON, BSON, XML, YAML

Document based: MongoDB
- Collection of BSON objects
- Secondary indexes (online)
- Fully consistent reads
- Provides embedded documents and linking (joins)
- Uses a locking mechanism
- Master-slave replication
- 10 checkpoints a second

MongoDB: popular uses
- Weather alerts
- Travel arrangement
- Personalization
- Custom CMS for readers

Document based: Couchbase
- A hybrid of CouchDB and memcached
- Buckets of JSON documents
- Map/Reduce views (batch) for indexing and aggregations
- Multi-master replication with XDCR
- Integration with Hadoop, Elasticsearch, etc.

Couchbase: popular uses
- Product listing cache
- Session/token store cache
- Follow
- Answers on demand: big data analytics on consumer purchase behavior and product sales trends

Document database: DynamoDB
- Managed service by Amazon
- Stores 3 geographically distributed replicas
- Read consistency is customizable: eventual or strong
- PK: a single hashed attribute (e.g. User ID) or a composite hash-range (e.g. User ID, timestamp)

DynamoDB: popular uses
- Recommendation platform

"Kids, you tried your best and you have failed miserably. The lesson is, never try." (Homer J. Simpson)
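The composite hash-range primary key mentioned for DynamoDB can be pictured as "hash to pick the partition, sort by the range attribute inside it". The class below is a plain-Python model of that idea; `HashRangeTable`, `put` and `query` are hypothetical names for this sketch and are not the DynamoDB API.

```python
import bisect
from collections import defaultdict


class HashRangeTable:
    def __init__(self):
        # hash key -> list of (range key, item), kept sorted by range key
        self._partitions = defaultdict(list)

    def put(self, hash_key, range_key, item):
        rows = self._partitions[hash_key]
        keys = [rk for rk, _ in rows]
        i = bisect.bisect_left(keys, range_key)
        if i < len(rows) and rows[i][0] == range_key:
            rows[i] = (range_key, item)        # same composite key: overwrite
        else:
            rows.insert(i, (range_key, item))  # keep the partition sorted

    def query(self, hash_key, low, high):
        """Return items of one hash key whose range key lies in [low, high]."""
        rows = self._partitions.get(hash_key, [])
        return [item for rk, item in rows if low <= rk <= high]


events = HashRangeTable()
events.put("user-1", 3, "logout")
events.put("user-1", 1, "login")
events.put("user-1", 2, "purchase")
events.put("user-2", 1, "login")
first_two = events.query("user-1", 1, 2)
```

A query always names one hash key, so it touches a single partition; the range key only narrows the result inside that partition.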

Wide column store
- Also called extensible record store or column family store
- Can hold a lot of columns (millions)
- Each key points to a row; each column is a tuple of column name and value, or a triplet that adds a timestamp

Wide column store (continued)
- A column family (table) is a list of values that reside together on disk (like a table in an RDBMS)
- Granularity: individual columns can be updated
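As a rough picture of the model above, a wide column row is a map from column name to a (value, timestamp) pair, and different rows may carry different columns. The "newest timestamp wins" rule in this sketch is an assumption for illustration; the function names are hypothetical.

```python
from collections import defaultdict

# row key -> column name -> (value, timestamp)
table = defaultdict(dict)


def put_column(row_key, column, value, timestamp):
    """Update a single column; the newest timestamp wins (assumed rule)."""
    current = table[row_key].get(column)
    if current is None or timestamp >= current[1]:
        table[row_key][column] = (value, timestamp)


def get_row(row_key):
    """Return the current values of a row without their timestamps."""
    return {name: value for name, (value, _) in table.get(row_key, {}).items()}


put_column("user-1", "email", "a@example.com", timestamp=1)
put_column("user-1", "email", "b@example.com", timestamp=3)
put_column("user-1", "email", "late@example.com", timestamp=2)  # older, ignored
put_column("user-1", "city", "Oslo", timestamp=1)
put_column("user-2", "nickname", "bo", timestamp=1)  # a column user-1 lacks
```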

Wide column store: super columns
- A super column is a list of columns
- It is effectively one nesting level
- Super columns do not have timestamps

Wide column store: Cassandra
- Born at Facebook, now under Apache
- A hybrid of Amazon's Dynamo and Google's Bigtable
- SQL-like query language: CQL
- Writes go to one node or to a majority of nodes (tunable consistency)
- Uses a gossip protocol for intra-cluster communication
- Row-level write isolation
- Use secondary indexes for low-cardinality searches

Wide column store: popular uses
- Logging, tracking and notifications
- Fraud detection
- Social signals
- Ad serving
- Audit (fighting spam and abuse)
- Churn tracking, follow requests

Wide column store: HBase
- Part of Hadoop, under Apache
- Strictly consistent reads and writes
- Intra-cluster communication is handled by ZooKeeper
- ACID-level semantics on a per-row basis
- Alternatives to secondary indexes:
  - Filter query
  - Periodic-update secondary indexes using MapReduce
  - Dual-write index
  - Summary tables using MapReduce
  - Coprocessor secondary index using internal triggers

Graph database
- Data is stored in a flexible graph model instead of a rigid structure of tables (rows and columns)
- In the graph, each node can be connected to any number of nodes, as in network topologies and public transportation systems

Graph database: property graph
- Entities (nodes): attributes (KV); labels, roles, constraints, indexes, metadata
- Relationships: start node and end node; attributes (KV); named, typed, directional (BiDi navigation)
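The property graph model above maps naturally onto two small record types. The sketch below is a minimal in-memory version; the class and method names are hypothetical and it makes no attempt at indexing or persistence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set = field(default_factory=set)
    props: dict = field(default_factory=dict)  # attributes (KV)


@dataclass
class Relationship:
    rel_type: str  # named and typed
    start: str     # start node id
    end: str       # end node id
    props: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.rels = []

    def add_node(self, node_id, labels=(), **props):
        self.nodes[node_id] = Node(node_id, set(labels), props)

    def relate(self, start, rel_type, end, **props):
        self.rels.append(Relationship(rel_type, start, end, props))

    def outgoing(self, node_id, rel_type):
        """Follow relationships in their stored direction."""
        return [r.end for r in self.rels
                if r.start == node_id and r.rel_type == rel_type]

    def incoming(self, node_id, rel_type):
        """Navigate the same relationships backwards (BiDi navigation)."""
        return [r.start for r in self.rels
                if r.end == node_id and r.rel_type == rel_type]


g = PropertyGraph()
g.add_node("alice", ["Person"], name="Alice")
g.add_node("bob", ["Person"], name="Bob")
g.relate("alice", "FOLLOWS", "bob", since=2015)
```

A relationship is stored once with a direction, yet it can be traversed from either end, which is what distinguishes it from a foreign key that has to be joined.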

Graph database: Neo4j
- First graph database
- Fully ACID compliant
- Visualization: http://neo4j.com/developer/guide-data-visualization/

Graph database: popular uses
- Online recommendations
- Social networks
- Graph-based search of content
- Spatial (geo)

PROGRAMMING PARADIGM SHIFT

Write preference
- An option in the source code (per operation)
- Can write to the memory of the primary, optionally adding a write to the secondary
- Can write to the storage of the primary, optionally adding a write to the secondary
- If a write to the secondary is forced and the secondary is missing, the write may fail

Read preference
- An option in the source code (per operation)
- Read from primary
- Read from secondary: forced or preferred
- Read from majority (to minimize the probability of inconsistent reads)

Eventually consistent
- A read from the primary replica is consistent
- But what happens if that replica fails?
- Persist writes to both the primary and the secondary replica
- Only when necessary
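Per-operation write and read preferences can be illustrated with a toy replica set. Everything here is an assumption made for the sketch: the `require_secondary` flag, the three preference names and the in-memory `Replica` class are hypothetical and do not correspond to any driver's API, and asynchronous replication is deliberately left out.

```python
from collections import Counter


class Replica:
    def __init__(self):
        self.data = {}
        self.up = True


class WriteFailed(Exception):
    """The requested write guarantee could not be met."""


def write(primary, secondaries, key, value, require_secondary=False):
    """Write to the primary; optionally insist that a secondary has it too."""
    live = [s for s in secondaries if s.up]
    if require_secondary and not live:
        raise WriteFailed("no secondary available to acknowledge the write")
    primary.data[key] = value
    if require_secondary:
        live[0].data[key] = value  # copied before the write is acknowledged
    # Without the flag, replication would happen later (not modeled here).


def read(primary, secondaries, key, preference="primary"):
    if preference == "primary":
        return primary.data.get(key)
    if preference == "secondary":
        return secondaries[0].data.get(key)
    if preference != "majority":
        raise ValueError(f"unknown read preference: {preference}")
    # Majority: return the value held by more than half of the replicas.
    replicas = [primary, *secondaries]
    value, votes = Counter(r.data.get(key) for r in replicas).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise LookupError("no value is held by a majority of replicas")
    return value


p, s1, s2 = Replica(), Replica(), Replica()
write(p, [s1, s2], "k", "v1", require_secondary=True)  # primary and s1 hold v1
write(p, [s1, s2], "k", "v2")                           # only the primary holds v2
```

After these two writes, a primary read returns "v2" while a secondary read still returns "v1", which is the inconsistency window the slides describe.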

Scalable writes
- Avoid multiple writes to the same shard
- A sequence or timestamp used as a range shard key creates a hotspot
- Updates to the same items create a hotspot

Minimize use of cluster-wide operations
- Secondary indexes are usually co-located with data
- Querying them performs a cluster-wide search
- Instead, model your data to use the cluster key whenever possible
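The hotspot caused by a timestamp used as a range shard key shows up clearly when counting where a burst of writes lands. The shard count, the range boundaries and the SHA-256 based alternative below are assumptions chosen for the illustration.

```python
import hashlib
from collections import Counter

SHARDS = 4


def range_shard(ts, boundaries=(1000, 2000, 3000)):
    """Range sharding: each shard owns a contiguous interval of key values."""
    for shard, upper in enumerate(boundaries):
        if ts < upper:
            return shard
    return len(boundaries)


def hash_shard(key):
    """Hash sharding: neighbouring keys are scattered over the shards."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARDS


recent_writes = range(3000, 3200)  # monotonically increasing timestamps

by_range = Counter(range_shard(ts) for ts in recent_writes)
by_hash = Counter(hash_shard(ts) for ts in recent_writes)

print("range sharding:", dict(by_range))  # every write lands on the last shard
print("hash sharding: ", dict(by_hash))   # writes are spread over the shards
```

The trade-off is that hashing gives up cheap range scans over the key, so the choice depends on whether the workload is write-heavy or scan-heavy.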

PK and secondary indexes
- Primary keys are used for sharding
- Most databases allow secondary indexes
- Secondary indexes are co-located with data

DEPLOYMENT

Comparison: on-premise vs cloud
Options compared: on-premise, managed service (cloud), cloud VMs
Criteria:
- Ease/speed of deployment
- Ease/speed of administration
- Ease/speed of scaling
- Upfront cost
- Dynamically changing cost
- Fixed resources cost
- Customization
- Log analysis
- Ops/DevOps resources
- Number of vendors

COMMUNITY VS ENTERPRISE

Comparison: CE vs EE
- License cost: free (Community edition); per node / RAM / core / data volume (Enterprise edition)
- Features: basic (CE); enhanced (EE)
- Security: basic (CE); enhanced (EE)
- Support: none (CE); SLA (EE)
- Scalability: low, with limited cores / nodes / RAM support (CE); high (EE)

DATA MODELING

Self-contained record
- Use only one record that contains the complete entity

Sibling records (relational)
- Indicate the relation using a common key in different sets

Referencing record (relational)
- The parent record holds limited information and pointers to other records

Referencing record (relational)
- Joining records

Atomicity and isolation
- Most NoSQL databases commit to atomic and isolated operations at the record level
- Some allow executing several operations on a single record atomically
- There are no multi-record transactions

Mixed entity & operations
- A record that holds both data and the logical operations to perform on it
- Doing so allows mutating the data and the to-do list in one atomic operation
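Because there are no multi-record transactions, a change that spans two records is usually staged through an extra transaction record, with each account keeping a list of transactions still in flight. The sketch below follows that two-phase style with plain dictionaries; the function names are hypothetical, the account names and amounts are sample values, and each account update stands in for one atomic single-record operation.

```python
accounts = {
    "Allen": {"balance": 500, "trans": []},
    "Beth": {"balance": 100, "trans": []},
}
transactions = {}


def begin(tran_id, source, target, value):
    transactions[tran_id] = {
        "source": source, "target": target, "value": value, "state": "init",
    }


def _deltas(t):
    return ((t["source"], -t["value"]), (t["target"], t["value"]))


def apply_to_accounts(tran_id):
    t = transactions[tran_id]
    t["state"] = "pending"
    for name, delta in _deltas(t):
        acct = accounts[name]
        if tran_id not in acct["trans"]:   # makes a retry after a crash harmless
            acct["balance"] += delta       # balance and marker change together,
            acct["trans"].append(tran_id)  # i.e. one single-record operation
    t["state"] = "applied"


def commit(tran_id):
    t = transactions[tran_id]
    for name, _ in _deltas(t):
        if tran_id in accounts[name]["trans"]:
            accounts[name]["trans"].remove(tran_id)
    t["state"] = "done"


def cancel(tran_id):
    t = transactions[tran_id]
    t["state"] = "canceling"
    for name, delta in _deltas(t):
        acct = accounts[name]
        if tran_id in acct["trans"]:       # undo only what was actually applied
            acct["balance"] -= delta
            acct["trans"].remove(tran_id)
    t["state"] = "canceled"


begin(115, "Allen", "Beth", 30)
apply_to_accounts(115)
commit(115)
```

The marker in each account's `trans` list is what lets a recovery process tell which side of an interrupted transfer has already been applied.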

Activity log
- Use sibling records with an incrementor: <prefix>::<key>::<incr>

Data versioning: multi-record mutation atomicity
- Use referenced records with a version number in the key
- Committing all mutations is done atomically by the parent record
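The `<prefix>::<key>::<incr>` pattern for an activity log can be sketched against a dictionary standing in for a key-value store. The `count` suffix used for the counter record is an assumption of this example, and a real store would rely on its atomic increment operation rather than the read-then-write shown here.

```python
store = {}  # stand-in for a key-value store


def append_activity(prefix, key, entry):
    """Store the entry under the next sibling key and return its number."""
    counter_key = f"{prefix}::{key}::count"  # hypothetical counter record
    incr = store.get(counter_key, 0) + 1     # not atomic in this sketch
    store[counter_key] = incr
    store[f"{prefix}::{key}::{incr}"] = entry
    return incr


def read_activity(prefix, key):
    """Read the log back in order by walking the sibling keys."""
    count = store.get(f"{prefix}::{key}::count", 0)
    return [store[f"{prefix}::{key}::{i}"] for i in range(1, count + 1)]


append_activity("log", "user-1", "login")
append_activity("log", "user-1", "view item 42")
append_activity("log", "user-2", "login")
```

Each entry is its own small record, so appending never rewrites a growing document and every read is a direct key lookup.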

[Diagrams: data versioning (two slides)]

Multi-document atomicity
Multi-document atomicity without a natural parent, using a transaction record:
- Transaction record: TranID: 115, Source: Allen, Target: Beth, Value: 30, State: one of [init, pending, applied, done, canceling, canceled], LastModified: timestamp
- Accounts marked with the transaction: Account: Allen, Balance: 500, Trans: [115]; Account: Beth, Balance: 100, Trans: [115]
- Accounts after applying: Account: Allen, Balance: 500 - 30 = 470, Trans: [115] becomes []; Account: Beth, Balance: 100 + 30 = 130, Trans: [115] becomes []
- On failure (X), <UNDO>: Account: Allen, Balance: 500, Trans: [115]; Account: Beth, Balance: 100, Trans: [115]
Reference: https://docs.mongodb.com/manual/tutorial/perform-two-phase-commits/

ADOPTION PLAN
Choosing the right one out of the bunch

Adoption plan
Organizations usually opt for one of the following:
- Convert a less important module to NoSQL
- Convert a main module to NoSQL
- Convert a suitable module to NoSQL
- Convert the entire system to NoSQL
Organizations usually begin with CE (free) and later upgrade to EE (availability, scalability, security, SLA and professional services).

Adoption plan (continued)
- RDBMS is not dead
- Migrate only modules that would really benefit
- Choose the best NoSQL engine for your application module
- Note that programming against a NoSQL database dictates different paradigms than relational
- Using several NoSQL engines for different functionalities is advised
- Beware of falling into a DevOps pit

When to go NoSQL
- Most operations are simple CRUD
- Speed and volume are just too much for an RDBMS: an average pace of thousands of CRUD operations per second
- The data model and most queries can be achieved without joins (SQL JOIN)
- BASE is applicable (Basic Availability, Soft state, Eventual consistency), though some databases support strong consistency

When to NOT go NoSQL
- The data model is very relational
- Constraints and referential integrity are mandatory
- Querying is complex
- Even in those cases, some modules can usually be converted to NoSQL and benefit from its advantages

And there is much more
- More design patterns fitting for NoSQL
- More data models
- Search engines such as Elasticsearch and Solr
- Data grids such as XAP
- Streams and queues such as Kafka and RabbitMQ


Diving into RavenDB
Michael Yarichuk (micahel@ravendb.net)

Prime Directive
- Be a better database, not less of a database
- Transactional (ACID) document database
- For OLTP workloads
- Distributed, reliable, etc.
- Shouldn't require black magic to run

Dual Distributed Nature
- CP: cluster-wide consistency; consensus protocol (Raft); used for cluster-wide operations (create / drop DB, add node to cluster)
- AP: the database will always accept writes; gossip protocol (merge); conflict detection & handling; as long as a single node is up, it can serve reads / writes

Distribution of data
- Data is replicated
- Self-monitoring and active repair
- Task assignment in cluster
- Automatic failover and topology updates

Modeling Features
- Documents (JSON)
- Transactions: single document / multiple documents / multiple collections; consistent transactions across machine boundary
- Attachments: employee photo, document scans
- Counters: distributed
- Also part of TX
- Associated with, but not part of, the document

Integration
- ETL: SQL ETL for relational; RavenDB ETL for data flow; move & shape data as it moves
- Subscriptions: long-term persistent subscription; batch operations; reliable

Indexes
- Query with RQL
- Query optimizer and dynamic indexes
- Spatial, full text search, facets, etc.
- Indexes are BASE
- Batched, optimized
- Can choose to wait on write

Map-Reduce
- Aggregation that is easy, up to date and available
- Computed in background
- Query on the results
- Cheap to update
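The map-reduce aggregation idea can be shown with plain Python. This is a generic sketch, not RavenDB index syntax: the sample orders and function names are hypothetical. The point is that the reduce step accepts its own output, which is what makes a stored result cheap to update.

```python
from collections import defaultdict

orders = [
    {"customer": "allen", "total": 30},
    {"customer": "beth", "total": 12},
    {"customer": "allen", "total": 8},
]


def map_order(order):
    """Map: emit one (key, partial aggregate) pair per document."""
    yield order["customer"], {"count": 1, "total": order["total"]}


def reduce_entries(entries):
    """Reduce: combine partial aggregates; the output has the input's shape."""
    return {
        "count": sum(e["count"] for e in entries),
        "total": sum(e["total"] for e in entries),
    }


def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_order(doc):
            groups[key].append(value)
    return {key: reduce_entries(values) for key, values in groups.items()}


results = map_reduce(orders)

# A new order is folded into the stored result without rescanning old orders.
new_order = {"customer": "allen", "total": 5}
_, partial = next(map_order(new_order))
results["allen"] = reduce_entries([results["allen"], partial])
```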

Static Indexes
- LINQ / JS (upcoming)
- Computation during indexing
- Queries do not allow computation (therefore, fast)

Reducing costs
- Load
- Include
- Lazy
- Because who wants to go to the server twice
- Reduce network round trips

Licensing
- Community: free; 3 cores; 6 GB; up to 3 nodes in cluster
- Developer: free; all features & functions; not for production
- Commercial: per core / year; Professional and Enterprise; encryption; dynamic cluster behavior

Questions?
