NoSQL Databases. Vincent Leroy

Similar documents
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. Hung- chih Yang, Ali Dasdan Yahoo! Ruey- Lung Hsiao, D.

CISC 7610 Lecture 2b The beginnings of NoSQL

SCALABLE CONSISTENCY AND TRANSACTION MODELS

Webinar Series TMIP VISION

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Introduction to NoSQL Databases

Introduction to NoSQL

Chapter 24 NOSQL Databases and Big Data Storage Systems

10. Replication. Motivation

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre

Database Availability and Integrity in NoSQL. Fahri Firdausillah [M ]

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

CIB Session 12th NoSQL Databases Structures

CAP Theorem. March 26, Thanks to Arvind K., Dong W., and Mihir N. for slides.

Haridimos Kondylakis Computer Science Department, University of Crete

Consistency in Distributed Storage Systems. Mihir Nanavati March 4 th, 2016

Big Data Analytics. Rasoul Karimi

Transactions and ACID

Large-Scale Key-Value Stores Eventual Consistency Marco Serafini

Data Informatics. Seon Ho Kim, Ph.D.

Apache Cassandra - A Decentralized Structured Storage System

CS 655 Advanced Topics in Distributed Systems

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

Extreme Computing. NoSQL.

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 NoSQL Databases

NoSQL Concepts, Techniques & Systems Part 1. Valentina Ivanova IDA, Linköping University

Eventual Consistency 1

AN introduction to nosql databases

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

Migrating Oracle Databases To Cassandra

Safe Harbor Statement

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

Floodless in SEATTLE A Scalable Ethernet Architecture for Large Enterprises By Changhoon Kim, Ma/hew Caesar, and Jennifer Rexford

Modern Database Concepts

Eventual Consistency Today: Limitations, Extensions and Beyond

Course Introduction & Foundational Concepts

Performance and Forgiveness. June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences

Overview. * Some History. * What is NoSQL? * Why NoSQL? * RDBMS vs NoSQL. * NoSQL Taxonomy. *TowardsNewSQL

Understanding NoSQL Database Implementations

Dynamo: Amazon s Highly Available Key-value Store. ID2210-VT13 Slides by Tallat M. Shafaat

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller

NoSQL Databases. CPS352: Database Systems. Simon Miner Gordon College Last Revised: 4/22/15

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

SESSION TITLE GOES HERE Second Cosmos for Line the Goes Business Here Intelligence Professional

CSE 530A. Non-Relational Databases. Washington University Fall 2013

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

A BigData Tour HDFS, Ceph and MapReduce

Distributed Data Store

Data Management for Big Data Part 1

Don t Give Up on Serializability Just Yet. Neha Narula

The CAP theorem. The bad, the good and the ugly. Michael Pfeiffer Advanced Networking Technologies FG Telematik/Rechnernetze TU Ilmenau

CompSci 516 Database Systems

CS6450: Distributed Systems Lecture 11. Ryan Stutsman

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Big Data Development CASSANDRA NoSQL Training - Workshop. November 20 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

Final Exam Logistics. CS 133: Databases. Goals for Today. Some References Used. Final exam take-home. Same resources as midterm

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

CSE 344 JULY 9 TH NOSQL

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Databases and Big Data Today. CS634 Class 22

SCALABLE CONSISTENCY AND TRANSACTION MODELS THANKS TO M. GROSSNIKLAUS

/ Cloud Computing. Recitation 6 October 2 nd, 2018

Big Data Technology Incremental Processing using Distributed Transactions

Big Data Analytics using Apache Hadoop and Spark with Scala

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd.

GridGain and Apache Ignite In-Memory Performance with Durability of Disk

Key Value Store. Yiding Wang, Zhaoxiong Yang

Recap. CSE 486/586 Distributed Systems Case Study: Amazon Dynamo. Amazon Dynamo. Amazon Dynamo. Necessary Pieces? Overview of Key Design Techniques

Relational databases

BigTable: A Distributed Storage System for Structured Data

Database Design & Deployment

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

CS 445 Introduction to Database Systems

Computing Parable. The Archery Teacher. Courtesy: S. Keshav, U. Waterloo. Computer Science. Lecture 16, page 1

Distributed Databases: SQL vs NoSQL

CS6450: Distributed Systems Lecture 15. Ryan Stutsman

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CMPT 354: Database System I. Lecture 11. Transaction Management

COSC 304 Introduction to Database Systems. NoSQL Databases. Dr. Ramon Lawrence University of British Columbia Okanagan

Strong Consistency & CAP Theorem

Advanced Data Management Technologies

Distributed Systems. Lec 12: Consistency Models Sequential, Causal, and Eventual Consistency. Slide acks: Jinyang Li

Presented by Sunnie S Chung CIS 612

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Achieving the Potential of a Fully Distributed Storage System

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

EECS 498 Introduction to Distributed Systems

Recap. CSE 486/586 Distributed Systems Case Study: Amazon Dynamo. Amazon Dynamo. Amazon Dynamo. Necessary Pieces? Overview of Key Design Techniques

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

MongoDB and Mysql: Which one is a better fit for me? Room 204-2:20PM-3:10PM

Transcription:

NoSQL Databases Vincent Leroy 1

Database Large-scale data processing First 2 classes: Hadoop, Spark Perform some computacon/transformacon over a full dataset Process all data SelecCve query Access a specific part of the dataset Manipulate only data needed (1 record among millions) à Database system 2

Processing / Database Link Batch Job (Hadoop, Spark) Stream Job (Spark, Storm) e.g. sencment analysis Serve queries Load data Write results Write results Database Insert records ApplicaCon 1 ApplicaCon 2 ApplicaCon 3 e.g. TwiSer trends page 3

Different types of databases So far we used HDFS A file system can be seen as a very basic database Directories / files to organize data Very simple queries (file system path) Very good scalability, fault tolerance Other end of the spectrum: RelaConal Databases SQL query language, very expressive Limited scalability (generally 1 server) 4

Size / Complexity Graph DB Complexity RelaConal DB Document DB Column DB Key/Value DB Filesystem Size 5

The NoSQL Jungle 6

Goal of these slides Present an overview of the NoSQL landscape Trade-off in choosing a solucon Theorems and principles Not a manual to learn specific DBs Too many of them Not that complicated (especially K/V stores) Focus on Neo4j graph DB in lab work 7

RelaConal Databases: SQL SQL language born 1974 SCll used by most data processing systems (including Spark) à Learn it! Don t be a viccm of the NoSQL hype! 8

RelaConal Databases model Data organized as tables Row = record Column = asribute RelaCons between tables Integrity constraints Select Ctle from courses natural join takes_courses group by ClassID having count(*) > 10 9

ACID properces Atomicity TransacCon are all or nothing (e.g. when adding a bi-direcconal friendship relacon, it s added both ways or not at all) Consistency Only valid data wrisen (e.g. cannot say a student takes a course that is not in the courses table) IsolaCon When mulcple transaccons execute simultaneously, they appear as if they were executed sequencally (aka serializability) Durability When data has been wrisen and validated, it is permanent (i.e. no data loss, even in the case of some failures) à Easy life for the developer 10

Why NoSQL then? What does NoSQL mean? No SQL New SQL Not only SQL SQL strong properces limit its ability to scale to very large datasets Relax some properces to deal with larger datasets (ACID) But at what cost? SQL is very structured (each record has the same columns ), Web data ooen is not Semi-structured data Unstructured data Graph data 11

CAP Consistency When mulcple operacons execute simultaneously, it appears as if they were executed one aoer the other (A of ACID) Availability Every request received by a non failed node must be answered ParCCon tolerance System must respond correctly even if network fails 12

CAP theorem Impossible to have 3 simultaneously Choose CA, CP, or AP In a centralized system, no need for P RelaConal databases have CA In a distributed system, you cannot ignore P Distributed databases choose CP or AP 13

CAP intuicon 2 solucons: Refuse to answer in case of parccon Answer but risk inconsistencies A: 3 A: 2 A: 3 B: 5 B: 6 Client 1 ParCCon Client 2 14

NoSQL and CAP 15

Weaker consistency models Eventual consistency When there is no parccon, DB is consistent In case of parccon, DB can return stale data Once parccon is gone, there is a Cme limit on how long it takes for consistency to return Different levels of consistency (consistency / cost tradeoff) Causal consistency Read-your-writes consistency Session consistency Monotonic read consistency Monotonic write consistency à Again, many choices, so many different systems 16

Vector clocks & conflict deteccon 17

Vector clocks & conflict deteccon 18

Vector clocks & conflict deteccon 19

Vector clocks & conflict deteccon 20

Vector clocks & conflict deteccon 21

Vector clocks & conflict deteccon 22

Vector clocks & conflict deteccon 23

Vector clocks & conflict deteccon 24

Vector clocks & conflict deteccon 25

Key/Value store 2 basic operacons, similar to the HashMap data structure Put(K,V) Get(K) Ooen used for caching informacon in memory Facebook uses them a lot 26

Column/Tabular DB Data organized as rows with a primary key Flexible format, each row can have different fields in a column family Relies on HDFS for fault tolerance 27

Document DB Data stored as documents (ooen JSON) Richer than K/V stores Insert Delete Update Find AggregaCon funccons (Map, Reduce ) Indexes 28

Document DB 29

Document DB 30

Graph DB Represent data as graphs Nodes / relaconships with properces as K/V pairs 31

Graph DB: Neo4j Rich data format Query language as pasern matching Limited scalability ReplicaCon to scale reads, writes need to be done to every replica 32