CSE 530A. Non-Relational Databases. Washington University Fall 2013


NoSQL
  "NoSQL" was originally the name of a specific RDBMS project that did not use a SQL interface
  The name was co-opted years later to refer to a growing category of non-relational distributed databases
  Later still it was reinterpreted to stand for "Not only SQL" instead of "No SQL"
    This happened when more mature members of the community realized that their "new" approach was not the be-all and end-all of everything

NoSQL
  "NoSQL" has come to refer to any non-relational database
  But there are many types of non-relational databases
    Key-value
    Document-oriented
    Column-based
    Graph

NoSQL
  A common characteristic is the lack of relationships between records
    No "joins"
  Complexity of searches may also be limited
    E.g., a key-value store only allows searching by key
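To illustrate the key-only lookup restriction, here is a minimal sketch in plain JavaScript. A built-in Map stands in for a key-value store; the names `store` and `findKeysByValue` and the sample records are made up for illustration and are not any database's API. Fetching by key is a direct lookup, while finding records by value requires the application to scan every entry itself.

```javascript
// A built-in Map standing in for a key-value store (illustrative only).
const store = new Map();
store.set("user:1", { name: "Amy", city: "St. Louis" });
store.set("user:2", { name: "Bob", city: "Chicago" });

// Lookup by key: the one query a pure key-value store supports directly.
const amy = store.get("user:1");

// Hypothetical helper: "searching" by value means scanning every entry
// in application code, because the store has no index on the values.
function findKeysByValue(kv, predicate) {
  const keys = [];
  for (const [key, value] of kv) {
    if (predicate(value)) keys.push(key);
  }
  return keys;
}

const chicagoKeys = findKeysByValue(store, (u) => u.city === "Chicago");
console.log(amy.name, chicagoKeys);
```

The scan touches every record, which is why systems that need richer queries usually add secondary indexes or use a different data model.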

Distributed
  Another common characteristic is the ability to scale out
    Distribute the data across multiple machines

Distributed
  It is (relatively) easy to scale out the application part of a web application
    Can run multiple application servers with load balancing, etc.
  (Diagram: three app servers in front of a single database)

Distributed
  It is difficult to scale out a traditional RDBMS
  The problem is ACID compliance
    Atomicity
    Consistency
    Isolation
    Durability
  Consistency is a particular problem
    If a database is distributed across multiple servers, then all servers would need to be updated atomically to preserve consistency

BASE
  But not all situations require full ACID compliance
    Does it really matter if some users see a tweet a few seconds before others?
  BASE
    Basically Available
    Soft state
    Eventual consistency
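As a rough illustration of eventual consistency, the sketch below models two replicas and a replication queue in plain JavaScript. Every name here (`Replica`, `write`, `replicate`, `replicationQueue`) is hypothetical and no real database API is being modeled. A read from the second replica is stale until replication runs, after which both replicas agree.

```javascript
// Toy model of eventual consistency: two replicas and a replication queue.
// All names are hypothetical; this does not model any real database.
class Replica {
  constructor() { this.data = new Map(); }
  read(key) { return this.data.get(key); }
  apply(key, value) { this.data.set(key, value); }
}

const primary = new Replica();
const secondary = new Replica();
const replicationQueue = [];

// A write is acknowledged after reaching one replica; the other is updated later.
function write(key, value) {
  primary.apply(key, value);
  replicationQueue.push([key, value]);
}

// Drains the queue, as background replication eventually would.
function replicate() {
  while (replicationQueue.length > 0) {
    const [key, value] = replicationQueue.shift();
    secondary.apply(key, value);
  }
}

write("tweet:1", "hello");
const staleRead = secondary.read("tweet:1"); // not replicated yet
replicate();
const freshRead = secondary.read("tweet:1"); // replicas have converged
console.log(staleRead, freshRead);
```

This is the tweet scenario in miniature: a user reading from the second replica sees the tweet slightly later than a user reading from the first.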

CAP Theorem
  The CAP theorem states that a distributed system can have at most two of the following three properties
    Consistency: data is the same across all nodes
    Availability: the cluster is always available
    Partition tolerance: the cluster continues to function even if there are communication failures between nodes or nodes go down

CAP Theorem
  Possibilities
    CA (Consistency and Availability)
      Data is consistent across all nodes as long as all nodes are online
      A communication failure between nodes could cause inconsistency
    CP (Consistency and Partition tolerance)
      Data is consistent across all nodes, but the entire cluster becomes unavailable if there is a failure
    AP (Availability and Partition tolerance)
      The cluster remains available even if there is a communication failure between nodes, but the data across nodes may not be consistent
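The CP and AP trade-off can be sketched with a toy two-node cluster in plain JavaScript. This is a deliberately simplified illustration, assuming a cluster is just two Maps and a `partitioned` flag; `makeCluster`, `write`, and `isConsistent` are hypothetical names and do not reflect how any real system behaves. During a partition the "CP" cluster refuses the write and stays consistent, while the "AP" cluster accepts it and the nodes diverge.

```javascript
// Toy two-node cluster; `partitioned` simulates a broken link between the nodes.
// The "CP" and "AP" modes are simplified illustrations, not real database behavior.
function makeCluster(mode) {
  return { mode, partitioned: false, nodes: [new Map(), new Map()] };
}

// Write through one node. Returns true if the write was accepted.
function write(cluster, nodeIndex, key, value) {
  if (!cluster.partitioned) {
    for (const node of cluster.nodes) node.set(key, value); // replicate everywhere
    return true;
  }
  if (cluster.mode === "CP") return false;   // refuse: stay consistent, lose availability
  cluster.nodes[nodeIndex].set(key, value);  // accept locally: stay available, nodes diverge
  return true;
}

// True when both nodes hold the same value for the key.
function isConsistent(cluster, key) {
  return cluster.nodes[0].get(key) === cluster.nodes[1].get(key);
}

const cp = makeCluster("CP");
const ap = makeCluster("AP");
cp.partitioned = true;
ap.partitioned = true;

const cpAccepted = write(cp, 0, "x", 1);
const apAccepted = write(ap, 0, "x", 1);
console.log("CP:", cpAccepted, isConsistent(cp, "x"));
console.log("AP:", apAccepted, isConsistent(ap, "x"));
```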

CAP Theorem
  RDBMSs fall into the CA category
  "NoSQL" databases (or distributed databases in general) tend to fall into either the CP or the AP category
    They give up either availability or consistency for partition tolerance
  CP: BigTable, HBase, MongoDB, Redis
  AP: Dynamo, Cassandra, SimpleDB, CouchDB

Queries
  If a database does not support SQL, then how is data inserted or retrieved?
  Generally a program must be written
    Client libraries for programming languages
    Database-system-specific scripting language

Example: MongoDB
  Document-oriented database
  Key-value store where the values are JSON objects
    Actually stored in BSON (Binary JSON) format

MongoDB
  From the mongo shell:
    user1 = { name : "Amy", password : "foo" }
    user2 = { name : "Bob", password : "bar" }
    db.users.insert(user1)
    db.users.insert(user2)
    db.users.find()
  Will return something like
    { "_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "Amy", "password" : "foo" }
    { "_id" : ObjectId("4c2209fef3924d31102bd84b"), "name" : "Bob", "password" : "bar" }

MongoDB
  When you query a collection, MongoDB returns a cursor
  Can iterate over the results using the cursor
    var results = db.users.find()
    while (results.hasNext()) {
      printjson(results.next())
    }
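For readers without a MongoDB instance at hand, the cursor iteration pattern can be tried with a stand-in. The `ArrayCursor` class below is hypothetical: it wraps a plain array and exposes the same `hasNext()`/`next()` shape as the shell cursor, nothing more.

```javascript
// Hypothetical array-backed cursor with the same hasNext()/next() shape.
// This is a stand-in for experimentation, not MongoDB's cursor implementation.
class ArrayCursor {
  constructor(items) {
    this.items = items;
    this.position = 0;
  }
  hasNext() {
    return this.position < this.items.length;
  }
  next() {
    if (!this.hasNext()) throw new Error("cursor exhausted");
    return this.items[this.position++];
  }
}

const results = new ArrayCursor([{ name: "Amy" }, { name: "Bob" }]);
const names = [];
while (results.hasNext()) {
  names.push(results.next().name); // consume one document per iteration
}
console.log(names);
```

A cursor hands out one result at a time, so the caller does not need the whole result set in memory at once.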

MapReduce
  Parallel programming model for processing large distributed data sets
  Inspired by the map and reduce (a.k.a. fold) functions from functional programming

MapReduce
  Functional programming
    Map
      Applies a given function to each element of a list
      (map square [1, 2, 3])
        Will apply square to each element of [1, 2, 3], resulting in [1, 4, 9]
    Fold
      Combines the elements of a list using a function and a base value
      (fold add 0 [1, 2, 3])
        Will iterate over the list, adding each value to the accumulated value (initialized to the base value), resulting in 6
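The same two examples can be written with JavaScript's built-in array methods, where `reduce` plays the role of fold. The helper names `square` and `add` simply mirror the ones on the slide.

```javascript
// Map: apply a function to each element of a list.
const square = (x) => x * x;
const squares = [1, 2, 3].map(square);

// Fold: combine the elements using a function and a base value.
// Array.prototype.reduce takes the combining function first, then the base value.
const add = (accumulated, x) => accumulated + x;
const total = [1, 2, 3].reduce(add, 0);

console.log(squares, total); // [ 1, 4, 9 ] 6
```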

MapReduce
  Functional programming
    Fold right vs fold left
      (foldl f x [a, b, c]) expands to (f (f (f x a) b) c)
      (foldr f x [a, b, c]) expands to (f a (f b (f c x)))
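The difference between the two folds is easy to see by running them. In this sketch `foldl` and `foldr` are small hypothetical helpers built on `Array.prototype.reduce` and `Array.prototype.reduceRight`; both take the base value before the list, which is a choice made for this example. The function `f` just builds a string, so the nesting of the calls becomes visible, and subtraction shows that the two folds can give different answers when the function is not associative.

```javascript
// foldl combines from the left: f(f(f(x, a), b), c).
function foldl(f, x, list) {
  return list.reduce((acc, elem) => f(acc, elem), x);
}

// foldr combines from the right: f(a, f(b, f(c, x))).
function foldr(f, x, list) {
  return list.reduceRight((acc, elem) => f(elem, acc), x);
}

// A function that records how it was called, to expose the nesting.
const f = (p, q) => `(f ${p} ${q})`;
const leftShape = foldl(f, "x", ["a", "b", "c"]);
const rightShape = foldr(f, "x", ["a", "b", "c"]);

// With subtraction the two folds disagree.
const sub = (p, q) => p - q;
const leftValue = foldl(sub, 0, [1, 2, 3]);  // ((0 - 1) - 2) - 3
const rightValue = foldr(sub, 0, [1, 2, 3]); // 1 - (2 - (3 - 0))

console.log(leftShape, rightShape, leftValue, rightValue);
```

For an associative and commutative function such as addition, both folds give the same result, which is what lets a distributed reduce combine partial results in any order.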

MapReduce
  The distributed MapReduce algorithm applies this pattern across a cluster
  Map
    Master node
      Takes the input
      Divides it into smaller sub-problems
      Distributes the sub-problems to worker nodes
    Worker node
      Processes its sub-problem
      Returns the result to the master node

MapReduce
  Reduce
    Master node
      Collects the results from the worker nodes
      Combines the results to form the final answer
      Could actually have multiple "reduce" nodes
  Parallelization
    Worker nodes can operate independently in parallel
  Pipelining
    "Map" nodes often return results as they are found
    Present an iterator interface to the "reduce" nodes

MapReduce
  Count the appearances of each word in a set of documents
    function map(String name, String document):
      // name: document name
      // document: document contents
      for each word w in document:
        emit (w, 1)

    function reduce(String word, Iterator partialCounts):
      // word: a word
      // partialCounts: a list of aggregated partial counts
      sum = 0
      for each pc in partialCounts:
        sum += ParseInt(pc)
      emit (word, sum)
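A single-process version of the word count can be run in plain JavaScript. This is only a sketch of the programming model, assuming everything fits in memory on one machine: the function names `mapDocument`, `shuffle`, and `reduceCounts`, the word-splitting rule, and the two sample documents are all made up for illustration. The grouping ("shuffle") step is something a real framework would perform between the map and reduce phases.

```javascript
// Map step: emit a (word, 1) pair for every word in one document.
function mapDocument(name, contents) {
  const words = contents.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
  return words.map((w) => [w, 1]);
}

// Shuffle step: group the emitted values by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce step: sum the partial counts for one word.
function reduceCounts(word, partialCounts) {
  let sum = 0;
  for (const pc of partialCounts) sum += pc;
  return [word, sum];
}

// Hypothetical input: document name -> document contents.
const documents = { "a.txt": "the cat sat", "b.txt": "the cat ran" };

const emitted = Object.entries(documents).flatMap(
  ([name, text]) => mapDocument(name, text)
);
const counts = new Map(
  [...shuffle(emitted)].map(([word, partials]) => reduceCounts(word, partials))
);
console.log(counts);
```

Because each call to `mapDocument` depends only on its own document, and each call to `reduceCounts` only on its own word, both phases could be spread across many workers.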

MongoDB example

Transactions
  "NoSQL" databases typically do not support transactions
    Only operations on a single row/document are atomic
  Need to use something like two phase commit for transactions

Two Phase Commit
  Phase 1: prepare
    Manager tells resources to prepare
    Resources report back when prepared
  Phase 2: commit or rollback
    Manager tells resources to commit or rollback
    Resources report back when committed or rolled back
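The two phases can be sketched as a small in-memory simulation in plain JavaScript. The `Resource` class and `twoPhaseCommit` function are hypothetical; a real implementation would also need durable logging, timeouts, and handling of a failed manager, all of which are omitted here.

```javascript
// Toy resource that votes in phase 1 and then commits or rolls back in phase 2.
class Resource {
  constructor(name, canPrepare = true) {
    this.name = name;
    this.canPrepare = canPrepare; // simulates whether prepare succeeds
    this.state = "idle";
  }
  prepare() {
    this.state = this.canPrepare ? "prepared" : "failed";
    return this.canPrepare;
  }
  commit() { this.state = "committed"; }
  rollback() { this.state = "rolled back"; }
}

// Manager: commit only if every resource reports that it is prepared.
function twoPhaseCommit(resources) {
  const votes = resources.map((r) => r.prepare()); // phase 1: ask every resource
  const allPrepared = votes.every((v) => v);
  for (const r of resources) {                      // phase 2: same decision for all
    if (allPrepared) r.commit();
    else r.rollback();
  }
  return allPrepared;
}

const good = [new Resource("accounts"), new Resource("orders")];
const bad = [new Resource("accounts"), new Resource("orders", false)];
const goodOutcome = twoPhaseCommit(good);
const badOutcome = twoPhaseCommit(bad);
console.log(goodOutcome, good.map((r) => r.state));
console.log(badOutcome, bad.map((r) => r.state));
```

The key property is that the decision is made once, by the manager, and applied to every resource, so either all of them commit or none do.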

Two Phase Commit
  MongoDB
    Use two phase commit to update multiple documents "atomically"
    1. Create transaction document in "initial" state with references to all affected rows
    2. Set transaction document to "pending" state
    3. For each affected row
         Apply update and add transaction document reference to row, only if the transaction document reference is not already in the row
    4. Set transaction document state to "committed"
    5. Remove transaction document reference from all affected rows
    6. Set transaction document state to "done"

Two Phase Commit
  1. Create transaction document in "initial" state with references to all affected rows
       db.transactions.save({source: "A", destination: "B", value: 100, state: "initial"})
  2. Set transaction document to "pending" state
       t = db.transactions.findOne({state: "initial"})
       db.transactions.update({_id: t._id}, {$set: {state: "pending"}})

Two Phase Commit
  3. Apply update and add transaction document reference to rows
       db.accounts.update({name: t.source, pendingtransactions: {$ne: t._id}},
                          {$inc: {balance: -t.value}, $push: {pendingtransactions: t._id}})
       db.accounts.update({name: t.destination, pendingtransactions: {$ne: t._id}},
                          {$inc: {balance: t.value}, $push: {pendingtransactions: t._id}})
  4. Set transaction document state to "committed"
       db.transactions.update({_id: t._id}, {$set: {state: "committed"}})

Two Phase Commit
  5. Remove transaction document reference from all affected rows
       db.accounts.update({name: t.source}, {$pull: {pendingtransactions: t._id}})
       db.accounts.update({name: t.destination}, {$pull: {pendingtransactions: t._id}})
  6. Set transaction document state to "done"
       db.transactions.update({_id: t._id}, {$set: {state: "done"}})

Two Phase Commit
  Failures that occur after the transaction is set to "pending" (step 2) but before it is set to "committed" (step 4)
    To recover, the application should get a list of transactions in the "pending" state and resume each from step 3 (apply the updates)
  Failures that occur after the transaction is set to "committed" (step 4) but before it is set to "done" (step 6)
    To recover, the application should get a list of transactions in the "committed" state and resume each from step 5 (remove the transaction references)
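Recovery is safe because re-running the update on a pending transaction cannot apply it twice. The sketch below shows this with in-memory objects in plain JavaScript rather than MongoDB calls; `accounts`, `applyOnce`, and `recover` are hypothetical stand-ins, and the starting balances are made up. The check inside `applyOnce` plays the role of the `$ne` condition in the account update.

```javascript
// In-memory stand-ins for the accounts and transactions collections (not MongoDB calls).
const accounts = {
  A: { balance: 1000, pendingtransactions: [] },
  B: { balance: 1000, pendingtransactions: [] },
};
const txn = { _id: 1, source: "A", destination: "B", value: 100, state: "pending" };

// Apply the change only if this transaction is not already recorded on the
// account (the role the $ne guard plays in the update query).
function applyOnce(account, t, delta) {
  if (account.pendingtransactions.includes(t._id)) return false;
  account.balance += delta;
  account.pendingtransactions.push(t._id);
  return true;
}

// Resume a transaction from whatever state it was left in.
function recover(t) {
  if (t.state === "pending") {
    applyOnce(accounts[t.source], t, -t.value);
    applyOnce(accounts[t.destination], t, t.value);
    t.state = "committed";
  }
  if (t.state === "committed") {
    for (const name of [t.source, t.destination]) {
      const acct = accounts[name];
      acct.pendingtransactions = acct.pendingtransactions.filter((id) => id !== t._id);
    }
    t.state = "done";
  }
}

// Simulate a crash partway through the updates: only the source was debited.
applyOnce(accounts.A, txn, -txn.value);
recover(txn);
console.log(accounts.A.balance, accounts.B.balance, txn.state);
```

After recovery the source has been debited exactly once and the destination credited exactly once, even though the debit was attempted twice.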

Conclusion
  RDBMSs and "NoSQL" databases have different strengths and weaknesses
  Pick the right tool for the job