SQL, NoSQL, MongoDB. CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden

Similar documents
How to Scale MongoDB. Apr

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Course Content MongoDB

MONGODB INTERVIEW QUESTIONS

Group13: Siddhant Deshmukh, Sudeep Rege, Sharmila Prakash, Dhanusha Varik

MongoDB Schema Design

The course modules of MongoDB developer and administrator online certification training:

Scaling for Humongous amounts of data with MongoDB

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

MongoDB. copyright 2011 Trainologic LTD

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

Scaling with mongodb

MongoDB An Overview. 21-Oct Socrates

MongoDB Architecture

Document Object Storage with MongoDB

CMSC 461 Final Exam Study Guide

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

CA485 Ray Walshe Google File System

Scaling MongoDB. Percona Webinar - Wed October 18th 11:00 AM PDT Adamo Tonete MongoDB Senior Service Technical Service Engineer.

MongoDB 2.2 and Big Data

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Administration Naive DBMS CMPT 454 Topics. John Edgar 2

Final Review. May 9, 2017

BigTable. CSE-291 (Cloud Computing) Fall 2016

Final Review. May 9, 2018 May 11, 2018

Distributed File Systems II

CSE 530A. Non-Relational Databases. Washington University Fall 2013

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Oral Questions and Answers (DBMS LAB) Questions & Answers- DBMS

Final Exam Logistics. CS 133: Databases. Goals for Today. Some References Used. Final exam take-home. Same resources as midterm

CockroachDB on DC/OS. Ben Darnell, CTO, Cockroach Labs

Percona Live Updated Sharding Guidelines in MongoDB 3.x with Storage Engine Considerations. Kimberly Wilkins

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions

Kim Greene - Introduction

MongoDB and Mysql: Which one is a better fit for me? Room 204-2:20PM-3:10PM

MongoDB CRUD Operations

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

CS 655 Advanced Topics in Distributed Systems

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

GFS: The Google File System. Dr. Yingwu Zhu

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

Introduction to NoSQL Databases

! Design constraints. " Component failures are the norm. " Files are huge by traditional standards. ! POSIX-like

The Google File System (GFS)

Chapter 24 NOSQL Databases and Big Data Storage Systems

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

CompSci 516 Database Systems

What s new in Mongo 4.0. Vinicius Grippa Percona

CSE 344 Final Review. August 16 th

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

How do we build TiDB. a Distributed, Consistent, Scalable, SQL Database

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.

Beyond Relational Databases: MongoDB, Redis & ClickHouse. Marcos Albe - Principal Support Percona

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

MongoDB CRUD Operations

A Journey to DynamoDB

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

How VoltDB does Transactions

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Use multi-document ACID transactions in MongoDB 4.0 November 7th Corrado Pandiani - Senior consultant Percona

Distributed Filesystem

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

CSE 124: Networked Services Lecture-16

The Google File System

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

A Survey Paper on NoSQL Databases: Key-Value Data Stores and Document Stores

MongoDB. History. mongodb = Humongous DB. Open-source Document-based High performance, high availability Automatic scaling C-P on CAP.

Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

Shen PingCAP 2017

Scylla Open Source 3.0

Comparing SQL and NOSQL databases

Database Architectures

7. Query Processing and Optimization

Reduce MongoDB Data Size. Steven Wang

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong

Scaling MongoDB: Avoiding Common Pitfalls. Jon Tobin Senior Systems

FAQs Snapshots and locks Vector Clock

MapReduce. Stony Brook University CSE545, Fall 2016

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Map-Reduce. Marco Mura 2010 March, 31th

Distributed System. Gang Wu. Spring,2018

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

CSE 124: Networked Services Fall 2009 Lecture-19

Distributed Data Management Replication

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Distributed Data Store

Topics in Reliable Distributed Systems

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

MongoDB: Comparing WiredTiger In-Memory Engine to Redis. Jason Terpko DBA, Rackspace/ObjectRocket 1

CISC 7610 Lecture 2b The beginnings of NoSQL

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Transcription:

SQL, NoSQL, MongoDB CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden

SQL Databases Really better called Relational Databases Key construct is the Relation, a.k.a. the table Rows represent records Columns represent attribute sets Find things within tables by brute force or indexes, e.g. B-Trees or Hash Tables Cross-reference tables via shared keys, basically an optimized cross-product, known as a join Expensive operation

SQL Databases Backbone of modern apps Very, very high throughput can be achieved Scaling is challenging because there is no good way to partition tables while still achieving semantics Amazing work-arounds are possible virtualize SANS to large storage devices, etc But, the model is what it is.

NoSQL Databases Any of the more modern databases that essentially give up the ability to do joins in order to be able to avoid huge monolith tables and scale Key-Value (Dynamo or basic Cassandra) Column-based (Hbase) Document-based (MongoDB) Usually has more flexible scheme (no rigid tables means no rigid NxM structure)

MongoDB Document-based NoSQL database Max 16MB per document Documents are rich BSON (Binary JSON) key-value documents Collections hold documents and can share indexes Some like to suggest they are analogous to tables, but not all documents in a collection must have the same structure. They just have some of the same keys Databases hold collections hold documents

MongoDB Document Note field:value tuples https://docs.mongodb.com/manual/_images/crud-annotated-document.png

Embedded Documents var mydoc = { _id: ObjectId( 1abcd45b123456754321abcd"), name: { first: Gregory", last: Kesden" }, classes: [ CSE-291", CSE-110", CSE-500" ], contact: { phone: { type: "cell", number: 412-818-7813" } }, } Array access: classes.0 Embedded doc access: contact.phone.number

Join Operations? In general, not a MongoDB thing Get data from different places Slow and expensive operation Much better to take advantage of denormalized structure to embed related things Can also chase pointers by chasing an id from one document into another document via another query. (More like using a foreigh key in SQL than a join) Worst case? Multiple passes using shared key.

Indexes (Much like any other DB) https://docs.mongodb.com/manual/indexes/

Single Field Indexes https://docs.mongodb.com/manual/indexes/

Compound Indexes https://docs.mongodb.com/manual/indexes/

Multi-Key (Array Field) Indexes Note: One index for each element of the array https://docs.mongodb.com/manual/core/index-multikey/

More About Indexing Matches, Range-based results, etc Geospatial searches Text searches, language based, includes only meaningful words Partial indexes filter and only index matching documents TTL indexes, internally used to age out documents, where desired Covered queries are queries that can be answered directly from indexes, without scanning Intersection of indexes.

Aggregation Pipeline: Filter, Group, Sort, Ops (Average, Concatenation, etc) https://docs.mongodb.com/manual/aggregation/

Map-Reduce https://docs.mongodb.com/manual/aggregation/

Concurrency Multiple options, WiredTiger the default Document-level concurrency control for write operations. As a result, multiple clients can modify different documents of a collection at the same time. For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation. Some global operations, typically short lived operations involving multiple databases, still require a global instance-wide lock. Some other operations, such as dropping a collection, still require an exclusive database lock. https://docs.mongodb.com/manual/core/wiredtiger/

Snapshots and Checkpoints At the start of an operation, WiredTiger provides a point-in-time snapshot of the data to the transaction. A snapshot presents a consistent view of the in-memory data. When writing to disk, WiredTiger writes all the data in a snapshot to disk in a consistent way across all data files. The now-durable data act as a checkpoint in the data files. The checkpoint ensures that the data files are consistent up to and including the last checkpoint; i.e. checkpoints can act as recovery points. MongoDB configures WiredTiger to create checkpoints at intervals of 60 seconds or 2 gigabytes of journal data. During the write of a new checkpoint, the previous checkpoint is still valid. The new checkpoint becomes accessible and permanent when WiredTiger s metadata table is atomically updated to reference the new checkpoint. Once the new checkpoint is accessible, WiredTiger frees pages from the old checkpoints. Journaling needed to recover changes ahead of checkpoints https://docs.mongodb.com/manual/core/wiredtiger/

Journaling Compressed write-ahead log (WAL) Used to recover state more recent than most recent checkpoint Buffered in memory, synced every 50ms Deleted upon clean shutdown Depending on file system, can preallocate log to avoid slow allocation

Replica Sets: Asynchronous Replication https://docs.mongodb.com/manual/replication/

Arbiters for Quorums: Real World Student-Like Move https://docs.mongodb.com/manual/replication/

Automatic Failover Missing heartbeats for 10sec? Call election Secondary with most votes becomes new primary, temporarily But, uses bully-like primary to agree on top dog in the end Can be non-voting secondaries. Can be read, but not elected or voting. Read-only during election https://docs.mongodb.com/manual/replication/

Supporting Scale Vertical bigger host Horizontal -- Sharding More hosts Higher throughput Greater capacity

Sharding Documents w/in sharded collection have shard key Immutable, sued for sharding Choice is very important, because key must be found in range by index. Can be bottleneck Collection partitioned by shard key range into chunks Chunks are distributed and replicated (replica sets)

Chunks Sharded into chunks by shard key Can be migrated manually or balancer Can be split if too large