NoSQL Database Comparison: Bigtable, Cassandra and MongoDB CJ Campbell Brigham Young University October 16, 2015



Introduction
The Systems
    Google Bigtable: History; Data model & operations; Physical Storage; ACID properties; Scalability
    Apache Cassandra: History; Data model & operations; Physical Storage; ACID properties; Scalability
    MongoDB: History; Data model & operations; Physical Storage; ACID properties; Scalability
Differences
Conclusion
References

Introduction

As distributed systems are adopted and grow to scale, the need for scalable database solutions that meet an application's exact needs has become increasingly important. In the early days of computing, databases were almost entirely relational. Today, new breeds of database have emerged, called NoSQL databases. They are a common element in the design of most distributed software platforms. Each database is suited to a slightly different purpose from its peers. This paper discusses the features, similarities, and differences of three NoSQL databases: Google Bigtable, Apache Cassandra, and MongoDB.

The Systems

In this section, each of the three NoSQL databases is analyzed in depth, starting with Google Bigtable, then Apache Cassandra, and finally MongoDB. The analysis covers their history, data model, accepted operations, physical storage schema, ACID properties, and scalability.

Google Bigtable

History. Bigtable was designed within Google to meet its internal data processing needs at scale. Development began in 2004 as part of an effort to handle large amounts of data across applications such as web indexing, Google Earth, Google Finance, and more (Google, Inc., 2006). It first went into production use in April 2005. In May 2015, Google released a public version of Bigtable called Cloud Bigtable as part of the Google Cloud Platform (O'Connor, 2015).

Data model & operations. Bigtable stores semi-structured data. At a high level, it is a key-value store. Diving deeper, the value is a set of columns that can differ from row to row, as in a jagged array. Columns are grouped into column families, which allows for iteration across similar data and for backend efficiency. Cells can contain multiple versions of the same data, indexed by timestamp, with a configurable limit to keep only recent entries.
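
The data model described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not how Bigtable is implemented or accessed: the class `ToyBigtable`, its methods, and the `max_versions` parameter are hypothetical names invented here, and the sample rows are made up. It shows rows with differing ("jagged") sets of columns, columns grouped by family, and timestamped cell versions trimmed to a configurable limit.

```python
from collections import defaultdict


class ToyBigtable:
    """Toy model (hypothetical): row key -> (family, qualifier) -> versions."""

    def __init__(self, max_versions=3):
        # Hypothetical stand-in for the configurable version limit (assumed >= 1).
        self.max_versions = max_versions
        # Each row holds its own set of columns, so rows may differ from each other.
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, timestamp, value):
        """Add one timestamped version of a cell, keeping only recent versions."""
        versions = self.rows[row_key].setdefault((family, qualifier), [])
        versions.append((timestamp, value))
        # Newest first, then drop everything beyond the version limit.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row_key, family, qualifier):
        """Return the most recent value of a cell, or None if it does not exist."""
        versions = self.rows.get(row_key, {}).get((family, qualifier))
        return versions[0][1] if versions else None

    def family(self, row_key, family):
        """Return {qualifier: latest value} for one column family of a row."""
        return {
            qual: versions[0][1]
            for (fam, qual), versions in self.rows.get(row_key, {}).items()
            if fam == family and versions
        }


# Made-up sample data: two rows with different columns.
table = ToyBigtable(max_versions=2)
table.put("com.cnn.www", "contents", "html", 1, "<html>v1")
table.put("com.cnn.www", "contents", "html", 2, "<html>v2")
table.put("com.cnn.www", "contents", "html", 3, "<html>v3")
table.put("com.cnn.www", "anchor", "cnnsi.com", 1, "CNN")
table.put("com.example", "contents", "html", 1, "<html>example")

latest = table.get("com.cnn.www", "contents", "html")
kept_versions = table.rows["com.cnn.www"][("contents", "html")]
anchors = table.family("com.cnn.www", "anchor")
```

With a limit of two versions, the oldest write to the `contents` cell is discarded, while the `anchor` family exists only on the first row.
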
Data is sorted in lexicographic order by row key, which allows users to choose keys that give good data locality, thereby increasing performance.

Physical Storage. Google's 2006 Bigtable paper describes a Bigtable as "a sparse, distributed, persistent multidimensional sorted map." Data is stored on the Google File System (GFS) in the SSTable file format, which is optimized for reads and writes on similarly-keyed data.

ACID properties. Data reads and writes are atomic on a per-row basis, regardless of how many columns that row contains. Atomic actions are not available across multiple rows.

Scalability. The introduction to Google's Bigtable paper claims the ability to reliably scale to petabytes of data and thousands of machines. It can be configured to optimize for different needs, such as availability or low latency. One example is the ability to serve reads from memory instead of from disk.

Apache Cassandra

History. The Cassandra project was created around 2008 by Avinash Lakshman and Prashant Malik. It is named after a prophet of Greek mythology. Some reports online claim that the name was chosen in opposition to Oracle's database (Kellabyte, 2013). It was created to power the inbox search feature for Facebook. The Cassandra project was open-sourced on Google Code in July 2008, became an Apache Incubator project in 2009, and graduated to a top-level Apache project in 2010. While the open community continued to embrace Cassandra, Facebook tapered its usage. In 2010 Facebook released a new version of messaging that used HBase instead of Cassandra because its engineers found the model "to be a difficult pattern to reconcile for our new Messages infrastructure" (Muthukkaruppan, 2010). Despite being abandoned by its parent company, Cassandra is ranked the most popular wide column store, and the eighth-most popular database overall, as of October 2015 (DB-Engines Ranking, 2015).

Data model & operations. Cassandra's data model has evolved over time. It began with column families and super column families. Only three data operations were initially available: insert, get, and delete. The original design is completely unrecognizable in the Cassandra of 2015 (Ellis, n.d.).
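
As a rough picture of that original model, the sketch below treats a column family as a nested map and exposes only the three operations named above. It is a toy under stated assumptions, not Cassandra's actual API: the class `ToyColumnFamilyStore`, its method signatures, and the sample keys are hypothetical. A super column family would add one more level of nesting, with each column holding its own map of sub-columns.

```python
class ToyColumnFamilyStore:
    """Toy model (hypothetical): row key -> column family -> column -> value."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, family, column, value):
        """Create or overwrite a single column value."""
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family, column):
        """Return a column value, or None if the row, family, or column is absent."""
        return self.rows.get(row_key, {}).get(family, {}).get(column)

    def delete(self, row_key, family, column):
        """Remove a column if it exists; deleting a missing column is a no-op."""
        self.rows.get(row_key, {}).get(family, {}).pop(column, None)


# Made-up example loosely modeled on an inbox search index.
store = ToyColumnFamilyStore()
store.insert("user42", "term_index", "hello", "msg-1001")
store.insert("user42", "term_index", "meeting", "msg-1002")

before_delete = store.get("user42", "term_index", "hello")
store.delete("user42", "term_index", "hello")
after_delete = store.get("user42", "term_index", "hello")
```

There is no query language and no way to filter on values here; every access goes through a row key, which is the limitation the later table-based model addresses.
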
Today's model looks more like a collection of denormalized non-relational tables, with a query language similar to that of relational databases. Denormalization provides a speed increase because there is no need to join across tables, although it comes at the price of data duplication. Tables can be altered live without locking or downtime (Datastax, 2015).

Physical Storage. Data is distributed across a cluster using a consistent hashing ring, so adding or removing a node only affects its immediate neighbors on the ring. Cassandra uses virtual nodes to spread data for load balancing (Ellis, n.d.). The nodes to which data is initially assigned are called coordinators. The cluster can be configured to replicate N copies of the data, with additional configuration for locality awareness. This ring allows the cluster to operate without any single point of failure. Data is stored on the local filesystem using a log-structured design that favors fast writes: changes are appended to a local commit log and then applied to an in-memory structure. At a dynamically calculated threshold, the data in memory is flushed to disk.

ACID properties. Communication within a cluster is based on a gossip protocol, and the system follows an eventual consistency model. Like many distributed database systems, Cassandra is built for high availability and partition tolerance with eventual consistency. A useful feature, however, is that this consistency is configurable to meet specific use cases. Operations on a single node are ACID compliant, though operations across the cluster are not. For transactional writes, Cassandra uses a modified Paxos consensus protocol. This costs performance and should only be used for transactionally sensitive operations (Ellis, 2013).

Scalability. The distributed structure of Cassandra makes it a viable option for globally replicated data. In 2011, Netflix performed a benchmarking test and reported that Cassandra scales linearly (Cockroft & Sheahan, 2011). The University of Toronto performed a similar test in 2012 with similar results, explaining that this comes at the price of high write and read latencies (Rabl et al., n.d.). Cassandra's feature-richness comes at a cost, however. The database is packed with powerful tooling and configurability, but as the complexity of the system increases, so does the learning curve. Thus, the user base that can support Cassandra is smaller than that of other databases, and the availability of maintenance staff is an important consideration.

MongoDB

History. MongoDB was created by 10gen in 2007 as the data layer of its platform as a service, called Babble. The database got its name from the word "humongous" (History of MongoDB, n.d.). The market didn't take to Babble very well, and so in 2009 the project was open sourced. By August 2013, the project had become the central focus of 10gen's development, so much so that the company changed its name to MongoDB Inc. (Harris, 2013). Since then, it has become the world's most popular NoSQL database (DB-Engines Ranking, 2015).

Data model & operations. MongoDB was developed in a JavaScript-oriented environment, and it shows in its data structure. It is classified as a document store, or document-oriented database. Document stores provide the same lookup functionality as a key-value store, but also provide visibility into the stored documents (MongoDB, 2015). Data is stored in BSON, or Binary JSON, which is just what the name suggests: a binary encoding optimized for JSON. In everyday usage, it looks almost exactly like JSON to developers. Because JSON usage is so widespread, MongoDB's learning curve is small compared to other databases. Visibility into documents opens the API to querying, filtering, and sorting on values within a document, modifying individual document values, and running MapReduce and aggregation functions. Documents are grouped into collections in MongoDB much as rows are grouped into tables in a relational store. Documents in a collection should contain similar data and have the same structure, though this is not enforced.

Physical Storage. The size of a BSON document is limited to 16 MB. Just as documents can be queried by inner value, they can also be indexed on those values. The administrator can define a shard key to increase data locality, which optimizes aggregation functions (Suter, 2012).

ACID properties. Operations are atomic at the document level. This means that data which must be updated atomically must live within a single document. Atomic transactions are not possible across multiple documents.

Scalability. MongoDB automatically manages horizontal scaling across shards. As a node is added to the cluster, data automatically moves from other nodes onto the new one until balance is restored.

Differences

Whereas Apache Cassandra and MongoDB are both open-source projects, Google Bigtable is offered only as a proprietary, hosted database solution. There are pros and cons to either side, depending on the resources of the company. When using a hosted solution, the company must pay for database usage.
However, with paid usage comes paid support, and the company doesn't need an administrator to handle the infrastructural and operational needs of the database. The open-source databases are free to use, but deployment, configuration, and infrastructure are left entirely to the implementors.

In its early days, Cassandra was similar to Bigtable, because they both supported column families. Over time their data models diverged, but both still work on a table structure.

According to a study by the 451 Group in September 2015, MongoDB is by far the most referenced NoSQL skill on LinkedIn (The 451 Group, 2015). MongoDB is also more popular than Cassandra and Bigtable on the DB-Engines index (DB-Engines Ranking, 2015). This is probably because it is so easy to learn and administer. Although MongoDB tops the charts for ease of use, it isn't nearly as feature-complete or scalable as the other two. Bigtable and Cassandra are much better suited to a reporting workload. When tuned correctly for such tasks, they perform better at high scale and allow semi-relational queries.

Conclusion

Each of these databases is unique in its own right, and the decision of which one to pick depends on the needs of the developer and the project. If an app is being prototyped with data structures that aren't quite solid, MongoDB is a great choice. If a company needs to scale its database quickly without worrying about infrastructure and can afford the price tag, Bigtable may be a good option. If the company needs to tune its data access to fit its needs exactly, and is able to provide in-house support, Cassandra may be the right choice. There is no panacea among NoSQL databases. All have their strengths, and all have their weaknesses. The right database depends on the need.

References

Cockroft, A., & Sheahan, D. (2011, Nov 2). Benchmarking Cassandra Scalability on AWS - Over a million writes per second. Retrieved from Netflix: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Datastax. (2015, Oct 12). Data modeling example. Retrieved from Datastax: http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_intro_c.html

DB-Engines Ranking. (2015, Oct 15). Retrieved from DB-Engines: http://db-engines.com/en/ranking

Ellis, J. (2013, July 23). Lightweight transactions in Cassandra 2.0. Retrieved from Datastax Developer Blog: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

Ellis, J. (n.d.). Facebook's Cassandra paper, annotated and compared to Apache Cassandra 2.0. Retrieved Oct 15, 2015, from Datastax: http://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html

Google, Inc. (2006). Bigtable: A Distributed Storage System for Structured Data. Retrieved from http://research.google.com/archive/bigtable-osdi06.pdf

Harris, D. (2013, Aug 27). 10gen embraces what it created, becomes MongoDB Inc. Retrieved from Gigaom Research: https://gigaom.com/2013/08/27/10gen-embraces-what-it-created-becomes-mongodb-inc/

History of MongoDB. (n.d.). Retrieved October 16, 2015, from Snail in a Turtleneck: http://www.kchodorow.com/blog/2010/08/23/history-of-mongodb/

Kellabyte. (2013, Jan 4). The meaning behind the name of Apache Cassandra. Retrieved from Kellabyte: http://kellabyte.com/2013/01/04/the-meaning-behind-the-name-of-apache-cassandra/

MongoDB. (2015, October 16). Data Model Design. Retrieved from MongoDB: http://docs.mongodb.org/manual/core/data-model-design/

Muthukkaruppan, K. (2010, Nov 15). The Underlying Technology of Messages. Retrieved from Facebook: https://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919

O'Connor, C. (2015, May 6). Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform. Retrieved from Google Cloud Platform Blog: http://googlecloudplatform.blogspot.com/2015/05/introducing-google-cloud-bigtable.html

Rabl, T., Gomez-Villamor, S., Sadoghi, M., Muntes-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (n.d.). Solving Big Data Challenges for Enterprise Application. Retrieved October 16, 2015, from www.vldb.org: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

Suter, R. (2012, January). MongoDB: An introduction and performance analysis. Retrieved from http://wiki.hsr.ch/datenbanken/files/mongodb.pdf

The 451 Group. (2015, October 1). NoSQL LinkedIn Skills Index September 2015. Retrieved from Too much information: The 451 Take on information management: https://blogs.the451group.com/information_management/tag/nosql/