
Running Head: NOSQL DATABASE COMPARISON: BIGTABLE, CASSANDRA AND MONGODB

NoSQL Database Comparison: Bigtable, Cassandra and MongoDB
CJ Campbell
Brigham Young University
October 16, 2015

INTRODUCTION
THE SYSTEMS
    Google Bigtable
        History
        Data model & operations
        Physical Storage
        ACID properties
        Scalability
    Apache Cassandra
        History
        Data model & operations
        Physical Storage
        ACID properties
        Scalability
    MongoDB
        History
        Data model & operations
        Physical Storage
        ACID properties
        Scalability
Differences
Conclusion
References

Introduction

As distributed systems are adopted and grow to scale, the need for scalable database solutions that meet an application's exact needs has become increasingly important. In the early days of computing, databases were almost entirely relational. Today, a new breed of databases has emerged, called NoSQL databases. They are a common element in the design of most distributed software platforms, and each is suited to a slightly different purpose from its peers. This paper discusses the features, similarities, and differences of three NoSQL databases: Google Bigtable, Apache Cassandra, and MongoDB.

The Systems

In this section, each of the three NoSQL databases is analyzed in depth, starting with Google Bigtable, then Apache Cassandra, and finally MongoDB. The analysis covers each system's history, data model, supported operations, physical storage, ACID properties, and scalability.

Google Bigtable

History. Bigtable was designed within Google to meet the company's internal data processing needs at scale. Development began in 2004 as part of an effort to handle large amounts of data across applications such as web indexing, Google Earth, Google Finance and more (Google, Inc., 2006). It first went into production use in April 2005. In May 2015, Google released a public version of Bigtable called Cloud Bigtable as part of the Google Cloud Platform (O'Connor, 2015).

Data model & operations. Bigtable stores semi-structured data. At a high level, it is a key-value store. Looking deeper, the value is a set of columns that can differ from row to row, as in a jagged array. Columns are grouped into column families, which allows iteration across similar data and efficient storage on the back end. Cells can contain multiple versions of the same data, indexed by timestamp, with a configurable limit that keeps only the most recent entries.
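The data model described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not Bigtable's API: the class name `SparseVersionedTable`, its methods, the `family:qualifier` column naming and the sample row keys are all hypothetical and chosen for this example.

```python
from collections import defaultdict


class SparseVersionedTable:
    """Toy model (hypothetical): row key -> "family:qualifier" -> [(timestamp, value), ...]."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # Rows are sparse: each row holds only the columns that were written to it.
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value, timestamp):
        versions = self.rows[row_key].setdefault(column, [])
        versions.append((timestamp, value))
        # Keep the newest versions first and drop anything beyond the limit.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row_key, column):
        """Return the most recent value of a cell, or None if the cell is absent."""
        versions = self.rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def columns_in_family(self, row_key, family):
        """List the columns of one row that belong to the given column family."""
        prefix = family + ":"
        return sorted(c for c in self.rows.get(row_key, {}) if c.startswith(prefix))


table = SparseVersionedTable(max_versions=2)
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example/index", "contents:html", "<html>v3</html>", timestamp=3)
table.put("com.example/index", "anchor:other.org", "Example", timestamp=1)
table.put("org.sample/home", "contents:html", "<html>home</html>", timestamp=1)

latest = table.get("com.example/index", "contents:html")
kept_versions = len(table.rows["com.example/index"]["contents:html"])
```

In this sketch the two rows carry different sets of columns, and the third write to `contents:html` pushes the oldest version out because the limit is two versions per cell.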
Data is sorted in lexicographic order by row key, which allows users to choose keys that give good data locality, thereby increasing performance.

Physical Storage. Google's 2006 Bigtable paper describes its structure as "a sparse, distributed, persistent multidimensional sorted map." Data is stored on the Google File System (GFS) in the SSTable file format, which is optimized for reads and writes on similarly keyed data.

ACID properties. Reads and writes are atomic on a per-row basis, regardless of how many columns the row contains. Atomic operations are not available across multiple rows.

Scalability. The introduction to Google's Bigtable paper claims the ability to scale reliably to petabytes of data and thousands of machines. Bigtable can be configured to optimize for different needs, such as availability or low latency. One example is the ability to serve reads from memory instead of from disk.

Apache Cassandra

History. The Cassandra project was created around 2008 by Avinash Lakshman and Prashant Malik. It is named after a prophetess in Greek mythology. Some reports online claim that the name was chosen in opposition to Oracle's database (Kellabyte, 2013). Its original purpose was to power the inbox search feature for Facebook. The project was open-sourced on Google Code in July 2008, became an Apache Incubator project in 2009, and graduated to a top-level Apache project in 2010. While the open-source community continued to embrace Cassandra, Facebook tapered its usage. In 2010 Facebook released a new version of its messaging product that used HBase instead of Cassandra because its engineers found Cassandra's model "to be a difficult pattern to reconcile for our new Messages infrastructure" (Muthukkaruppan, 2010). Despite being set aside by the company that created it, Cassandra is ranked as the most popular wide column store and the eighth most popular database overall as of October 2015 (DB-Engines Ranking, 2015).

Data model & operations. Cassandra's data model has evolved over time. It began with column families and super column families. Only three data operations were initially available: insert, get and delete. The original design is completely unrecognizable in the Cassandra of 2015 (Ellis, n.d.).
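To make the original three-operation interface concrete, here is a minimal sketch of a column-family store that supports only insert, get and delete. It is a toy model under stated assumptions: the class `ColumnFamilyStore`, its method signatures and the sample data are hypothetical and are not taken from Cassandra's code or documentation.

```python
class ColumnFamilyStore:
    """Toy model (hypothetical): table -> row key -> column name -> value."""

    def __init__(self):
        self.tables = {}

    def insert(self, table, key, columns):
        """Add or overwrite the given columns of one row."""
        row = self.tables.setdefault(table, {}).setdefault(key, {})
        row.update(columns)

    def get(self, table, key, column):
        """Return one column value, or None if the row or column is missing."""
        return self.tables.get(table, {}).get(key, {}).get(column)

    def delete(self, table, key, column):
        """Remove one column from a row; a missing column is ignored."""
        self.tables.get(table, {}).get(key, {}).pop(column, None)


store = ColumnFamilyStore()
store.insert("inbox_search", "user42", {"term:hello": "msg-1", "term:lunch": "msg-7"})
before_delete = store.get("inbox_search", "user42", "term:hello")
store.delete("inbox_search", "user42", "term:hello")
after_delete = store.get("inbox_search", "user42", "term:hello")
```

Every operation in this sketch addresses a single row by key, so there is no way to express a join or a query across rows.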
Today's model looks more like a collection of denormalized, non-relational tables, with a query language similar to that of relational databases. Denormalization provides a speed increase because there is no need to join across tables, although it comes at the price of data duplication. Tables can be updated live without locking or downtime (Datastax, 2015).

Physical Storage. Data is distributed across a cluster using a consistent hashing ring. In such a ring, adding or removing a node only affects its immediate neighbors; Cassandra additionally uses virtual nodes to rearrange data for load balancing (Ellis, n.d.). The nodes to which data is initially assigned are called coordinators. They can be configured to replicate N copies of the data across the cluster, with additional configuration options for locality awareness. This ring allows the cluster to operate without any single point of failure.

Data is stored on the local filesystem of each node. The write path is optimized for speed: changes are appended to a local commit log and then placed in an in-memory cache. At a dynamically calculated threshold, the data in memory is flushed to disk.

ACID properties. Nodes in a cluster exchange state using a gossip protocol, and replicas converge under an eventual consistency model. This means that, like most distributed database systems, Cassandra is built for high availability and partition tolerance with eventual consistency. A useful feature, however, is that the consistency level is configurable to meet specific use cases. Operations on a single node are ACID compliant, though operations across the cluster are not. For transactional writes, Cassandra uses a modified Paxos consensus protocol. This costs performance and should be used only for transactionally sensitive operations (Ellis, 2013).

Scalability. The distributed structure of Cassandra makes it a viable option for globally replicated data. In 2011, Netflix performed a benchmarking test and reported that Cassandra scales linearly (Cockroft & Sheahan, 2011). Researchers at the University of Toronto performed a similar test in 2012 with similar results, explaining that this scalability comes at the price of high write and read latencies (Rabl et al., n.d.). Cassandra's feature richness carries its own cost, however. The database comes with powerful tooling and extensive configurability, but as the complexity of the system increases, the learning curve also increases.
Thus, the user base that can support Cassandra is smaller than that of other databases, and the availability of maintenance staff is an important consideration.

MongoDB

History. MongoDB was created by 10gen in 2007 as the data layer of its platform as a service, called Babble. The database got its name from the word "humongous" (History of MongoDB, n.d.). The market didn't take to Babble very well, and so in 2009 the project was open-sourced. By August 2013, the database had become the central focus of 10gen's development, so much so that the company changed its name to MongoDB Inc. (Harris, 2013). Since then, it has become the world's most popular NoSQL database (DB-Engines Ranking, 2015).

Data model & operations. MongoDB was developed in a JavaScript-oriented environment, and it shows in its data structure. It is classified as a document store, or document-oriented database. Document stores provide the same lookup functionality as a key-value store, but also provide visibility into the stored documents (MongoDB, 2015). Data is stored in BSON, or Binary JSON, which is what the name suggests: a binary-encoded form of JSON. In everyday usage, it looks almost exactly like JSON to developers. Because JSON usage is so widespread, MongoDB's learning curve is small compared to that of other databases. Visibility into documents opens the API to querying, filtering, and sorting based on values within a document, modifying individual document values, and running MapReduce and aggregation functions.

Documents are grouped into collections in MongoDB much as rows are grouped into tables in a relational store. Documents in a collection should contain similar data and have the same structure, though this is not enforced.

Physical Storage. The size of a BSON document is limited to 16 MB. Just as documents can be queried by inner value, they can also be indexed by inner value. The administrator can define a shard key to increase data locality, which optimizes aggregation functions (Suter, 2012).

ACID properties. Operations are atomic at the document level. This means that data which must be updated atomically must reside within a single document. Atomic transactions are not possible across multiple documents.

Scalability. MongoDB automatically manages horizontal scaling across shards. As a node is added to the cluster, data is automatically migrated from other nodes onto the new one until balance is restored.

Differences

Whereas Apache Cassandra and MongoDB are both open-source projects, Google Bigtable is offered only as a proprietary, hosted database solution. There are pros and cons to either approach, depending on the resources of the company. When using a hosted solution, the company must pay for database usage.
However, with paid usage comes paid support, and the company doesn't need an administrator to handle the infrastructural and operational needs of the database. The open-source databases are free to use, but deployment, configuration, and infrastructure are left entirely to the implementors.

In its early days, Cassandra was similar to Bigtable because both supported column families. Over time their data models diverged, but both still work on a table structure.

According to a study by the 451 Group in September 2015, MongoDB is by far the most referenced NoSQL skill on LinkedIn (The 451 Group, 2015). MongoDB is also more popular than Cassandra and Bigtable on the DB-Engines index (DB-Engines Ranking, 2015). This is probably because it is so easy to learn and administer; in ease of use, it beats both Cassandra and Bigtable. Although MongoDB tops the charts for ease of use, it isn't nearly as feature-complete or scalable as the other two. Bigtable and Cassandra are much better suited to a reporting workload. When tuned correctly for such tasks, they perform better at high scale and allow semi-relational queries.

Conclusion

Each of these databases is unique in its own right, and the decision of which one to pick depends on the needs of the developer and the project. If an app is being prototyped with data structures that aren't quite solid, MongoDB is a great choice. If a company needs to scale its database quickly without worrying about infrastructure and can afford the price tag, Bigtable may be a good option. If the company needs to tune its data access to fit its needs exactly, and if it is able to provide in-house support, Cassandra may be the right choice.

There's really no panacea among NoSQL databases. All have their strengths, and all have their weaknesses. The right database depends on the need.

References

Cockroft, A., & Sheahan, D. (2011, Nov 2). Benchmarking Cassandra Scalability on AWS - Over a million writes per second. Retrieved from Netflix: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Datastax. (2015, Oct 12). Data modeling example. Retrieved from Datastax: http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_intro_c.html

DB-Engines Ranking. (2015, Oct 15). Retrieved from DB-Engines: http://db-engines.com/en/ranking

Ellis, J. (2013, July 23). Lightweight transactions in Cassandra 2.0. Retrieved from Datastax Developer Blog: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

Ellis, J. (n.d.). Facebook's Cassandra paper, annotated and compared to Apache Cassandra 2.0. Retrieved Oct 15, 2015, from Datastax: http://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html

Google, Inc. (2006). Bigtable: A Distributed Storage System for Structured Data. Retrieved from http://research.google.com/archive/bigtable-osdi06.pdf

Harris, D. (2013, Aug 27). 10gen embraces what it created, becomes MongoDB Inc. Retrieved from Gigaom Research: https://gigaom.com/2013/08/27/10gen-embraces-what-it-created-becomes-mongodb-inc/

History of MongoDB. (n.d.). Retrieved October 16, 2015, from Snail in a Turtleneck: http://www.kchodorow.com/blog/2010/08/23/history-of-mongodb/

Kellabyte. (2013, Jan 4). The meaning behind the name of Apache Cassandra. Retrieved from Kellabyte: http://kellabyte.com/2013/01/04/the-meaning-behind-the-name-of-apache-cassandra/

MongoDB. (2015, October 16). Data Model Design. Retrieved from MongoDB: http://docs.mongodb.org/manual/core/data-model-design/

Muthukkaruppan, K. (2010, Nov 15). The Underlying Technology of Messages. Retrieved from Facebook: https://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919

O'Connor, C. (2015, May 6). Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform. Retrieved from Google Cloud Platform Blog: http://googlecloudplatform.blogspot.com/2015/05/introducing-google-cloud-bigtable.html

Rabl, T., Gomez-Villamor, S., Sadoghi, M., Muntes-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (n.d.). Solving Big Data Challenges for Enterprise Application. Retrieved October 16, 2015, from www.vldb.org: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

Suter, R. (2012, January). MongoDB: An introduction and performance analysis. Retrieved from http://wiki.hsr.ch/datenbanken/files/mongodb.pdf

The 451 Group. (2015, October 1). NoSQL LinkedIn Skills Index September 2015. Retrieved from Too much information: The 451 Take on information management: https://blogs.the451group.com/information_management/tag/nosql/