Tanton Jeppson
CS 401R
Lab 3

Cassandra, MongoDB, and HBase

Introduction

For this report I have chosen to take a deeper look at three NoSQL database systems: Cassandra, MongoDB, and HBase. I chose these three mostly because of their recent popularity and growth. By presenting the history, data model, physical storage, transactions, and scalability of each system, I will also be better prepared to choose which one best fits a specific situation in the future.

Cassandra

History: Cassandra is a database management system originally designed at Facebook by Avinash Lakshman and Prashant Malik. The goal of the project was to create a system that could be spread across many computers (nodes) so that the failure of any one part did not mean failure of the whole system (no single point of failure).

Data Model: Cassandra's data model can be considered a mix of Google's BigTable and Amazon's Dynamo. Like BigTable, Cassandra uses a key-value structure in which columns are attached to keys. Like Dynamo, the database runs on nodes organized into clusters. Each node in the cluster
has the same job, which is why there is no single point of failure. Cassandra has its own language, CQL (Cassandra Query Language), which is used to perform operations. These operations include insert, copy, and create, and their usage is very similar to SQL.

Physical Storage: Cassandra's storage is based on the table scheme mentioned earlier. The multi-dimensional maps and keys are implemented through partitioning and hashing: each key is hashed, and each node in the cluster is responsible for a certain range of hash values. Consistent hashing keeps the work divided across nodes even as nodes are added or removed.

Transactions: Rather than fully "ACID" transactions, Cassandra offers atomic, isolated, and durable transactions without strong consistency. Instead it provides eventual consistency that can be tuned by the user. Atomicity applies at the row level: a write to a single row either applies in full or not at all, although Cassandra does not roll back a write that succeeded on one replica and failed on another. Writes to a row also do not interfere with each other and, since they are durable, persist even after system crashes or failures. There are also different levels of transactions, such as lightweight transactions, and which level is needed depends on the situation. For example, when the tunable consistency of normal durable transactions isn't enough, a lightweight transaction (sometimes known as compare and set) provides linearizable consistency, and therefore might meet the
situation's needs. That said, for the majority of situations the normal durable transactions typically suffice.

Scalability: Scalability has various definitions. For Cassandra and throughout the rest of this report, the term follows the definition given by Datastax: "we'll define scalability as the ability to add computational resources to a database in order to gain more throughput." Under this definition there are two types of scalability: vertical and horizontal. Vertical scalability means moving the data to a single machine with more power or capacity, which can be very expensive. Horizontal scalability means adding more machines to improve performance. Cassandra fits horizontal scalability very well because of its use of nodes: when hardware is added as new nodes, the cluster detects them and takes advantage of the added resources.

MongoDB

History: The beginnings of the MongoDB database management system can be traced to as early as 2007. The company 10gen (now MongoDB, Inc.) began developing the database as a component of a product it was planning at the time. In 2009 the database was released to the open source community, where it has quickly become a leading choice in the world of databases.

Data Model: MongoDB uses a layout similar to JSON, in which keys map to values. Each record is called a document and a group of documents is called a collection. The layout is dynamic: a document does not need to have the same keys as another document in the same collection. MongoDB's operations use keywords familiar from other databases, such as insert, delete, and update. Due to the map-like layout of the documents, searching and retrieving are fast operations.

Physical Storage: MongoDB's storage is built on virtual memory. It uses memory-mapped files so that the operating system manages which data is held in RAM. This leads to performance that varies across operating systems. If data the database is trying to retrieve is not in RAM, the operating system pages it in from disk, evicting other pages if necessary, and how the OS handles this is where the variation can emerge. Another issue that can arise with MongoDB's storage is fragmentation. When documents are removed or moved they leave holes behind. These holes are later filled with other documents, but not perfectly, so some gaps remain. Over time this can lead to severe fragmentation.

Transactions: Transactions in MongoDB are semi-atomic. A write operation is atomic when it acts on a single document. When the same operation is performed on multiple documents, each individual document write is atomic, but the operation as a whole is not, so other operations can interleave. The same model holds for other actions as well. At the document level MongoDB is ACID compliant, but nothing above that level is guaranteed.
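
The difference between single-document and multi-document atomicity can be illustrated with a small sketch. This is a toy in-memory model written for this report, not MongoDB's driver API: the class ToyCollection, its methods, and the account data are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterator

Document = Dict[str, int]


class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (not a MongoDB API)."""

    def __init__(self, docs: Dict[str, Document]) -> None:
        self.docs = docs

    def update_one(self, doc_id: str, change: Callable[[Document], Document]) -> None:
        # Build the new version on a copy, then swap it in with one assignment,
        # so a reader sees either the old document or the new one, never a mix.
        self.docs[doc_id] = change(dict(self.docs[doc_id]))

    def update_many(self, change: Callable[[Document], Document]) -> Iterator[str]:
        # Each document is updated atomically, but control returns to the
        # caller between documents, which is where other operations interleave.
        for doc_id in list(self.docs):
            self.update_one(doc_id, change)
            yield doc_id


def add_bonus(doc: Document) -> Document:
    doc["balance"] += 10
    doc["version"] += 1
    return doc


accounts = ToyCollection({
    "a": {"balance": 100, "version": 0},
    "b": {"balance": 200, "version": 0},
})

steps = accounts.update_many(add_bonus)
next(steps)  # document "a" has been updated
# Another operation reads the collection in the middle of the multi-document write.
snapshot = {key: dict(doc) for key, doc in accounts.docs.items()}
next(steps)  # document "b" has been updated
```

The snapshot taken between the two steps sees document "a" fully updated and document "b" untouched. Neither document is ever half-written, but the multi-document operation as a whole is visible in an intermediate state.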

Scalability: The biggest advantage MongoDB has when it comes to scalability is the use of virtual memory in its storage implementation. This allows MongoDB to excel over other NoSQL databases, especially when the memory needed exceeds the RAM available, and it helps most with the vertical scalability discussed earlier. To handle horizontal scaling MongoDB uses a technique called sharding, in which data records are stored across multiple machines. This eases the load on each machine, so operations that would normally require too much memory can be shared across machines.

HBase

History: As with many other (if not all) database systems, HBase was designed when nothing else appeared to fit the needs of a project. It was designed by the company Powerset, which needed to process large amounts of natural language data and search within that data. It has since grown into a top-level, open source Apache project.

Data Model: HBase's data model is also very similar to Google's BigTable design. It is implemented with rows and columns that are based on keys, which may or may not be unique to the data. This allows more specific lookups and makes it easier to add new kinds of data later in a project without the existing data being a hindrance.
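
A BigTable-style model like this can be pictured as a sparse, nested map. The following is a minimal sketch under that assumption, using the usual HBase terms row key, column family, and column qualifier; the helper functions put, get, and scan and the sample data are hypothetical and are not HBase client calls.

```python
from typing import Dict, Iterator, Optional, Tuple

# row key -> column family -> column qualifier -> value
Row = Dict[str, Dict[str, bytes]]
Table = Dict[bytes, Row]


def put(table: Table, row: bytes, family: str, qualifier: str, value: bytes) -> None:
    # Rows and columns are created on demand, so rows need not share columns.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value


def get(table: Table, row: bytes, family: str, qualifier: str) -> Optional[bytes]:
    # A missing row, family, or qualifier simply yields None (the map is sparse).
    return table.get(row, {}).get(family, {}).get(qualifier)


def scan(table: Table, start: bytes, stop: bytes) -> Iterator[Tuple[bytes, Row]]:
    # Row keys are compared as raw bytes; yields rows with start <= key < stop.
    for key in sorted(table):
        if start <= key < stop:
            yield key, table[key]


pages: Table = {}
put(pages, b"com.example/b", "content", "html", b"<p>B</p>")
put(pages, b"com.example/a", "content", "html", b"<p>A</p>")
# A column added later to one row does not affect the rows written earlier.
put(pages, b"com.example/a", "meta", "language", b"en")

scanned_keys = [key for key, _ in scan(pages, b"com.example/", b"com.example0")]
```

Here only one row has a "meta" family, which mirrors the flexibility described above: new columns can be added to some rows without touching the rest.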

Physical Storage: HBase runs on top of the Hadoop Distributed File System (HDFS), which brings several advantages because HBase inherits Hadoop's features. One advantage is that it can manage small pieces of useful data scattered through a sea of less useful information while remaining fault-tolerant. Another is that HBase is very compatible with MapReduce and can serve as both input and output for MapReduce jobs. Row keys are plain byte arrays, so anything that can be stored in a byte array can be used as a key, and rows are kept sorted by key rather than placed by hash. In the addressing hierarchy, a row key sits under its table.

Transactions: Transactions within HBase are atomic across a row. That is, a mutation that touches only one row is atomic, even if it crosses multiple column families. HBase also offers partial consistency, with "read committed" isolation. All operations that return as successful are durable, but those that fail are not necessarily durable. Another interesting feature is that durability can be tuned by the user, who can control how data is flushed to disk.

Scalability: For horizontal scalability HBase uses regions, each of which is a subset of a table's data stored together as a sorted range of rows. As a region grows it is split into smaller regions to accommodate the growth. HBase also has region servers, each of which is responsible for a group of regions. Each region, though, is served by only one region server.
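
The region mechanism can be sketched in a few lines. This is a simplified illustration, not HBase's implementation: the Region class, the helper functions, and the MAX_ROWS threshold are hypothetical, and the sketch splits on row count only to keep the example short.

```python
import bisect
from typing import List

MAX_ROWS = 4  # hypothetical split threshold, chosen small for the example


class Region:
    """A contiguous, sorted range of row keys beginning at start_key."""

    def __init__(self, start_key: bytes, keys: List[bytes]) -> None:
        self.start_key = start_key
        self.keys = keys  # always kept sorted


def find_region(regions: List[Region], row_key: bytes) -> int:
    # Regions are ordered by start key; pick the last one starting at or before row_key.
    starts = [region.start_key for region in regions]
    return bisect.bisect_right(starts, row_key) - 1


def insert_row(regions: List[Region], row_key: bytes) -> None:
    index = find_region(regions, row_key)
    region = regions[index]
    if row_key not in region.keys:
        bisect.insort(region.keys, row_key)
    if len(region.keys) > MAX_ROWS:
        # Split at the middle key: the upper half becomes a new region.
        mid = len(region.keys) // 2
        upper = Region(region.keys[mid], region.keys[mid:])
        region.keys = region.keys[:mid]
        regions.insert(index + 1, upper)


# Start with one region covering the whole key space.
regions = [Region(b"", [])]
for name in [b"delta", b"alpha", b"echo", b"bravo", b"charlie", b"foxtrot"]:
    insert_row(regions, name)
```

After the fifth insert the single region exceeds the threshold and splits at "charlie", so the sixth key is routed to the new upper region. Assigning each resulting region to a region server would be a separate step, which this sketch leaves out.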

Differences and Conclusion: All databases have their strengths and weaknesses. The main differences between the options are usually what is given up in exchange for the benefits, and how well those trade-offs line up with the needs the database will serve. Of these three examples, Cassandra and HBase are rather similar. One difference between them is that Cassandra is a write-oriented system, while HBase is designed for more read-intensive workloads. This is an important fact to take into account when designing a project. MongoDB, on the other hand, has a document-based design rather than the table-and-row design found in Cassandra and HBase. Depending on what data is going to be stored, this may be a more efficient way to manage it. The key in all of these situations is to be well informed and to choose the correct database for the needs of the project.

References:
https://en.wikipedia.org/wiki/apache_cassandra
http://docs.datastax.com/en/cassandra/2.2/cassandra/cassandraabout.html
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useaboutcql.html
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1
http://www.datastax.com/dev/blog/multi-datacenter-replication
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
http://www.datastax.com/dev/blog/why-does-scalability-matter-and-how-does-cassandra-scale
https://en.wikipedia.org/wiki/apache_hbase
http://hbase.apache.org/0.94/book/datamodel.html
http://hbase.apache.org/acid-semantics.html
http://blog.cloudera.com/blog/2013/04/how-scaling-really-works-in-apache-hbase/
https://en.wikipedia.org/wiki/mongodb
https://www.mongodb.org/about/introduction/
http://learnmongodbthehardway.com/schema/chapter3/
https://docs.mongodb.org/manual/core/write-operations-atomicity/
https://docs.mongodb.org/manual/sharding/