Tanton Jeppson
CS 401R Lab 3

Cassandra, MongoDB, and HBase

Introduction

For my report I have chosen to take a deeper look at three NoSQL database systems: Cassandra, MongoDB, and HBase. I chose these three mostly because of their recent popularity and growth. Presenting the history, data model, physical storage, transactions, and scalability of these three systems will also leave me better prepared to choose which one best fits a specific situation in the future.

Cassandra

History: Cassandra is a database management system designed at Facebook by Avinash Lakshman and Prashant Malik. The original goal of the project was to create a system that could be spread across many computers (nodes) in such a way that the failure of any one part would not mean failure of the whole system (no single point of failure).

Data Model: Cassandra's data model can be considered a mix of Google's BigTable and Amazon's Dynamo. Like BigTable, Cassandra's data model is a key-value structure in which columns are added to keys. Like Dynamo, the database uses nodes organized into clusters. Each node in the cluster has the same job, which is why there is no single point of failure. Cassandra has its own language, CQL (Cassandra Query Language), which is used to perform operations. These operations include insert, copy, and create, and they are used much like their SQL counterparts.

Physical Storage: Cassandra's storage is based on the table scheme described above. The multi-dimensional map of keys is distributed through partitioning and hashing: each key is hashed, and each node in the cluster is responsible for a certain range of hash values. Consistent hashing keeps the work divided across nodes even when nodes are frequently added or removed, because only the keys in the affected ranges have to move.

Transactions: Rather than fully "ACID" transactions, Cassandra offers atomic, isolated, and durable transactions without strong consistency. Instead it provides eventual consistency that the user can tune. Writes are atomic at the level of a single partition: a write either takes effect in its entirety or not at all, although Cassandra does not roll back a write that succeeded on one replica and failed on others. Writes are also isolated, so they do not interfere with each other, and durable, so they persist even through system crashes or failures. Cassandra also offers different levels of transactions, such as lightweight transactions, for different needs. For example, where tunable eventual consistency isn't enough, a lightweight transaction (sometimes known as compare and set) provides linearizable consistency and may therefore meet the situation's needs. That said, for the majority of situations the normal transactions typically suffice.

Scalability: When considering scalability, it's important to recognize that the term has various definitions. For Cassandra and throughout the rest of this report, I use the definition given by DataStax: "we'll define scalability as the ability to add computational resources to a database in order to gain more throughput." Using this definition, I will also talk about two types of scalability: vertical and horizontal. Vertical scalability means moving data from one machine to another machine that has more power or capacity, which can be very expensive. Horizontal scalability means adding more machines to improve performance. Cassandra fits horizontal scaling very well because of its use of nodes: as hardware is added in the form of new nodes, the cluster detects them and takes advantage of the additional resources.

MongoDB

History: The beginnings of the MongoDB database management system can be traced to as early as 2007. The company now known as MongoDB, Inc. (then called 10gen) began developing the database as a component of another product it was building. By 2009 the database had been released to the open source community, where it has quickly become a leading choice in the world of databases.
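Before moving on to MongoDB's data model, the consistent hashing behind Cassandra's physical storage and horizontal scaling can be illustrated with a short sketch. This is a toy model under stated assumptions, not Cassandra's actual implementation: the `HashRing` class, the `token` helper, the node names, and the use of MD5 are all hypothetical choices made for illustration. It shows that when a node joins the ring, the only keys that change owner are the ones the new node takes over.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node token at or after it."""

    def __init__(self, nodes=()):
        self._tokens = []   # sorted list of node tokens
        self._owners = {}   # node token -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def remove_node(self, node: str) -> None:
        t = token(node)
        self._tokens.remove(t)
        del self._owners[t]

    def node_for(self, key: str) -> str:
        # Find the first node token >= the key's token, wrapping around the ring.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")   # scale out by one node
after = {k: ring.node_for(k) for k in keys}

# Keys whose owner changed; each of them now belongs to the new node.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved to node-d")
```

A production ring would typically place each node at many positions to even out the ranges; the sketch uses one position per node to keep the idea visible.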

Data Model: MongoDB uses a layout similar to JSON in which keys map to values. Each element is called a document, and a group of documents is called a collection. The layout is dynamic: a document does not need to have the same keys as another document in the same collection. MongoDB also uses operation names familiar from SQL, such as insert, delete, and update. Because of the map-like layout of the documents, searching and retrieving are fast operations.

Physical Storage: MongoDB's storage is implemented in virtual memory. It uses memory-mapped files so that the virtual memory can be handled by the operating system. This leads to performance that varies across operating systems. If something the database is trying to retrieve is not in RAM, the operating system pages it in from disk, possibly evicting other data to make room, and the way the OS handles this is where the variation emerges. Another issue that can arise with MongoDB's storage is fragmentation. When documents are removed or moved they leave holes behind. These holes are later filled with other documents, but not perfectly, so some gaps remain. Over time this can lead to severe fragmentation.

Transactions: Transactions in MongoDB are semi-atomic. A write operation is atomic at the level of a single document. When the same operation is performed on multiple documents, each individual document write is atomic, but the operation as a whole is not, so other operations can interleave. The same holds for other operations. At the document level MongoDB is ACID compliant, but nothing above that is guaranteed.
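The document model and the per-document atomicity described above can be sketched in plain Python. This is only a model under stated assumptions and does not use the MongoDB driver: the `users` collection, the field names, and the helpers `find_docs` and `update_each` are hypothetical. The generator pauses after every single-document write so that a reader can observe the half-finished multi-document update.

```python
import json

# A "collection" is modeled as a list of "documents" (plain dicts).
# Documents in the same collection need not share the same keys.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com", "credits": 10},
    {"_id": 2, "name": "Grace", "languages": ["COBOL"], "credits": 10},
    {"_id": 3, "name": "Linus", "credits": 10},
]


def find_docs(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]


def update_each(collection, changes):
    """Apply `changes` one document at a time, yielding after each write.

    Each single-document write happens as a whole, but the loop as a whole
    does not: other work can run between the yields.
    """
    for doc in collection:
        doc.update(changes)   # one single-document write
        yield doc["_id"]


writer = update_each(users, {"credits": 20})
next(writer)                                # the first document is updated
snapshot = [d["credits"] for d in users]    # a reader interleaves here
list(writer)                                # the remaining documents are updated
final = [d["credits"] for d in users]

print(json.dumps(find_docs(users, name="Grace")))
print("during update:", snapshot, "after update:", final)
```

The reader in the middle sees one document with the new value and two with the old one, which is the interleaving the Transactions paragraph warns about.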

Scalability: The biggest advantage MongoDB has when it comes to scalability is the use of virtual memory in its storage implementation. This allows MongoDB to excel over other NoSQL databases, especially when the memory needed exceeds the RAM available. This helps particularly with the vertical scalability discussed earlier. To handle horizontal scaling, MongoDB uses a technique called sharding, in which data records are distributed across multiple machines. This eases the load on each machine, so that operations that would otherwise require too much memory on one machine can be shared across several.

HBase

History: As with many other (if not all) database systems, HBase was designed when nothing else appeared to fit the needs of a project. It was created by the company Powerset, which needed to process large amounts of natural language data and to search within that data. It has since grown into a top-level, open source Apache project.

Data Model: HBase's data model is also very similar to Google's BigTable design. It is implemented with rows and columns that are based on keys, which may or may not be unique to the data. This allows for more specific lookups and makes it easier to add new data later in a project without the existing data being a hindrance.
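As a rough illustration of this BigTable-style layout, a table can be modeled as a map from row key to a map of columns. This is a minimal sketch and not the HBase client API: the helpers `put` and `get`, the row keys, and the column names (written as family:qualifier) are hypothetical. It shows a new column being added to one row later on without touching the rows that already exist.

```python
from collections import defaultdict

# table: row key (bytes) -> {column name as b"family:qualifier": value (bytes)}
table = defaultdict(dict)


def put(row_key: bytes, column: bytes, value: bytes) -> None:
    """Store one cell; the row is created on first use."""
    table[row_key][column] = value


def get(row_key: bytes, column: bytes):
    """Return the cell value, or None if the row or column is absent."""
    return table.get(row_key, {}).get(column)


put(b"doc#001", b"text:body", b"the quick brown fox")
put(b"doc#002", b"text:body", b"jumps over the lazy dog")

# Later in the project a new column is introduced for one row only.
# Rows written earlier are left exactly as they were.
put(b"doc#002", b"meta:language", b"en")

rows_in_key_order = sorted(table)   # byte-wise ordering of the row keys
print(rows_in_key_order)
print(get(b"doc#001", b"meta:language"), get(b"doc#002", b"meta:language"))
```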

Physical Storage: HBase runs on top of the Hadoop Distributed File System (HDFS), which brings a number of interesting advantages, since HBase also inherits Hadoop features. One is that it can manage small bits of useful data amid a sea of less useful information while remaining fault tolerant. Another is that HBase is very compatible with MapReduce: it can serve as both the input and the output of a MapReduce job. HBase stores row keys as byte arrays (so anything that can be stored in a byte array can be used as a key) and keeps rows sorted by key. Data is addressed hierarchically, first by table and then by row key.

Transactions: Transactions within HBase are atomic across a row. That is, a mutation that touches only one row is atomic, even if it crosses multiple column families. HBase also has partial consistency, with "read committed" isolation. All operations that return as successful are durable, but those that fail are not necessarily durable. Another interesting aspect of HBase is that durability may be tuned by the user, for example by choosing how often data is flushed to disk.

Scalability: For horizontal scalability HBase uses regions, each of which is a subset of a table's data stored together as a sorted range of rows. As a region grows, it is split into smaller regions to accommodate the growth. HBase also has region servers, each of which is responsible for a group of regions. Each region, however, is served by only one region server.
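The region mechanism can be sketched as follows. This is a simplified model under stated assumptions, not how HBase is implemented: the `Region` and `Table` classes, the tiny `MAX_ROWS_PER_REGION` threshold (counted in rows for simplicity), the server names, and the round-robin assignment of new regions are all hypothetical. It shows a table starting as one region, splitting as rows are added, and every region staying attached to exactly one region server.

```python
import bisect

MAX_ROWS_PER_REGION = 4   # hypothetical split threshold, kept tiny for the demo


class Region:
    def __init__(self, start_key, rows=None, server=None):
        self.start_key = start_key   # inclusive lower bound of the key range
        self.rows = rows or []       # sorted row keys held by this region
        self.server = server         # exactly one region server per region


class Table:
    def __init__(self, servers):
        self.servers = servers
        # A new table starts as a single region covering all keys.
        self.regions = [Region(start_key="", server=servers[0])]

    def _region_for(self, row_key):
        # Regions are sorted by start key; pick the last one starting at or before the key.
        starts = [region.start_key for region in self.regions]
        return self.regions[bisect.bisect_right(starts, row_key) - 1]

    def put(self, row_key):
        region = self._region_for(row_key)
        if row_key not in region.rows:
            bisect.insort(region.rows, row_key)
        if len(region.rows) > MAX_ROWS_PER_REGION:
            self._split(region)

    def _split(self, region):
        mid = len(region.rows) // 2
        upper = region.rows[mid:]
        region.rows = region.rows[:mid]
        # Round-robin placement is an assumption made for this sketch.
        server = self.servers[len(self.regions) % len(self.servers)]
        new_region = Region(start_key=upper[0], rows=upper, server=server)
        self.regions.insert(self.regions.index(region) + 1, new_region)


table = Table(servers=["rs1", "rs2"])
for i in range(10):
    table.put(f"row{i:03d}")

for region in table.regions:
    print(repr(region.start_key), region.rows, "served by", region.server)
```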

Differences and Conclusion: All databases have their strengths and weaknesses. The main difference between the options is usually what is given up in exchange for the benefits, and how those trade-offs line up with the specific needs the database will serve. Of these three examples, Cassandra and HBase are rather similar. One difference between them, though, is that Cassandra is a write-oriented system while HBase is designed for more read-intensive workloads. This is an important fact to take into account when designing a project. MongoDB, on the other hand, has a document-based design rather than the table-and-row design found in Cassandra and HBase. Depending on the data to be stored, this may be a more efficient way to manage the situation. The key in all of these situations is to be well informed and to choose the correct database for the needs of the project.

References:
https://en.wikipedia.org/wiki/apache_cassandra
http://docs.datastax.com/en/cassandra/2.2/cassandra/cassandraabout.html
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useaboutcql.html
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1
http://www.datastax.com/dev/blog/multi-datacenter-replication
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
http://www.datastax.com/dev/blog/why-does-scalability-matter-and-how-does-cassandra-scale
https://en.wikipedia.org/wiki/apache_hbase
http://hbase.apache.org/0.94/book/datamodel.html
http://hbase.apache.org/acid-semantics.html
http://blog.cloudera.com/blog/2013/04/how-scaling-really-works-in-apache-hbase/
https://en.wikipedia.org/wiki/mongodb
https://www.mongodb.org/about/introduction/
http://learnmongodbthehardway.com/schema/chapter3/
https://docs.mongodb.org/manual/core/write-operations-atomicity/
https://docs.mongodb.org/manual/sharding/