Cassandra- A Distributed Database

Tulika Gupta
Department of Information Technology, Poornima Institute of Engineering and Technology, Jaipur, Rajasthan, India

Abstract- A relational database is a traditional database built on a table-based architecture. Relational systems are difficult to scale horizontally and usually offer no built-in, automatic data replication. These limitations led to the development of non-relational databases, also called NoSQL. This paper analyzes the need for NoSQL databases and presents an architectural overview of Cassandra. It explains the merits of Cassandra, i.e. how it handles very large amounts of data, which has long been a problem for relational databases such as MySQL and Oracle.

Keywords- Cassandra, NoSQL, RDBMS, Scalability

I. Introduction

Today we live in a world of data. Every organization holds data, and managing and preserving it is crucial. The loss of a single data entity can result in massive loss. Social media is an extreme case: sites such as Facebook and Twitter store enormous amounts of critical data, and losing even 0.05% of it is unacceptable. The data held in their data centres is also so large that the access mechanism must be very efficient.

The relational database is used all over the world to manage many kinds of data and information. Relational database management systems (RDBMS) are based on the relational model introduced by E. F. Codd. They store data in tabular form, with relationships enforced between tables. The major problem with such systems is scalability: they cannot easily absorb continuous growth of data, and they need a great deal of processing to retrieve data quickly and efficiently from very large data sets. Also, many applications do not need their data held in related tables and queried with SQL; they are better served by data stored as objects or documents that are retrieved by key. It is hard to imagine an RDBMS such as DB2, Oracle or MySQL handling the data of social sites. Another type of database, called NoSQL, addresses these problems by offering scalability, schema-free storage and replication support. Cassandra is one of the NoSQL databases; it supports all of these features and is highly reliable.

Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across multiple servers. It can serve both as a real-time operational data store for online transactional applications and as a read-intensive database for large-scale business intelligence systems. Apache Cassandra was developed at Facebook and released as open source in 2008; its design is based on Amazon's Dynamo and Google's Bigtable. It is written in Java, and client libraries are available for languages such as C#, C++, Clojure, Erlang, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby and Scala.

Cassandra is one of the right choices when you need scalability and high availability without compromising performance. One of its best features is automatic replication: data is replicated across multiple nodes and data centres, so there is no single point of failure. It is a highly scalable, decentralized data store that can easily be scaled out by adding hardware nodes to the system, which also gives it high fault tolerance.

ISSN: 2231-2803 http://www.ijcttjournal.org Page 216

It is designed on the principle of symmetric peer-to-peer nodes to avoid a single point of failure, unlike an RDBMS, which is typically deployed in a master-slave configuration. The masterless ring design is elegant, easy to set up and therefore easy to maintain. Cassandra is queried through CQL (Cassandra Query Language), an efficient query language whose syntax is very similar to SQL. CQL is supported in the CQL command-line client installed with Cassandra.

II. Apache Cassandra

Cassandra is a distributed data system that was built to handle data at Facebook's scale. It is an open source database management system that provides continuous availability, scalability, performance and easy distribution of data across multiple servers. Cassandra is based on a peer-to-peer architecture. Data is automatically partitioned across multiple nodes using consistent hashing, which protects the database from a single point of failure. Scalability is achieved simply by adding more nodes to the cluster. Cassandra supports rich data structures and an efficient query language.

A. Key features and benefits

Cassandra provides a number of key features and benefits that make it an efficient database for applications:

- It has a scalable, masterless architecture: all nodes have the same position and there is no concept of master and slave.
- All nodes are active, so data can be written to and read from any node.
- Nodes can be added without any downtime.
- The database is continuously available, with automatic redundancy.
- All modern data types are supported, with fast reads and writes.
- Data can be compressed by up to 80% without noticeable overhead.

B. Architecture Overview

Cassandra is designed to offer continuous uptime. Figure 1 shows the ring architecture of Cassandra.

Fig. 1. Cassandra's masterless ring architecture

Unlike an RDBMS, which typically uses a master-slave architecture, Cassandra has a masterless ring architecture. It is easy to set up and maintain. All nodes play the same role, and nodes communicate with each other through a gossip protocol.

C. Writing and Reading Data

Cassandra performs efficiently for both reads and writes. Data is written in a way that gives both high durability and high performance. A write arriving at a node is first recorded on disk in the commit log. It is then applied to an in-memory structure called the memtable. When the memtable is full, its contents are written to an immutable file called an SSTable. Because these disk writes are sequential, large volumes of I/O can be performed with very little delay. Figure 2 shows the write path of Cassandra.
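The write path described in Section II.C can be sketched in a few lines of Python. This is only an illustrative toy model, not Cassandra's implementation: the class name Node, the memtable_limit parameter and the use of in-memory lists in place of real files are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one node's write path; all names here are hypothetical."""
    memtable_limit: int = 3                          # assumed flush threshold (entries)
    commit_log: list = field(default_factory=list)   # stands in for the on-disk commit log
    memtable: dict = field(default_factory=dict)     # in-memory, mutable
    sstables: list = field(default_factory=list)     # flushed, immutable tables

    def write(self, key, value):
        # 1. Append to the commit log first, so the write survives a crash.
        self.commit_log.append((key, value))
        # 2. Apply the write to the in-memory memtable.
        self.memtable[key] = value
        # 3. When the memtable is full, flush it as a sorted, immutable SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A tuple is used so that the flushed table can never be modified again.
        sstable = tuple(sorted(self.memtable.items()))
        self.sstables.append(sstable)
        self.memtable = {}


node = Node()
for i in range(4):
    node.write(f"k{i}", f"v{i}")
```

After the four writes, the first three keys sit in one flushed SSTable and the fourth is still in the memtable, while the commit log holds all four entries. Every step is an append or a whole-table flush, which is why the disk access pattern stays sequential.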

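Section II states that data is partitioned across nodes with consistent hashing. The following is a minimal sketch of that idea. The node names, the use of MD5 as the hash function and the rule of placing replicas on the next nodes clockwise are assumptions chosen for illustration; the paper does not specify them.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor=2):
        self.rf = replication_factor
        # Each node owns the position given by the hash of its (hypothetical) name.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        """First node clockwise from the key's token, plus the next rf-1 nodes."""
        positions = [t for t, _ in self.tokens]
        start = bisect.bisect_right(positions, token(key)) % len(self.tokens)
        count = min(self.rf, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(100)]
before = {k: ring.replicas(k)[0] for k in keys}

# Add a fourth node and see which keys change their primary owner.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
after = {k: bigger.replicas(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the design is visible in moved: every key that changes owner moves to the new node, and keys owned by the other nodes stay where they were. This is what lets a cluster grow or shrink without reshuffling all of its data.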
Fig. 2. Cassandra's write path (diagram labels: Write data, Commit log, Index, SSTable)

To serve a read request, Cassandra takes several steps to reduce response time. An in-memory data structure called a Bloom filter is consulted first to check the probability that an SSTable contains the requested data. If the Bloom filter indicates that the data may be present, Cassandra moves on to the in-memory cache layer and retrieves the compressed data. If the data is not found, the next SSTable is checked and the same process is repeated.

Fig. 3. Cassandra's read path (diagram labels: Read Request, Bloom Filter, Compression Offset, Cache, Data, Memory, Return Result)

D. Data Distribution and Replication

Relational databases typically require data to be replicated manually across several machines. Cassandra distributes data automatically. It has an internal component known as the partitioner, which decides how data is distributed across the nodes. It uses hashing to place and replicate the data. Cassandra automatically keeps data balanced across the cluster, so that the balance is maintained whenever a new node is added or an existing one is removed.

III. Data Management Components

This section explains Cassandra's data model and its attributes.

A. Cassandra Data Model

Cassandra works as a key-value store. Data can be queried only by key, and join queries are not supported.

Column: It is the smallest unit of data. It contains a name, a value and a timestamp.

    Column
    Name:  byte[]
    Value: byte[]
    Clock: clock[]

Fig. 4. Column structure

    struct Column {
      1: binary name,
      2: binary value,
      3: i64 timestamp,
    }

Here, name is the name of the column, value is the content stored in it, and timestamp records when the value was written.

Row: It is a uniquely identifiable unit of data, which maintains integrity in the data. Each row possesses [...]

Column Family: It consists of keyed rows together with their columns, i.e. it acts as a container for all the columns present in a table.
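The nesting of column, row and column family described above can be mirrored with a small Python sketch. The class and method names, the sample data and the microsecond timestamp convention are assumptions made for this example; only the three column fields come from the struct shown in the text.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Column:
    name: bytes
    value: bytes
    timestamp: int  # mirrors the i64 timestamp field of the Thrift struct


@dataclass
class ColumnFamily:
    """Keyed rows; each row maps a column name to a Column."""
    name: str
    rows: dict = field(default_factory=dict)

    def insert(self, row_key, column_name, value, timestamp=None):
        if timestamp is None:
            timestamp = time.time_ns() // 1000  # microseconds, an assumed convention
        row = self.rows.setdefault(row_key, {})
        row[column_name] = Column(column_name, value, timestamp)

    def get(self, row_key, column_name):
        # Data is reachable only through its keys; there is no join operation.
        return self.rows[row_key][column_name].value


users = ColumnFamily("users")  # hypothetical column family
users.insert("row-key-1", b"name", b"Ada")
users.insert("row-key-1", b"email", b"ada@example.org")
users.insert("row-key-2", b"name", b"Lin")  # rows need not share the same columns
```

A lookup such as users.get("row-key-1", b"name") walks from the row key to the column name, which reflects the key-based access pattern of the data model.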
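Returning to the read path of Section II.C, the role of the Bloom filter can be illustrated with a short sketch. It is a simplified model under stated assumptions: the filter size, the number of hash functions, the SHA-256 based hashing and the newest-first search order are choices made for the example, and the cache and compression-offset stages of the real read path are left out.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    def __init__(self, data):
        self.data = dict(data)  # treated as immutable in this sketch
        self.bloom = BloomFilter()
        for key in self.data:
            self.bloom.add(key)
        self.disk_reads = 0  # counts lookups that actually touch the table

    def get(self, key):
        self.disk_reads += 1
        return self.data.get(key)


def read(key, sstables):
    """Check SSTables in order, skipping those the Bloom filter rules out."""
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue  # definitely absent: skip without reading the table
        value = table.get(key)
        if value is not None:
            return value
    return None


newest = SSTable({"k3": "v3"})
oldest = SSTable({"k1": "v1", "k2": "v2"})
result = read("k1", [newest, oldest])
```

Because a Bloom filter never gives a false negative, a table can be skipped safely whenever the filter says the key is absent. A false positive only costs one unnecessary lookup before the search moves on to the next SSTable.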

    Row key 1 | Column 1: Value 1 | Column 2: Value 2 | Column 3: Value 3
    Row key 2 | Column 1: Value 1 | Column 4: Value 4

Fig. 5. Column family

Key space: It is the container for column families. All column families are configured within a key space.

Fig. 6. Key space (diagram: a key space containing several column families)

B. Distributed Hash Tables

Distributed, decentralized databases use distributed hash tables. Each key-value pair is stored in the database, and any node can retrieve the data with the help of its key. Hash tables can scale to any number of nodes and can handle the continuous arrival of new nodes as well as the departure or failure of existing ones.

IV. Cassandra Query Language

Communication with a Cassandra database takes place primarily through the Cassandra Query Language (CQL). Users interact with Cassandra through the CQL shell, cqlsh. With cqlsh, keyspaces and tables can be created; it is used to define the schema, insert data and execute queries. CQL works much like SQL and shares many of its commands.

V. Comparison between Cassandra and Relational Databases

An RDBMS can handle moderate incoming data velocity, whereas Cassandra can handle high incoming data velocity.

In an RDBMS, data arrives from one or a few locations, whereas in Cassandra data can arrive from numerous locations. An RDBMS manages primarily structured data, whereas Cassandra can manage structured, semi-structured and unstructured data. Because of its master-slave architecture, an RDBMS typically has a single point of failure; Cassandra uses a ring architecture, so there is no single point of failure and the database is constantly up. RDBMS deployments are centralized, whereas Cassandra deployments are decentralized. Cassandra can support high data volumes, but an RDBMS can support only moderate data volumes. Nested and complex transactions are supported in an RDBMS, but Cassandra works only with simple queries. In an RDBMS, data is written at only one location, whereas in Cassandra data can be written at multiple locations. An RDBMS typically provides scalability for reads only, whereas Cassandra provides both read and write scalability. With these merits, Cassandra can be considered more beneficial than an RDBMS in many respects.

VI. Conclusion

This paper has provided an overview of Cassandra and the principles on which it works. Cassandra's architecture has been discussed, including how read and write operations are performed. Cassandra has also been contrasted with relational databases throughout the paper. From this discussion it can be concluded that Cassandra provides efficient scalability and high update throughput with low latency, and that it is widely used.