Cassandra - A Distributed Database

Tulika Gupta
Department of Information Technology
Poornima Institute of Engineering and Technology
Jaipur, Rajasthan, India

Abstract- A relational database is a traditional database built on a table-based architecture. Traditional relational systems are hard to scale horizontally and generally do not replicate data automatically across machines. These limitations led to the development of non-relational databases, also known as NoSQL. This paper analyzes the need for NoSQL databases and presents an architectural overview of Cassandra. It explains the merits of Cassandra, i.e. how it handles very large amounts of data, which has long been a problem for relational databases such as MySQL and Oracle.

Keywords- Cassandra, NoSQL, RDBMS, Scalability

I. Introduction

Today we live in a world of data. Data is present in every organization, and managing and preserving it is crucial. The loss of a single data entity can result in a massive loss. Social media is the clearest example: sites such as Facebook and Twitter hold millions of important records that must be stored and preserved, and the loss of even 0.05% of that data cannot be compensated. The volume of data held at these data centres is also so large that the access mechanism must be very efficient.

The familiar relational database is used all over the world to manage many kinds of data and information. Relational database management systems are based on the relational model introduced by E. F. Codd. They always work on a tabular format in which relationships are enforced within the tables. The major problem with such systems is scalability: they struggle with continuously growing data and need a great deal of processing to scale to bulk data while still retrieving it quickly and efficiently. Also, many applications do not need their data in the form of related tables queried with SQL. They are better served by data stored as objects or documents that can be retrieved by key. RDBMSs such as DB2, Oracle and MySQL are a poor fit for data at the scale of social sites. Another class of databases, called NoSQL, addresses these problems by offering scalability, schema-free storage and replication support. Cassandra is one NoSQL database that supports all of these features and is highly reliable.

Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across multiple servers. It can serve as a real-time operational data store for online transactional applications as well as a read-intensive database for large-scale business intelligence systems. Apache Cassandra was developed at Facebook and released as open source in 2008; its design draws on Amazon's Dynamo and Google's Bigtable. It is written in Java, and client support exists for languages such as C#, C++, Clojure, Erlang, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby and Scala.

Cassandra is a strong choice when scalability and high availability are needed without compromising performance. One of its best features is automatic replication, so there is no single point of failure. Data is replicated across multiple data centres. It is a highly scalable, decentralized data store that can be scaled out by adding hardware nodes to the system, which also gives it high fault tolerance.

ISSN: 2231-2803 http://www.ijcttjournal.org Page 216

It is designed on the principle of symmetric peer-to-peer nodes to avoid a single point of failure, unlike an RDBMS that relies on a master-slave arrangement. The masterless ring design is elegant, easy to set up, and therefore easy to maintain. Cassandra is queried through CQL (Cassandra Query Language), an efficient query language whose syntax is very similar to SQL. CQL is supported in the CQL command-line client installed with Cassandra.

II. Apache Cassandra

Cassandra is a distributed data system that was originally built to handle data at Facebook. It is an open source database management system that provides continuous availability, scalability, performance and easy distribution of data across multiple servers. Cassandra follows a peer-to-peer architecture. Data is automatically partitioned across multiple nodes using consistent hashing, which protects the database from a single point of failure. Scaling out is achieved simply by adding more nodes to the cluster. Cassandra supports rich data structures and an efficient query language.

A. Key features and benefits

Cassandra provides a number of key features and benefits that make it an efficient database for applications:

- It has a scalable architecture and a masterless design, i.e. all nodes have the same role and there is no concept of master and slave.
- All nodes are active, so data can be written to and read from any node.
- Nodes can be added easily without any downtime.
- The database is continuously available and provides automatic redundancy.
- Modern data types are supported, with fast reads and writes.
- Data can be compressed by up to 80% with little performance overhead.

B. Architecture Overview

Cassandra is designed to offer continuous uptime. Figure 1 shows the ring architecture of Cassandra.

Fig.1. Cassandra's masterless ring architecture

Unlike an RDBMS built on a master-slave architecture, Cassandra has a masterless ring architecture. It is easy to set up and maintain. All nodes play the same role, and nodes communicate with each other through a gossip protocol.

C. Writing and Reading Data

Cassandra performs efficiently both when reading and when writing data. Data is written in a way that achieves high durability together with high performance. First, data written to a node is recorded on disk in a commit log. It is then written to an in-memory structure called a memtable. When the memtable is full, its data is written to an immutable file called an SSTable. Because these disk writes are sequential, many megabytes of I/O can be performed at a time with very little delay. Figure 2 shows the write path of Cassandra.
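The write sequence described above (commit log, then memtable, then SSTable) can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Cassandra's implementation: the class name `WritePathSketch`, the `memtable_limit` threshold and the in-memory lists standing in for on-disk files are all hypothetical.

```python
class WritePathSketch:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # flushed tables, never modified after creation
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write for durability
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the memtable is full

    def flush(self):
        # Freeze the memtable as a sorted, read-only tuple of (key, value) pairs.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = WritePathSketch(memtable_limit=2)
node.write("user:1", "alice")
node.write("user:2", "bob")    # second write fills the memtable and triggers a flush
node.write("user:3", "carol")  # stays in the new, empty memtable
```

After these three writes the commit log holds every write in arrival order, one immutable table has been flushed, and the most recent write is still only in memory.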

Fig.2. Cassandra's Write Path (labels shown: write data, commit log, index, SSTable)

To serve a read, Cassandra runs several steps designed to improve read response time. For each read request, an in-memory data structure called a Bloom filter is used to check the probability that an SSTable contains the requested data. The Bloom filter indicates whether the data may be present. If it may be, Cassandra consults the in-memory cache layer and retrieves the compressed data. If not, the next SSTable is checked and the same process repeats.

Fig.3. Cassandra's Read Path (labels shown: read request, Bloom filter, compression offset, cache, data, memory, return result)

D. Data Distribution and Replication

Relational databases typically require data to be replicated manually across several machines. Cassandra distributes data automatically. It has an internal component known as the partitioner, which decides how data is distributed across nodes. It uses hashing to distribute and replicate the data. Cassandra automatically keeps data balanced across the cluster, so the balance is maintained whenever a new node is added or an existing one is removed.

III. Data Management Components

This section explains Cassandra's data model and its attributes.

A. Cassandra Data Model

Cassandra works as a key-value store. Data can be queried only by key, and join queries are not supported.

Column: It is the lowest unit of data. It contains a name, a value and a timestamp.

    Column
    Name:  byte[]
    Value: byte[]
    Clock: clock[]

Fig.4. Column Structure

    struct Column {
        1: binary name,
        2: binary value,
        3: i64 timestamp,
    }

Here, name is the name of the column, value holds its content, and timestamp records the time associated with the value.

Row: It is a uniquely identifiable unit of data which maintains integrity in the data. Each row possesses ...

Column Family: It consists of keyed rows together with their columns, i.e. it acts as a container for all the columns present in a table.
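The column, row and column family definitions above can be pictured with a short Python sketch. It is only an illustration under assumptions: the `Column` class mirrors the three fields of the struct shown earlier, while the `put` and `get` helpers, the `users` column family and the sample values are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Lowest unit of data: a name, a value and a timestamp."""
    name: bytes
    value: bytes
    timestamp: int


def put(column_family, row_key, column):
    """Store a column under its row key, creating the row if needed."""
    column_family.setdefault(row_key, {})[column.name] = column


def get(column_family, row_key, column_name):
    """Retrieve a value by row key and column name; no joins are involved."""
    return column_family[row_key][column_name].value


# Hypothetical column family: row key -> {column name -> Column}
users = {}
put(users, "alice", Column(b"email", b"alice@example.org", 1))
put(users, "alice", Column(b"city", b"Jaipur", 1))
put(users, "bob", Column(b"email", b"bob@example.org", 2))  # rows may hold different columns

print(get(users, "alice", b"city"))
```

Modelling the column family as a dictionary of dictionaries keeps lookups key-based, which matches the restriction that data can be queried only by key.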
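The partitioning idea from the Data Distribution and Replication discussion, where consistent hashing spreads keys over the nodes of a ring, can be sketched as follows. This is a minimal sketch under assumptions, not Cassandra's partitioner: the `Ring` class, the `token` helper, the node names and the use of MD5 with a single ring position per node are all illustrative choices, and replication is left out.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with one position per node."""

    def __init__(self, nodes):
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owner[t] = node

    def remove(self, node):
        t = token(node)
        self._tokens.remove(t)
        del self._owner[t]

    def owner(self, key):
        # First node at or after the key's position, wrapping around the ring.
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")  # a new node joins the ring
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch every key that changes owner moves to the newly added node, and all other keys stay where they were, which is why nodes can join or leave without reshuffling the whole data set.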

    Row key 1    Column 1    Column 2    Column 3
                 Value 1     Value 2     Value 3

    Row key 2    Column 1    Column 4
                 Value 1     Value 4

Fig.5. Column Family

Key space: It is the container of column families. All the column families are configured within a key space.

    Key space
        Column Family
        Column Family

Fig.6. Key space

B. Distributed Hash Tables

A distributed, decentralized database relies on hash tables. Each key-value pair is stored in the database, and any node can retrieve the data by its key. Hash tables scale to any number of available nodes and can handle the continuous arrival of new nodes as well as the departure or failure of existing nodes.

IV. Cassandra Query Language

Communication with a Cassandra database is carried out through its primary language, the Cassandra Query Language (CQL). Users interact with Cassandra through the CQL shell, cqlsh. With cqlsh, keyspaces and tables can be created, which covers creating the schema, inserting data and executing queries. CQL works much like SQL and shares most of its commands with it.

V. Comparison between Cassandra and Relational Databases

- An RDBMS can handle moderate incoming data velocity, whereas Cassandra can handle high incoming data velocity.

- In an RDBMS, data arrives from one or a few locations, whereas in Cassandra data can arrive from numerous locations.
- An RDBMS manages primarily structured data, whereas Cassandra can manage all types of data.
- Because of its master-slave architecture, an RDBMS has a single point of failure. Cassandra works on a ring architecture, so there is no single point of failure and the database is constantly up.
- RDBMS deployments are centralized, whereas Cassandra deployments are decentralized.
- Cassandra can support high data volumes, whereas an RDBMS can support only moderate data volumes.
- Nested and complex transactions are supported in an RDBMS, whereas Cassandra works only with simple queries.
- In an RDBMS, data is written at only one location, whereas in Cassandra data can be written at multiple locations.
- An RDBMS provides scalability only for reads, whereas Cassandra provides both read and write scalability.

With so many merits over RDBMS, Cassandra can be considered more beneficial in many respects.

VI. Conclusion

This paper has provided an overview of Cassandra and the principles on which it works. Cassandra's architecture has been discussed, i.e. how read and write operations are performed. A contrast between Cassandra and relational databases has also been drawn throughout the paper. From the discussion it can be concluded that Cassandra provides efficient scalability and high update throughput with low latency, and that it is widely used.

VII. References

[1] Misc. Authors, Apache Cassandra 0.6.3 Java Source Code. Available from http://cassandra.apache.org
[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI '06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.
[3] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, Dynamo: Amazon's Highly Available Key-value Store, in Proceedings of the twenty-first ACM SIGOPS Symposium on Operating Systems Principles (2007), ACM Press, New York, NY, USA, pp. 205-220.
[4] A. Lakshman, P. Malik, Cassandra - A Decentralized Structured Storage System, Cornell, 2009.
[5] M. Slee, A. Agarwal, M. Kwiatkowski, Thrift: Scalable Cross-Language Services Implementation, Facebook, Palo Alto, CA, 2007.
[6] R. Tavory, Hector a Java Cassandra