Autonomic Data Replication in Cloud Environment

International Journal of Electronics and Computer Science Engineering 459
Available Online at ISSN

Dhananjaya Gupt, Mrs. Anju Bala
Computer Science and Engineering, Thapar University, Patiala, India

Abstract-- Cloud computing is an emerging practice that offers more flexible infrastructure at lower cost than traditional computing models. Cloud providers offer everything from raw storage capacity to complete application services. These services can be accessed from anywhere, so data flows from one place to another. Because data moves over the network, it can be lost in transit, so multiple copies must be kept; data replication is therefore one of the main issues in cloud computing. In this paper we implement automatic replication of data from a local host to a cloud environment. Replication is implemented with Hadoop, which stores the data on several nodes. If one node goes down, the data can be retrieved seamlessly from another node.

Keywords - Cloud Computing, Fault tolerance, Data Replication.

I. INTRODUCTION

Cloud computing is an emerging practice that offers more flexible infrastructure at lower cost than traditional computing models. Cloud computing software frameworks manage cloud resources and provide scalable and fault-tolerant computing utilities with globally uniform and hardware-transparent user interfaces [1]. The cloud provider takes responsibility for managing infrastructure issues. Today, cloud providers offer everything from raw storage capacity to complete application services in areas such as payroll and customer relationship management. While these services are in use, data flows through the network from one location to another. It therefore becomes a critical task to secure and maintain copies of data as it flows through the network.
Fault tolerance techniques are available that replicate data at different locations to tolerate data loss and ensure continued service. Replication is a key mechanism for achieving scalability, availability and fault tolerance. It can be used to create and maintain copies of data at different sites [13]. When an event affects the primary location where the data resides, the data can be recovered from a secondary location to provide continued service, fault tolerance and higher availability. This incurs a performance overhead, since it takes time to recover data from another site and restart the service, but the fault is tolerated and availability increases. The aim of our research work is to implement data replication from a local machine to a cloud environment. Hadoop is used to replicate the data at different sites. Hadoop is an Apache project; all of its components are available under the Apache open source license.

Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets [2].
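As a rough illustration of why keeping additional replicas raises availability, suppose (this is an assumption made here, not a result from the paper) that each site holding a replica is unavailable independently with probability p. Data replicated at n sites is then unreachable only when all n sites are down at once:

```latex
A(n) = 1 - p^{n}, \qquad A(1) = 1 - p, \qquad A(2) = 1 - p^{2}
```

With a hypothetical p = 0.05, a single copy gives A(1) = 0.95 and two copies give A(2) = 0.9975. Correlated failures, for example two virtual machines that share one physical host, violate the independence assumption and would make the real figure lower.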

IJECSE, Volume 2, Number 2
Dhananjaya Gupt and Mrs. Anju Bala

Figure-1: HDFS Architecture [4].

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes [4].

The rest of the paper is organized as follows. Section II describes work related to our research and the challenges of a replicated environment. Section III presents the Hybrid Virtualized Architecture. Section IV covers the implementation of this architecture, in which data is replicated at two different sites using Hadoop's HDFS, together with the experimental results. Section V concludes the paper.

II. RELATED WORK

As mentioned in the introduction, data flows through the network, so it becomes critical to secure and maintain data at multiple sites so that any lost data can be recovered without much overhead. Persistent data stored in distributed file systems ranges in size from small to large, is likely to be read multiple times, and is typically long-lived. In comparison, intermediate data generated in cloud programming paradigms has uniquely contrasting characteristics [6]. Many fault tolerance techniques deal with virtual machine (VM) migration, process migration and application migration to overcome the impact of a fault [9][10]. The time-series based precopy technique migrates a VM from a source host to a target host [11]; in this approach the data is transferred in the form of pages. The proactive live process migration mechanism [12] migrates a process before the fault occurs. To tolerate a fault, migration must take place, which involves a large amount of data transfer.
This is a performance hit, because time is consumed in the data transfer and the system is unavailable during that period. Performance and availability can be increased if data is placed at more than one site. Replication is one of the most widely studied phenomena in distributed environments [13]. It is a strategy in which multiple copies of some data are stored at multiple sites. When required, data is fetched from the nearest available replica to avoid delay and increase performance. Cloud computing demands high availability, which makes data replication in a cloud environment a challenging task. The difficulty in providing efficient and correct wide-area database replication is that it requires integrating techniques from several fields, including distributed systems, databases, network protocols and operating systems [16]. Data replication schemes over storage providers with a KVS (key-value store) interface are inherently more difficult to realize than replication schemes over providers with richer interfaces [15].

The following are a few of the challenges in a replicated environment:

Data Consistency: Maintaining data integrity and consistency in a replicated environment is of prime importance. High-precision applications may require strict consistency (e.g. one-copy serializability, 1SR) of the updates made by transactions [14].

Downtime during new replica creation: If strict data consistency is to be maintained, performance is severely affected while a new replica is being created, because sites cannot fulfill requests during that time due to the consistency requirements.

Maintenance overhead: If files are replicated at more than one site, they occupy additional storage space and must be administered. Storing multiple copies therefore carries an overhead.

Lower write performance: In a replicated environment, write performance can be dramatically lower for update-intensive applications, because a transaction may need to update multiple copies [14].

III. HYBRID VIRTUALIZED ARCHITECTURE

Figure-2: Hybrid Virtualized Architecture

Figure 2 shows the Hybrid Virtualized Architecture, which includes a virtual machine workstation (VMware Fusion) to provide virtualization and the Hadoop framework to provide HDFS functionality. Eclipse is used as the Java integrated development environment to write the application code. The virtual environment helped us analyze the cloud environment for different types of application on a single machine. This is a master-slave architecture in which the master node coordinates the slave node and provides fault tolerance: if one node fails, the data can be retrieved from the other node. The cluster is set up between both machines. The architecture consists of a hosting server running VMware and two hosted VMs (master and slave), each running Ubuntu. The components of the Hybrid Architecture are as follows:

Local Machine: An application is executed on the local machine, which runs 32-bit Windows 7. The application is developed in Java using Eclipse. The data generated by this application is sent to HDFS, which stores it at multiple locations. In our experiments we kept the replication factor at 2. The application accesses the file system through the HDFS client, which exports the HDFS file system interface.

Master VM: It runs on the Ubuntu platform. Hadoop is set up on this VM, and both the NameNode and a DataNode run on it.
Incoming data is registered with the NameNode and replicated to the DataNodes according to the replication factor. The replication factor can be increased to improve reliability and availability. The NameNode keeps track of which DataNodes are live, so that when one DataNode is down the data can be fetched from the other one.

Slave VM: It runs on the same Ubuntu platform. The same Hadoop setup is installed on this VM, which runs the second DataNode. This node receives replicated data from the master VM. Whenever the DataNode on the master VM fails, the data is automatically fetched from this DataNode.

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

We have implemented an application in Java using Eclipse. Data required by the application is retrieved from HDFS, and operations are performed on it on the local machine. When the transactions are complete, the data is sent back to HDFS. We have implemented an asynchronous mechanism to replicate data in the virtualized cloud environment. In this architecture Hadoop is set up on both virtual machines. On the first machine we create a NameNode and a DataNode; on the second machine we create only a DataNode. Hadoop is thus configured to have one NameNode and two DataNodes. Data generated by the application is written to HDFS, where it is replicated on the two DataNodes. When data is required, it is retrieved from HDFS. If either of the DataNodes fails at that moment, the data is automatically recovered from the other one.

The experimental platform and software packages used in this system are as follows:

Table-1: Platform Configuration

TYPE              SPECIFICATION
Processor         Intel i5
CPU Speed         2.5GHz
Memory            4.00 GB
Platform          32-bit OS
Operating System  Ubuntu

Table-2: Software Package versions

SOFTWARE PACKAGES  VERSIONS
JDK                1.6
Eclipse
Hadoop
VMware Fusion

Hadoop provides a web interface for statistics. Through this interface we can view the status of all the nodes. Using a few cases, we can observe how data is fetched from the other node when the requested node goes down.

Case 1: Figure-3 shows the overall status monitoring of the system. Here we can browse the system to determine which DataNodes are live. In this case both DataNodes are live, and HDFS can serve the data from either of them. The number of live and dead nodes is reported through this interface.

Case 2: Figure-4 shows the details of all the DataNodes on which the data is present. Data is automatically retrieved from the nearest available node.

Case 3: Figure-5 shows access to files stored in HDFS via any of the DataNodes. Data is replicated on two different nodes. If one node goes down, the data is served from the other node, which increases availability and fault tolerance.
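The recovery behaviour described in the cases above can be mimicked with a small in-memory model. The sketch below is not the Hadoop API and not the authors' application: the class names ToyCluster and ToyDataNode, the file path and the node names are hypothetical, and the model folds the NameNode's bookkeeping and the client's data transfer into one class for brevity. It only illustrates a replication factor of 2 and a read that falls back to the surviving replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of replicated storage. This is NOT the Hadoop API. */
class ToyCluster {

    /** Hypothetical stand-in for a DataNode: stored files plus a liveness flag. */
    static class ToyDataNode {
        final String name;
        final Map<String, String> files = new HashMap<>();
        boolean live = true;

        ToyDataNode(String name) {
            this.name = name;
        }
    }

    private final List<ToyDataNode> nodes;
    private final int replicationFactor;
    // Bookkeeping a NameNode would hold: which nodes store each file.
    private final Map<String, List<ToyDataNode>> locations = new HashMap<>();

    ToyCluster(List<ToyDataNode> nodes, int replicationFactor) {
        this.nodes = nodes;
        this.replicationFactor = replicationFactor;
    }

    /** Stores the data on up to replicationFactor live nodes. */
    void write(String path, String data) {
        List<ToyDataNode> targets = new ArrayList<>();
        for (ToyDataNode node : nodes) {
            if (node.live && targets.size() < replicationFactor) {
                node.files.put(path, data);
                targets.add(node);
            }
        }
        if (targets.isEmpty()) {
            throw new IllegalStateException("no live node to write to");
        }
        locations.put(path, targets);
    }

    /** Reads from the first live replica, skipping nodes that are down. */
    String read(String path) {
        List<ToyDataNode> replicas = locations.get(path);
        if (replicas == null) {
            replicas = Collections.emptyList();
        }
        for (ToyDataNode node : replicas) {
            if (node.live) {
                return node.files.get(path);
            }
        }
        throw new IllegalStateException("no live replica for " + path);
    }

    /** Number of nodes currently marked live, as a status page would report. */
    int liveCount() {
        int count = 0;
        for (ToyDataNode node : nodes) {
            if (node.live) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ToyDataNode master = new ToyDataNode("master-vm");
        ToyDataNode slave = new ToyDataNode("slave-vm");
        ToyCluster cluster = new ToyCluster(Arrays.asList(master, slave), 2);

        cluster.write("/demo/records.txt", "application data");
        master.live = false; // simulate failure of the DataNode on the master VM

        System.out.println("live nodes: " + cluster.liveCount());
        System.out.println("read: " + cluster.read("/demo/records.txt"));
    }
}
```

Marking both nodes as down makes read throw, which corresponds to losing every replica; a higher replication factor spread over more nodes postpones that case.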

Figure-3: Web interface of the NameNode

Figure-4: Accessing a live DataNode containing data

Figure-5: Accessing data present at the DataNode.

V. CONCLUSION AND FUTURE SCOPE

In this paper we have proposed a virtualized cloud system architecture based on Hadoop. We have presented a highly reliable system that provides data replication in a virtualized cloud environment. Data is replicated on multiple VMs. An application was developed and executed, and the experimental results validate the system's fault tolerance and its replication across multiple nodes: when one node fails, the data is recovered via the other node. As a future extension, the system can be improved by replicating data in real time with a higher replication factor to ensure much higher availability and fault tolerance. This data replication mechanism can also be combined with other fault tolerance techniques to achieve greater reliability.

REFERENCES

[1] Application Architecture for Cloud Computing, white paper.
[2] Apache Hadoop.
[3] Golam Moktader Nayeem, Mohammad Jahangir Alam, Analysis of Different Software Fault Tolerance Techniques.
[4] HDFS (Hadoop Distributed File System) architecture, design.html.
[5] Alain Tchana, Laurent Broto, Daniel Hagimont, Fault Tolerant Approaches in Cloud Computing Infrastructures, The Eighth International Conference on Autonomic and Autonomous System, ICAS.
[6] Steven Y. Ko, Imranul Hoque, Brian Cho and Indranil Gupta, On Availability of Intermediate Data in Cloud Computations.
[7] Geoffroy Vallee, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Stephen L. Scott, A Framework for Proactive Fault Tolerance.
[8] Julia Myint, Thinn Thu Naing, Management of Data Replication for PC Cluster-based Cloud Storage System, International Journal on Cloud Computing: Services and Architecture (IJCCSA), Vol. 1, No. 3, 31-41, November.
[9] Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott, Proactive Process-Level Live Migration in HPC Environments.
[10] Gang Chen, Hai Jin, Deqing Zou, Bing Bing Zhou, Weizhong Qiang, Gang Hu, SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment, IEEE International Conference on Cluster Computing.
[11] Bolin Hu, Zhou Lei, Yu Lei, Dong Xu, Jiandun Li, A Time-Series Based Precopy Approach for Live Migration of Virtual Machines, IEEE 17th International Conference on Parallel and Distributed Systems.
[12] Chao Wang, Frank Mueller, Christian Engelmann, Proactive process level live migration and back migration in HPC environments.
[13] Sushant Goel, Rajkumar Buyya, Data replication strategies in wide area distributed systems.
[14] Yu, H., and Vahdat, A. Consistent and automatic replica regeneration. Trans. Storage 1, 1 (2005).
[15] Christian Cachin, Birgit Junker, Alessandro Sorniotti, On Limitations of Using Cloud Storage for Data Replication.
[16] Yair Amir, Claudiu Danilov, Michal Miskin-Amir, Jonathan Stanton, Ciprian Tutu, Practical Wide-Area Database Replication, CNDS, Johns Hopkins University.
