DYNAMIC DATA STORAGE AND PLACEMENT SYSTEM BASED ON THE CATEGORY AND POPULARITY


Miss. Radhika Jaju 1, Prof. Priya Deshpande 2
1 Student (MITCOE, Pune)
2 Asst. Professor (MITCOE, Pune)

International Journal of Computer Engineering & Technology (IJCET), Volume 6, Issue 6, June (2015), pp. 08-15, Article ID: 50120150606002
IAEME: www.iaeme.com/ijcet.asp, ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)

ABSTRACT

Storage and management for distributed computing is a widely studied research topic, and various replication strategies have been developed to address the associated storage and management issues. We have developed a system, the S&P System, which gives better results than HDFS on parameters such as performance, access time, and memory utilization. The S&P System divides data by category and stores each category on the node assigned to it, which reduces access time and total cost. The data is then replicated on the basis of the access history, popularity, and replication factor of the file, where the replication factor is calculated by a formula. This improves memory utilization.

Keywords: Category, Access History, Popularity.

I. INTRODUCTION

Distributed computing and its parameters have become very popular research topics in recent years, and big data storage is one of them. Many experiments are carried out every day, and some of them show genuinely effective results. Big data is data of any kind and of any size, and storing and managing such large amounts of data is a challenging task. Many strategies and algorithms have been derived to do so, and some of them have shown better results and enhanced overall performance. Hadoop is the basic technology, where a simple strategy is used for the storage and placement of data: Hadoop stores data in the form of chunks, where all chunks are of equal size. Large volumes of data are generated regularly, and this data needs to be stored properly for efficient access and to prevent data loss, memory loss, redundancy, and duplication. So how to place the data? Where to place the data? Why place the data there? [1]

Hadoop is an open source framework that addresses the storage issue and plays an important role in distributed computing systems. The following describes the process of storing data in Hadoop.

A Hadoop cluster basically has a Namenode, which acts as the master, and a group of Datanodes attached to it, which act as slaves. The Namenode is connected to several Datanodes on which the user performs storage operations. The client asks the Namenode to store data; the Namenode checks the availability of the Datanodes and sends an acknowledgement back to the client. The Datanodes on which the client writes the data are pipelined together, and several copies of the stored data are generated and stored on the other nodes of the pipeline. The copy count is a constant number in Hadoop, usually 3 or 4. Because several copies exist, storage is safe and efficient: if any Datanode crashes, the data is still available on another node and can be recovered from there.
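To make the write path described above concrete, the following is a minimal Python sketch of a namenode-like component that selects datanodes and fans a block out to them. The class, the helper function, and the random node selection are our own simplifications for illustration and do not correspond to actual HDFS APIs.

```python
import random

class SimpleNameNode:
    """Toy namenode: tracks datanodes and hands out a write pipeline."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)   # e.g. ["dn1", "dn2", "dn3", "dn4"]
        self.replication = replication     # constant copy count, usually 3

    def build_pipeline(self, filename):
        # Pick distinct datanodes for the block and its copies, so that the
        # crash of a single node does not lose the data.
        count = min(self.replication, len(self.datanodes))
        return random.sample(self.datanodes, count)

def client_write(namenode, filename, data, storage):
    # The client writes to the first node in the pipeline; each node forwards
    # the block onward until every replica is stored.
    pipeline = namenode.build_pipeline(filename)
    for node in pipeline:
        storage.setdefault(node, {})[filename] = data
    return pipeline

nn = SimpleNameNode(["dn1", "dn2", "dn3", "dn4"])
cluster = {}
print(client_write(nn, "report.txt", b"block data", cluster))
```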

Along with these advantages, the basic Hadoop strategy also has disadvantages such as redundancy, overheads, data loading time, and retrieval time. To overcome these disadvantages, a number of strategies have been implemented, and a few of them perform effectively. We have also applied a strategy that is an extension of Hadoop: starting from basic Hadoop and adding some new concepts to it, we derived a new algorithm that shows effective results and overcomes the disadvantages of Hadoop. Our strategy stores data according to the category it belongs to and then places the copies where they will be used most, so that access time is reduced.

II. RELATED WORK

In [3], Wuqing Zhao and others introduce a new strategy, DORS (Dynamic Optimal Replication Strategy). They first decide whether a file needs to be replicated at all, on the basis of earlier work. An importance number is then given to each file by counting its occurrences and considering its access history, and according to these occurrences the less important files are replaced. The algorithm showed better results in a simulator.

In [4], Alexis M. Soosai et al. proposed the LVR (Least Value Replacement) strategy, which is based on future value prediction. Whenever the data grid storage of a site is full, the LVR framework automatically decides which file should be deleted, using information about the access frequency of the file, the file's future value, and the free space on the storage element. The LVR strategy was evaluated through simulation with the OptorSim simulator.

In [5], Myunghoon Jeon et al. describe DRS (Dynamic Replication Strategy), which is used for improved data access. Data access performance is closely related to the data access pattern, and traditional strategies target a particular access pattern, which makes them less effective for other patterns. DRS is therefore designed so that the strategy changes according to the pattern; as the pattern changes dynamically, the frequency count of files is also adapted dynamically.

In [6], Chang proposed LALW (Latest Access Largest Weight), in which the largest weight is applied to the file that was accessed most recently. Similarly, Sato et al. presented small modifications to simple replication algorithms on the basis of file access pattern and network capacity. DRCP (Dynamic Replica Creation and Placement) proposes a placement policy and replica selection that reduce execution time and bandwidth consumption; its replication is based on the popularity of the file, and the strategy was implemented using the data grid simulator OptorSim [7, 8].

In this paper we derive a strategy for data placement in which we store the data category-wise and place it on the nodes according to the access history, replication factor, and popularity of the file. The rest of the paper is organized as follows: Section III describes the system architecture and the storage and placement of data according to the popularity and category of the data, Section IV shows the results and evaluation, Section V gives the conclusion and future work, and Section VI lists the references.

III. SYSTEM ARCHITECTURE

The figure gives the overall idea of the proposed strategy. Jobs from different clients are submitted to the job broker, which plays the role of the Hadoop Namenode. The job broker runs the K-means algorithm to find the category of the data being submitted, and the data is then stored on the node assigned to that category.
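As a rough illustration of this flow, the sketch below routes a file to the node assigned to its category and falls back to another node when that node is full. The class and node names are hypothetical, the category is assumed to have been determined already (for example by the K-means step described in Section III), and "nearest node" is approximated here simply by the node with the most remaining space.

```python
class JobBroker:
    """Toy job broker: routes each incoming file to the datanode assigned to its category."""

    def __init__(self, category_nodes, free_space):
        self.category_nodes = category_nodes   # e.g. {"vendors": "dn1", "suppliers": "dn2"}
        self.free_space = free_space           # remaining bytes per node, e.g. {"dn1": 500}

    def place(self, filename, size, category):
        node = self.category_nodes[category]
        if self.free_space.get(node, 0) >= size:
            self.free_space[node] -= size
            return node
        # Category node is full: propagate the new data to a nearby node,
        # approximated here by the node with the most remaining space.
        fallback = max(self.free_space, key=self.free_space.get)
        if self.free_space[fallback] < size:
            raise RuntimeError("no node has enough free space for " + filename)
        self.free_space[fallback] -= size
        return fallback

broker = JobBroker({"vendors": "dn1", "suppliers": "dn2"},
                   {"dn1": 500, "dn2": 4000, "dn3": 4000})
print(broker.place("invoice.csv", 800, "vendors"))   # dn1 is full, so the data goes elsewhere
```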

A. Dynamic Data Distribution

When a client requests data storage, the data needs to be stored efficiently so that it can be accessed easily. We therefore need to store the data in a form that makes it easy to retrieve, which in turn reduces the access cost because files are easily accessible. In our strategy, jobs are submitted to the job broker to store the data. The job broker then divides the data category-wise, each category may be split into sub-categories, and so on, and the data is stored on the datanode assigned to its category. Because of this fragmentation the data is convenient to store and access: it can be retrieved easily and the file transfer traffic ratio stays low, which ultimately benefits the performance and cost of file transfers. Data is divided category-wise; for example, data from a plastic factory will have different categories such as vendor data, supplier data, and so on.

Different strategies have been studied for dividing data category-wise. K-means is one of the algorithms used for data categorization. It is a well-known partitioning algorithm in which objects are categorized as belonging to one of K groups, where K is given a priori. Based on the multidimensional mean, i.e., the centroid of a cluster, the membership of an object in a particular cluster is decided: the object is assigned to the group with the closest centroid [9, 10]. K-means works by repeatedly recalculating the centroid of each cluster, and it is cost-effective; a minimal sketch of the basic K-means procedure is given after this subsection. Whenever data arrives at the job tracker, the job tracker invokes the K-means algorithm; K-means divides the data category-wise, and the categorized data is stored on the node assigned to that category. If that node's storage is full, the newly arriving data is automatically propagated to a nearby node.
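The following is a minimal sketch of the basic K-means procedure referred to above, assuming the incoming documents have already been mapped to numeric feature vectors; the two-dimensional example data and the choice of K are illustrative only and are not taken from our system.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain K-means: returns k centroids and the cluster index of each point."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid becomes the mean of its cluster members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(coord) / len(members) for coord in zip(*members)]
    labels = [min(range(k), key=lambda i: dist2(p, centroids[i])) for p in points]
    return centroids, labels

# Example: two-dimensional feature vectors divided into K = 2 categories.
data = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]
centroids, labels = kmeans(data, k=2)
print(labels)
```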

B. Data Placement

When we actually store data, why is data placement needed? Clients store unstructured data of any size and kind, so storage can get full. What if the storage element of a particular site gets full? How is new data to be stored on the sites? How can important data that is already stored be prevented from being deleted? In Hadoop, when such a situation occurs, files are deleted at random and replaced by the new one. Hadoop may delete one or two files, as required by the free space needed for the new file, but there is a chance of deleting a file that will be highly required in the future. Our strategy is implemented to deal with this problem, and it is based on the popularity and access history of the file.

The storage of a new file depends on the replication factor: files that have fewer copies than the replication factor are copied, and files that already have more copies than it are not. The replication factor is decided by the ratio of the capacity of all nodes to the total size of all files [11]:

R = C / W

where R is the replication factor, C is the capacity of all the nodes, and W is the total size of all files in the data grid. R decides whether to replicate a file or not: if a file has fewer copies than R it is replicated, otherwise it is not [12].

In our system there is one replication manager, centrally situated and connected to all the nodes, which maintains the replica count log files and the access history count log files. These logs are cleared after each fixed interval of time; because we use a fixed interval, we obtain an updated replica count and a new replication factor after every interval. Once the popularity counts of all the files in that interval are available, the files are replicated according to these counts: the most popular file gets the first chance of replication, and so on.
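The replication rule above can be written down directly. The sketch below, with illustrative names and made-up numbers, computes R = C/W, replicates a file only while its copy count is below R, orders candidate files by their popularity count for the current interval, and picks the least popular file as the eviction candidate when a node is full.

```python
def replication_factor(node_capacities, file_sizes):
    """R = C / W: total capacity of all nodes over total size of all files."""
    C = sum(node_capacities.values())
    W = sum(file_sizes.values())
    return C / W if W else 0.0

def should_replicate(filename, copy_count, R):
    # A file is copied only while it has fewer copies than the replication factor.
    return copy_count[filename] < R

def replication_order(popularity):
    # The most requested files in the current interval are replicated first.
    return sorted(popularity, key=popularity.get, reverse=True)

def evict_candidate(popularity, files_on_node):
    # When a node is full, the least popular file on it is removed first.
    return min(files_on_node, key=lambda f: popularity.get(f, 0))

# Worked example with made-up numbers:
caps = {"dn1": 4000, "dn2": 4000, "dn3": 4000}           # C = 12000
sizes = {"a.txt": 1000, "b.txt": 2000, "c.txt": 3000}     # W = 6000
R = replication_factor(caps, sizes)                       # R = 2.0
copies = {"a.txt": 1, "b.txt": 2, "c.txt": 3}
hits = {"a.txt": 14, "b.txt": 3, "c.txt": 9}              # popularity counts this interval
print(R, should_replicate("a.txt", copies, R))            # 2.0 True
print(replication_order(hits))                            # ['a.txt', 'c.txt', 'b.txt']
print(evict_candidate(hits, ["b.txt", "c.txt"]))          # 'b.txt'
```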

IV. PERFORMANCE EVALUATION

To show the improved results of our system (the S&P System), we compared it with HDFS.

A. Popularity Comparison

The popularity comparison between the two systems is shown in fig. 3 and fig. 4. We took four files of the same size for both systems. S&P records the request count of each file at every interval, and depending on this count, the most popular file gets the chance to be replicated. HDFS does not consider popularity, which means all files are replicated randomly.

B. Data Placement

The replica comparison at fixed time intervals is shown in figs. 5 and 6. At every interval Hadoop replicates all the files that are requested at that time, whereas the S&P system replicates only those files that are popular. This results in more efficient access times.

C. If Memory Gets Full

If the storage of any node gets full, a new file replaces an old one. In HDFS the file to be deleted is selected randomly, whereas in the S&P system the least popular file is deleted. The scenario is illustrated in figs. 7 and 8.

D. Mean Access Time

As we uploaded files of different types and sizes, we obtained different results. We compared our results with basic Hadoop, and they show that our system's access time is less than that of the original HDFS. The mean access time is calculated according to the data size; we obtained two different values for two different sizes of data sets, shown in the corresponding figures.
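As a simple illustration of the metric, the mean access time for a run can be taken as the average of the measured access times of its requests; the values below are placeholders, not the measurements from our experiments.

```python
def mean_access_time(access_times_ms):
    """Average access time over all requests in a run."""
    return sum(access_times_ms) / len(access_times_ms)

# Hypothetical per-request measurements (milliseconds) for two dataset sizes.
small_dataset = [120, 135, 110, 128]
large_dataset = [340, 365, 352, 348]
print(mean_access_time(small_dataset), mean_access_time(large_dataset))
```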

E. Memory Utilization

Memory utilization depends on the strategy used by the system and relates to its storage capacity. Here we compare the original HDFS system with our data storage and placement system. The utilization factor depends on the number of jobs executed.

V. CONCLUSION

In this paper, our data distribution strategy helps to improve data access time. The K-means algorithm divides the data category-wise and sends it to the node assigned to the particular category, so when a user requests or stores a file, K-means runs, the request goes to the particular category node, and the operation is performed there. Our replication strategy is based on the popularity and access history of the file, and it shows better results than the traditional HDFS system. In the future we will also consider scheduling criteria, load balancing, and recovery, so that we can improve the whole system and obtain even better results.

VI. REFERENCES

1. W. Zhao, et al., "A Dynamic Optimal Replication Strategy in Data Grid Environment," International Conference on Internet Technology and Applications, pp. 1-4, 2010.
2. White, Tom, Hadoop: The Definitive Guide, Sebastopol: O'Reilly, 2010.
3. Wuqing Zhao, Xianbin Xu, Zhuowei Wang, Yuping Zhang, Shuibing He, "A Dynamic Optimal Replication Strategy in Data Grid Environment," School of Computer, Wuhan University, Wuhan, China, IEEE, 2012.
4. Alexis M. Soosai, et al., "Dynamic Replica Replacement Strategy in Data Grids," Department of CS, University of Malaysia.
5. Myunghoon Jeon, Kwang-Ho Lim, Hyun Ahn, Byoung-Dai Lee, "Dynamic Data Replication Scheme in Cloud Computing Environment," IEEE, 2012.
6. R. S. Chang, H. P. Chang, "A Dynamic Data Replication Strategy Using Access-Weights in Data Grids," Supercomputing, Vol. 45, No. 3, pp. 277-295, 2008.
7. K. Sashi, A. Selvadoss Thanamani, "A New Replica Creation and Placement Algorithm for Data Grid Environment," IEEE International Conference on Data Storage and Data Engineering, 2010.
8. K. Sashi, A. Selvadoss Thanamani, "Dynamic Replication in a Data Grid Using a Modified BHR Region Based Algorithm," Elsevier Future Generation Computer Systems, 2011.

9. Chen G., Jaradat S., Banerjee N., Tanaka T., Ko M., and Zhang M., "Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell Gene Expression Data," Statistica Sinica, Vol. 12, pp. 241-262, 2002.
10. Osama Abu Abbas, "Comparisons Between Data Clustering Algorithms," The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
11. Wolfgang Hoschek, Francisco Javier Jaén-Martínez, Asad Samar, Heinz Stockinger, and Kurt Stockinger, "Data Management in an International Data Grid Project," Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Springer-Verlag, 2000, pp. 77-90.
12. Radhika Jaju, et al., "Dynamic Data Storage and Replication Based on the Category and Data Access Patterns," MITCOE, Pune University, IJSWS, 2015.
13. D. Sai Anuhya and Smriti Agrawal, "3-D Holographic Data Storage," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 6, 2013, pp. 232-239, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
14. Yaser Fuad Al-Dubai and Dr. Khamitkar S. D., "A Proposed Model for Data Storage Security in Cloud Computing Using Kerberos Authentication Service," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 6, 2013, pp. 62-69, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
15. D. Pratiba and Dr. G. Shobha, "Privacy-Preserving Public Auditing for Data Storage Security in Cloud Computing," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 441-448, ISSN Print: 0976-6367, ISSN Online: 0976-6375.