DYNAMIC DATA STORAGE AND PLACEMENT SYSTEM BASED ON THE CATEGORY AND POPULARITY

Miss. Radhika Jaju 1, Prof. Priya Deshpande 2
1 Student (MITCOE, Pune)
2 Asst. Professor (MITCOE, Pune)

International Journal of Computer Engineering & Technology (IJCET), IAEME: www.iaeme.com/ijcet.asp
Volume 6, Issue 6, June (2015), pp. 08-15, Article ID: 50120150606002
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Journal Impact Factor (2015): 8.9958 (Calculated by GISI), www.jifactor.com

ABSTRACT

Distributed storage and its management are widely researched topics nowadays, and various replication strategies have been developed to address storage and management issues. We have developed a system, the S&P System, which gives better results than HDFS on parameters such as performance, access time, and memory utilization. The S&P System divides data category-wise and stores each category on the node assigned to it, which reduces access time and total cost. Data is then replicated on the basis of the access history, popularity, and replication factor of the file, where the replication factor is calculated by a formula. This improves memory utilization.

Keywords: Category, Access History, Popularity.

I. INTRODUCTION

Distributed computing and its parameters have been very popular research topics in recent years, and big data storage is one of them. Many experiments take place every day, and some of them show genuinely effective results. Big data is data of any kind and any size, and storing and managing such a large amount of data is a challenging task. Many strategies and algorithms have been derived to do so; some of them have shown better results and enhanced overall performance. Hadoop is the basic technology, in which a simple strategy is used for the storage and placement of data.
Hadoop stores data in the form of chunks, where all chunks are of equal size. A large amount of data is generated continually, and this data needs to be stored properly for efficient access and to prevent data loss, memory waste, redundancy, and duplication. So: how to place the data? Where to place it? Why place it there? [1] Hadoop is an open-source framework that works on the storage issue and plays an important role in distributed computing systems. The following diagram shows the process of storing data on Hadoop.
This shows the storage of data in the Hadoop framework. The Hadoop structure basically has a Namenode, which acts as the master, and a group of Datanodes attached to it, which act as slaves. The Namenode is connected to the several Datanodes on which users perform their storage operations. A client asks the Namenode to store data; the Namenode checks the availability of the Datanodes and sends an acknowledgement to the client accordingly. The Datanodes on which the client will write are pipelined together, and several copies of the stored data are generated and stored on the other nodes of the pipeline. The copy count is a constant in Hadoop, usually 3 or 4. Because of these copies, storage is safe and efficient: if any Datanode crashes, the data is still available on another node and can be recovered from there. Along with its advantages, the basic Hadoop strategy also has disadvantages, such as redundancy, overheads, data loading time, and retrieval time. To overcome these disadvantages, a number of strategies have been implemented, and a few of them perform effectively. We have also applied a strategy that extends Hadoop: taking basic Hadoop and adding some new concepts to it, we derived a new algorithm that shows effective results and overcomes Hadoop's disadvantages. Our strategy stores data according to the category it belongs to, and the copies are then placed where they will be used most, so that consumption time is reduced.

II. RELATED WORK

In [3], Wuqing Zhao and others introduce the new strategy DORS (Dynamic Optimal Replication Strategy). First they decide, on the basis of some earlier work, whether a file needs to be replicated or not. An importance number is then given to each file by counting its occurrences and considering its access history.
According to these occurrences, the less important files are replaced. This algorithm showed better results in the simulator. In [4], Alexis M. Soosai et al. proposed the LVR (Least Value Replacement) strategy. This method is based on future-value prediction: whenever the data grid storage of a site is full, the LVR framework automatically decides which file should be deleted, using information about the access
frequency of the file, the file's future value, and the free space on the storage element. The LVR strategy shows simulated results obtained with the OptorSim simulator. In [5], Myunghoon Jeon et al. define DRS (Dynamic Replication Strategy), which is used for improved data access. Data access performance is closely related to the data access pattern, and traditional strategies are tied to a particular access pattern, which makes them less effective for other patterns. DRS is therefore derived in such a way that the strategy changes according to the pattern: as the pattern changes dynamically, the frequency count of the files is adapted dynamically as well. In [6], Chang proposed LALW (Latest Access Largest Weight), where the largest weight is applied to the file that was accessed most recently. Similarly, Sato et al. presented a small modification of simple replication algorithms based on file access pattern and network capacity. DRCP (Dynamic Replica Creation and Placement) proposes a placement policy and replica selection to reduce execution time and bandwidth consumption; its replication is based on the popularity of the file, and the strategy is implemented using the data grid simulator OptorSim [7, 8]. In this paper we derive a strategy for data placement in which we store the data category-wise and place it on the nodes depending on the access history, replication factor, and popularity of the file. The rest of the paper is organized as follows: Section III describes the system architecture and the storage and placement of data according to its popularity and category; Section IV shows the results and evaluation; Section V gives the conclusion and future work; and Section VI lists the references.

III. SYSTEM ARCHITECTURE

The figure gives the overall idea of the proposed strategy. Jobs from different clients are submitted to the job broker, which works like the Namenode of Hadoop.
The job broker then runs the K-means algorithm to find the category of the submitted data, and the data is stored on the node assigned to that category.
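As an illustration of this categorization step, here is a minimal K-means sketch. It assumes each file has already been reduced to a 2-D feature vector and uses a deterministic initialization for reproducibility; real K-means implementations usually seed the centroids randomly, and the feature extraction itself is application-specific.

```python
def squared_distance(a, b):
    # squared Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    # deterministic initialization (an assumption for this sketch)
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each object to the group with the closest centroid
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:
                # recompute the centroid as the mean of its members
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

# two well-separated groups of hypothetical "file feature vectors"
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Each resulting cluster index would then map to a category node on which the job broker stores the corresponding files.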
A. Dynamic Data Distribution

When a client requests data storage, the data must be stored efficiently so that it can be accessed easily; when files are easy to reach, the access cost is ultimately reduced as well. In our strategy, jobs are submitted to the job broker to store the data. The job broker then divides the data category-wise, each category into sub-categories, and so on, and the data is stored in the appropriate Datanode assigned to that category. This fragmentation makes the data easy to store and access: data is retrieved easily and the file transfer traffic ratio is low, which ultimately affects the performance and the cost of file transfer. For example, the data of a plastics factory will have different categories such as vendor data and supplier data. Different strategies have been studied to divide such data category-wise, and K-means is one of the algorithms used for this categorization. It is a well-known partitioning algorithm in which objects are assigned to one of K groups, where K is given a priori. An object's membership in a particular cluster is decided by the multidimensional mean, i.e. the centroid, of the cluster: the object is assigned to the group with the closest centroid [9, 10]. K-means works by repeatedly calculating the centroid of each cluster, and it is cost-effective. Whenever data arrives at the job tracker, the job tracker invokes the K-means algorithm, which divides the data category-wise; the categorized data is then stored on the node assigned to that category. If that node's storage is full, the newly arriving data automatically propagates to the nearest node.

B. Data Placement

When we are already storing the data, why is data placement needed?
Clients store unstructured data of any size and of any kind, so storage may fill up. What if the storage element of a particular site gets full? How can new data be stored on the sites? How can already-stored important data be protected from deletion? In such a situation, Hadoop deletes files randomly and stores a new
one in their place. Hadoop may delete one or two files, depending on the free space required to store the new file. But there is a chance of deleting a file that will be highly required in the future. Our strategy is implemented to deal with this problem and is based on the popularity and the access history of the file. Whether a new file is replicated depends on the replication factor: if a file has fewer copies than the replication factor, it is copied; if it has more, it is not. The replication factor is decided on the basis of the ratio of the capacity of all the nodes to the total size of all the files:

R = C / W [11]

where R is the replication factor, C is the capacity of all the nodes, and W is the total size of all the files in the data grid. R decides whether to replicate a file: if the number of copies of a file is less than R, the file is replicated; otherwise it is not. [12] In our system there is a replication manager that maintains all the replica-count log files and access-history count log files; it is centrally located and connected to all the nodes. These logs are cleared after each fixed interval of time, so we get an updated replica count and a new replication factor after every interval. Once we have the popularity counts of all the files for an interval, we replicate files according to those counts: the most popular file gets the first chance at replication, and so on.

IV. PERFORMANCE EVALUATION

To show the improved results of our system (the S&P System), we compared it with HDFS.

A. Popularity Comparison

The popularity comparison between the two systems is shown in Fig. 3 and Fig. 4. We took four files of the same size for both systems. S&P records the request count of each file at every interval, and the chance for replication goes to the most popular file based on that count. HDFS does not work on popularity, which means all files are replicated randomly.
B. Data Placement

The replica comparison at fixed time intervals is shown in Figs. 5 and 6. At every interval, Hadoop replicates all the files requested at that time, whereas the S&P System replicates only the files that are popular. This results in more efficient access time.
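The selection rule described in Section III.B, which the comparison above exercises, can be sketched as follows. The threshold R = C/W comes from the paper; the data structures, capacities, and popularity counts are illustrative assumptions.

```python
def replication_factor(node_capacities, file_sizes):
    # R = C / W: total capacity of all nodes over total size of all files
    return sum(node_capacities) // sum(file_sizes)

def files_to_replicate(replica_count, popularity, R):
    # only files with fewer than R copies are eligible;
    # the most popular files get the first chance at replication
    eligible = [f for f, copies in replica_count.items() if copies < R]
    return sorted(eligible, key=lambda f: popularity[f], reverse=True)

# three nodes of 500 units each, files totalling 500 units
R = replication_factor([500, 500, 500], [100, 200, 200])
print(R)  # 3
order = files_to_replicate(
    replica_count={"a": 1, "b": 3, "c": 2},
    popularity={"a": 5, "b": 9, "c": 12},
    R=R,
)
print(order)  # ['c', 'a'] since 'b' already has R copies
```

Here file "b" is skipped despite being popular, because its copy count has already reached R; "c" and "a" are replicated in popularity order.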
C. If Memory Gets Full

If the storage of any node gets full, a new file replaces an old one. In HDFS the file to be deleted is selected randomly, whereas in the S&P System the least popular file is deleted. This scenario is shown in Figs. 7 and 8.

D. Mean Access Time

As we uploaded files of different types and sizes, we got different results. We compared our results with basic Hadoop, and they show that our system's access time is less than that of the original HDFS. The mean access time is calculated according to size; we got two different values for two different sizes of data sets, as follows.
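Separately, the "memory gets full" policy of subsection C can be sketched as below; the file names, sizes, and popularity counts are illustrative assumptions, not the paper's actual data.

```python
def evict_least_popular(node_files, popularity, incoming_size, capacity):
    # node_files maps filename -> size; evict the least popular files
    # until the incoming file fits (HDFS would pick victims at random)
    used = sum(node_files.values())
    victims = []
    for name in sorted(node_files, key=lambda f: popularity.get(f, 0)):
        if used + incoming_size <= capacity:
            break
        used -= node_files.pop(name)
        victims.append(name)
    return victims

node = {"logs.csv": 40, "report.pdf": 30, "video.mp4": 25}
popularity = {"logs.csv": 8, "report.pdf": 2, "video.mp4": 5}
evicted = evict_least_popular(node, popularity, incoming_size=30, capacity=100)
print(evicted)  # ['report.pdf'], the least popular file, is removed first
```

Only as many files are deleted as needed to make room, so popular files such as "logs.csv" survive.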
E. Memory Utilization

Memory utilization depends on the strategy used by the system and relates to its storage capacity. Here we compare the original HDFS with our data storage and placement system. The utilization factor depends on the number of jobs executed.

V. CONCLUSION

In this paper, our data distribution strategy helps to improve data access time. The K-means algorithm divides the data category-wise and sends each category to the node assigned to it, so when a user stores or requests a file, K-means runs and the operation is performed on the corresponding category node. Our replication strategy is based on the popularity and access history of the file, and it shows better results than the traditional HDFS system. In future work we will also consider scheduling criteria, load balancing, and recovery, so that we can operate on the whole system and obtain even better results.

VI. REFERENCES

1. W. Zhao, et al., "A Dynamic Optimal Replication Strategy in Data Grid Environment," International Conference on Internet Technology and Applications, pp. 1-4, 2010.
2. T. White, Hadoop: The Definitive Guide. Sebastopol: O'Reilly, 2010.
3. W. Zhao, X. Xu, Z. Wang, Y. Zhang, S. He, "A Dynamic Optimal Replication Strategy in Data Grid Environment," School of Computer, Wuhan University, Wuhan, China, IEEE, 2012.
4. A. M. Soosai, et al., "Dynamic Replica Replacement Strategy in Data Grids," Department of CS, University of Malaysia.
5. M. Jeon, K.-H. Lim, H. Ahn, B.-D. Lee, "Dynamic Data Replication Scheme in Cloud Computing Environment," IEEE, 2012.
6. R. S. Chang, H. P. Chang, "A Dynamic Data Replication Strategy Using Access-Weights in Data Grids," Supercomputing, Vol. 45, No. 3, pp. 277-295, 2008.
7. K. Sashi, A.
Selvadoss Thanamani, "A New Replica Creation and Placement Algorithm for Data Grid Environment," IEEE International Conference on Data Storage and Data Engineering, 2010.
8. K. Sashi, A. Selvadoss Thanamani, "Dynamic Replication in a Data Grid Using a Modified BHR Region Based Algorithm," Elsevier Future Generation Computer Systems, 2011.
9. G. Chen, S. Jaradat, N. Banerjee, T. Tanaka, M. Ko, M. Zhang, "Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell Gene Expression Data," Statistica Sinica, Vol. 12, pp. 241-262, 2002.
10. O. Abu Abbas, "Comparisons Between Data Clustering Algorithms," The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
11. W. Hoschek, F. J. Jaén-Martínez, A. Samar, H. Stockinger, K. Stockinger, "Data Management in an International Data Grid Project," Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Springer-Verlag, pp. 77-90, 2000.
12. R. Jaju, et al., "Dynamic Data Storage and Replication Based on the Category and Data Access Patterns," MITCOE, Pune University, IJSWS, 2015.
13. D. Sai Anuhya, S. Agrawal, "3-D Holographic Data Storage," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 6, 2013, pp. 232-239, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
14. Y. F. Al-Dubai, S. D. Khamitkar, "A Proposed Model for Data Storage Security in Cloud Computing Using Kerberos Authentication Service," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 6, 2013, pp. 62-69, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
15. D. Pratiba, G. Shobha, "Privacy-Preserving Public Auditing for Data Storage Security in Cloud Computing," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 441-448, ISSN Print: 0976-6367, ISSN Online: 0976-6375.