An Approach to Enhance the Performance of Hadoop MapReduce Framework for Big Data


2016 International Conference on Micro-Electronics and Telecommunication Engineering

Subhash Chandra, Department of Computer Science, ITM University, Gwalior, India
Deepak Motwani, Department of Computer Science, ITM University, Gwalior, India

Abstract: Data analysis is becoming one of the most prominent research topics. Information is the foundation of every organization, small or large. Every organization wants relevant information to grow its business faster, and wants to know what its customers like and dislike. Obtaining this information requires analysing very large volumes of data stored in various places and in different formats. The Hadoop MapReduce framework is becoming a popular platform for processing such large amounts of data efficiently, and organizations use it to process their customer data sets. Hadoop processes datasets with distributed, parallel processes using HDFS and the MapReduce model. Hadoop optimization requires more attention from researchers and programmers. Many approaches have already been developed to optimize the Hadoop framework, including performance tuning and efficient cluster formation. In this research work we have developed an approach to improve the performance of the Hadoop framework. K-Means and K-Medoids are well-known clustering approaches used inside Hadoop. In the proposed approach, a modified K-Medoids clustering algorithm has been developed that gives better results for processing inside Hadoop. The work is tested in a multi-node Hadoop environment.

Keywords: K-means clustering; MapReduce; Hadoop; HDFS

I. INTRODUCTION

Hadoop is an open source framework for processing and analyzing big data with the help of HDFS and MapReduce. Apache Hadoop was developed to process not only structured datasets but also unstructured ones.
The storage capacity of devices has advanced, making it possible to store large amounts of data inexpensively, so storing large data is no longer a major problem. This benefits Hadoop in two ways: first, anyone can buy storage at a reasonable price, and second, Hadoop is open source and costs nothing to run. This combination has made Hadoop popular for data processing. Hadoop is used not only for research but also, with modifications, commercially.

The Hadoop Distributed File System (HDFS) is designed for storing large amounts of data. When Hadoop processes a large data set by dividing it into smaller parts, data loss becomes a real risk, but HDFS stores data reliably so that loss is avoided during processing. HDFS is a distributed file system that can take on additional nodes simply by adding them to the network when required, which makes it scalable in terms of data storage.

MapReduce is a programming model for processing large data sets stored in HDFS using distributed, parallel processes. Storing a large amount of data is not the main problem; processing that much data and obtaining the desired result takes more effort. Hadoop uses the MapReduce model to solve this problem: it divides the large data set into smaller parts and processes those parts in parallel. MapReduce provides the execution model for processing these smaller data sets.

Hadoop can store, process and analyze not only terabytes but petabytes of data. It is designed to scale efficiently: as datasets grow larger, more resources and processing capability can be added by adding processing nodes and storage. By connecting multiple computers to each other, Hadoop becomes a powerful tool for processing large amounts of data efficiently.
A Hadoop MapReduce job is a Java program written as two distinct tasks, map and reduce. Hadoop splits a dataset into blocks, processes these smaller datasets, and merges the results before returning them. The input is a set of key-value pairs in the form <key, value>. Map takes these pairs, processes them according to the given logic and sends its output to the reduce step. Reduce takes the output of map as its input and combines it into the final output.

Hadoop storage is divided into two components, the Name Node and the Data Node. Hadoop keeps watch on all execution with the help of the Task Tracker and the Job Tracker. Hadoop can run on a single node or on multiple nodes. A single-node setup cannot process large data sets efficiently and is suitable only for research purposes, while a multi-node setup can add more nodes as the datasets grow. A single-node Hadoop cluster includes one Name Node and one Data Node, while a multi-node Hadoop cluster includes one Name Node and more than one Data Node. Hadoop clusters are designed to tolerate failures by duplicating each piece of data on other cluster nodes. Hadoop can increase its capacity simply by adding cluster nodes.

II. HADOOP: MAPREDUCE FRAMEWORK

The two main modules of Hadoop are HDFS and MapReduce. The Hadoop Distributed File System (HDFS) is designed for reliable data storage and can span a large cluster of computing nodes. MapReduce is a framework designed to execute applications that process large amounts of both unstructured and structured data by dividing the data into smaller datasets in a distributed, parallel processing environment on a cluster of machines. Hadoop MapReduce processes datasets in a fault-tolerant and reliable way.

Three major components play important roles in a Hadoop framework: the Name Node, the Data Nodes and the client machines.

The Name Node is created on the master node and stores the file system metadata, such as the record of which files and blocks are stored on which Data Node. At startup, the Name Node reads the HDFS state from the fsimage file. The Job Tracker, a daemon on the master node, keeps watch on job status. The Secondary Name Node is connected to the Name Node and keeps a snapshot of the file system metadata in local storage.

The Data Node works as the slave. It is the repository for the actual data during processing. Each Data Node sends a signal to the Name Node at regular intervals to indicate its presence in the Hadoop system. Data Nodes communicate with each other to keep replication high and to balance data by moving copies around. The Data Node serves client read and write requests. The Task Tracker daemon is deployed on the Data Node to execute the individual tasks allocated by the Job Tracker.

Client machines have Hadoop installed with all the Hadoop settings, but act as neither master nor slave. The role of a client machine is to load data into the Hadoop cluster, submit MapReduce jobs describing how the data should be processed, and retrieve the results once the job has finished. A client can read, write and delete files, and perform create and delete operations through the Name Node.

MapReduce is the heart of Hadoop.
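The map and reduce steps described above can be illustrated with a small single-process sketch. It is not Hadoop code: the word-count task and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are assumptions chosen for illustration, and the grouping step only imitates what the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (word, 1) key-value pair per word in an input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, imitating the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: merge every value seen for one key into a single output pair."""
    return key, sum(values)


def run_job(lines):
    """Run map on every line, group by key, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["big data on Hadoop", "Hadoop processes big data"])
print(counts)
```

In a real cluster each call to `map_phase` would run on the node holding that input split, which is why the map output must be expressed as independent key-value pairs.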
Hadoop performs two distinct tasks. The first is the map job and the second is the reduce job. The map function takes the input data and divides it into multiple sets of data, each broken down into tuples in the form of key-value pairs. The reduce function takes the output of the map as its input and merges it. The map job is always performed before the reduce job.

III. RELATED WORK

Hadoop MapReduce is designed as a flexible solution for big data processing, with parameters that can be adjusted to suit requirements. The performance of the Hadoop framework depends on these parameters, and their configuration is a research topic in itself. To obtain results in good time, a suitable combination of MapReduce parameters must be known. The parameters must be managed effectively for every task and scheduled for maximum performance [5].

Hadoop was developed for processing large data sets across a large number of nodes organized as clusters. Hadoop applications differ in resources, dataset sizes and other constraints, and they run into problems such as ineffective CPU and memory utilization. The Hadoop configuration parameters need to be updated for many conditions, such as the resource requirements of a particular application. Small changes to the configuration parameters can make a huge difference to performance for the same application with the same data set [6].

In Hadoop MapReduce, big data is processed with a distributed, parallel programming model. HK-Medoids is a K-Medoids algorithm implemented in the Hadoop MapReduce framework. Every scheduled job follows strict steps in MapReduce: the map phase, the combine phase and the reduce phase are required by any application.
In the map phase, each input data sample is allocated to one cluster. In the combine phase, the centre is calculated for every cluster. In the reduce phase, the centre is recalculated. These phases repeat until the new centre and the old centre no longer differ [9].

Data processing and analysis is a very complex job in Big Data, and Hadoop MapReduce gives an efficient solution for it. Results depend primarily on the selection and tuning of Hadoop MapReduce parameters. Tuning is an efficient way to improve job completion with respect to time and disk utilization. Performance tuning takes into account network traffic, memory usage, CPU usage and many other parameters. Several performance tuning methods have been derived for an optimum result [14].

Data analysis in data mining depends significantly on clustering. In cluster formation, a data set is divided into smaller datasets. Clustering has been applied to dissimilar data types, and it is used to classify datasets that have no predefined categories. Pattern recognition, image processing, text mining and many other areas have been analyzed with clustering approaches. Several cluster formation algorithms have been proposed, and data clustering is a well-defined area of research. It includes widely used approaches such as K-means, which in many cases does not give good results [12]. K-Medoids has advantages over K-means.

Hadoop is increasingly used in various industries and can be beneficial for organizations that consolidate and analyze data [11]. Hadoop includes HDFS as its storage system and MapReduce for processing. Hadoop MapReduce is used for analyzing large datasets on multiple nodes. The Hadoop framework is divided into one master node and many slave nodes.
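The map, combine and reduce phases of the K-Medoids job described earlier in this section can be sketched in a single process as follows. This is only one possible reading: the text does not say how the combine and reduce phases share the centre calculation, so here the combine step is assumed to pre-group each split's points by cluster and the reduce step is assumed to pick the medoid. The Euclidean distance and the names `map_assign`, `combine`, `reduce_medoid` and `iterate` are likewise illustrative assumptions.

```python
def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def map_assign(split, medoids):
    """Map: emit (index of nearest medoid, point) for every point in a split."""
    return [
        (min(range(len(medoids)), key=lambda i: distance(p, medoids[i])), p)
        for p in split
    ]


def combine(pairs):
    """Combine (assumed role): group one split's points by cluster index."""
    local = {}
    for cluster, point in pairs:
        local.setdefault(cluster, []).append(point)
    return local


def reduce_medoid(members):
    """Reduce: choose the member with the smallest total distance to the rest."""
    return min(members, key=lambda c: sum(distance(c, p) for p in members))


def iterate(splits, medoids, max_rounds=20):
    """Repeat map, combine and reduce until the centres stop changing."""
    for _ in range(max_rounds):
        merged = {}
        for split in splits:
            for cluster, points in combine(map_assign(split, medoids)).items():
                merged.setdefault(cluster, []).extend(points)
        # A cluster that received no points keeps its previous medoid.
        new_medoids = [
            reduce_medoid(merged[i]) if i in merged else medoids[i]
            for i in range(len(medoids))
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


splits = [
    [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
    [(2.0, 1.0), (9.0, 8.0), (8.0, 9.0)],
]
print(iterate(splits, [(1.0, 2.0), (8.0, 9.0)]))
```

Each inner list in `splits` stands in for one input split handled by a separate mapper; the stopping test mirrors the condition that the new and old centres no longer differ.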
In many engineering and science domains, big data is an emerging area. Finding useful data by analyzing a huge data set is challenging, and a large dataset needs more time than a smaller one [15].

IV. PROPOSED APPROACH

K-means and K-Medoids are widely known clustering approaches in academic and scientific research. Their main purpose is to divide data into partitions. A cluster formation algorithm is designed to produce clusters made up of smaller dataset segments of similar types. In practice, K-means does not perform well in many scenarios. In many data-partitioning cases, such as with Absolute Pearson, K-means does not perform well. K-means begins by selecting initial centroids, and they are picked at random. Because of this randomness the approach is not suitable for an optimum result and can lead to ineffective output and low quality results. K-means can also be expensive and waste time as the number of clusters, iterations and data items increases. Many improved versions of K-means have been proposed, but they sometimes fall short. The Hadoop framework has more scope for improvement, and research is ongoing to find an effective strategy for selecting initial centroids. We have developed an improved version of the K-Medoids algorithm. Our approach performs better than the existing K-means algorithm in the Hadoop MapReduce framework.

V. EXISTING ALGORITHM

Input:
D = {d1, d2,...,dn} // set of n data items.
K // Number of desired clusters.
Output: A set of k clusters.
Steps:
1. Initialize: randomly select k of the n data points as the Medoids.
2. Associate each data point to the closest Medoid.
3. For each Medoid m
   a) For each non-medoid data point o
   b) Swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the Medoids.

VI. PROPOSED ALGORITHM

Input:
D = {d1, d2,...,dn} // set of n data items.
K // Number of desired clusters.
Output: A set of k clusters.
Steps:
1. Calculate the initial Centroids.
2. Set the clusters with those Centroids.
3. Initially assign each data point to a cluster.
4. Calculate the mean value of the distance of all the data points of that cluster.
5. Define new Centroids with the mean value.
6. Update the Centroids value.
7. Repeat steps 3 to 6 until all data points are assigned to one of the clusters.
8. Initialize: randomly select k of the n data points as the Medoids.
9. Associate each data point to the closest Medoid.
10. For each Medoid m
   a) For each non-medoid data point o
   b) Swap m and o and compute the total cost of the configuration.
11. Select the configuration with the lowest cost.
12. Repeat steps 9 to 11 until there is no change in the Medoids.

VII. EXPERIMENTAL SETUP AND RESULT

We have developed an approach to improve the performance of the Hadoop MapReduce framework for Big Data analysis, based on a modified clustering algorithm. We selected Hadoop as our MapReduce framework and, for the implementation, a multi-node Hadoop installation on the Ubuntu operating system. Four machines were configured for the Hadoop MapReduce framework: one as the master node and the others as slave nodes. We measured execution time for various sample datasets on the Hadoop MapReduce framework with the existing approach and with the proposed approach. Our approach outperforms the existing approaches.

Table 1 shows the results of the K-means approach and the proposed approach for Sample 1, Sample 2, Sample 3, Sample 4 and Sample 5, all run on the multi-node Hadoop framework. The proposed approach gives better results than the K-means approach because its execution time is lower, which shows the improvement in performance.

TABLE 1. HADOOP MULTI NODE MAPREDUCE

Datasets    K-means approach    Proposed approach
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

Fig. 1(a) shows the comparison between the K-means approach and the proposed approach. The graph plots the time for Sample 1, Sample 2, Sample 3, Sample 4 and Sample 5. The execution time of the proposed approach is lower than that of the K-means approach, so the proposed approach performs better.
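The existing K-Medoids steps listed in Section V can be sketched as below. This is a minimal single-machine illustration and not the authors' Hadoop implementation: the Euclidean distance, the fixed random seed and the names `total_cost` and `k_medoids` are assumptions, and each pass is assumed to adopt the single lowest-cost swap and to stop when no swap lowers the cost.

```python
import random


def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Cost of a configuration: each point's distance to its closest medoid."""
    return sum(min(distance(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    """Swap-based K-Medoids following steps 1 to 5 of the existing algorithm."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: random initial medoids
    best = total_cost(points, medoids)       # step 2 is implicit in the cost
    while True:
        best_swap = None
        for i in range(k):                   # step 3: every medoid m
            for o in points:                 # step 3a: every non-medoid o
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate)   # step 3b
                if cost < best:
                    best, best_swap = cost, candidate
        if best_swap is None:                # step 5: medoids did not change
            return medoids, best
        medoids = best_swap                  # step 4: lowest-cost configuration


points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 8), (8, 9)]
medoids, cost = k_medoids(points, k=2)
print(sorted(medoids), cost)
```

Every pass evaluates up to k × (n − k) candidate swaps, each costing a full scan of the data, which is why the choice of initial medoids has a direct effect on execution time.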

Fig. 1(a). Hadoop MapReduce framework comparison result

Fig. 1(b) shows the results for the different data samples. The proposed approach improves performance because its execution time is lower than that of the K-means approach, as the graph comparing the two approaches shows.

Fig. 1(b). Hadoop MapReduce framework comparison result

VIII. CONCLUSION AND FUTURE WORK

Hadoop is a well-known framework for analyzing large datasets. It owes its performance to its ability to analyze datasets in a parallel and distributed environment. Hadoop is an open source framework whose modules are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is responsible for data storage, while MapReduce is responsible for data processing. Huge data sets such as web logs can be processed for analysis by Hadoop. In this research work we have developed an approach to improve the performance of the Hadoop MapReduce framework for Big Data. We have proposed an improved K-Medoids algorithm, and it performs better than the K-means clustering approach. We have tested sample datasets, and our approach outperforms the existing approaches. In future, we will test our work against different types of data sets, such as image file datasets and small file datasets, with varying configuration parameters for the Hadoop MapReduce framework.

REFERENCES

[1] Parth Gohil, Bakul Panchal, J. S. Dhobi, "A Novel Approach to Improve the Performance of Hadoop in Handling of Small Files," in IEEE International Conference on Electrical, Computer and Communication Technologies.
[2] Dweepna Garg, Parth Gohil, Khushboo Trivedi, "Modified Fuzzy K-mean Clustering using MapReduce in Hadoop and Cloud," in IEEE International Conference on Electrical, Computer and Communication Technologies.
[3] E. Raju, M. A. Hameed, K. Sravanthi, "Detecting Communities in Social Networks using Unnormalized Spectral Clustering incorporated with Bisecting K-means," in IEEE International Conference on Electrical, Computer and Communication Technologies, 2015.
[4] Manish Kumar Sharma, Mahesh M. Bundele, "Design & Analysis of K-means Algorithm for Cognitive Fatigue Detection in Vehicular Driver using Respiration Signal," in IEEE International Conference on Electrical, Computer and Communication Technologies.
[5] Pramod Bide, Rajashree Shedge, "Improved Document Clustering using K-means Algorithm," in IEEE International Conference on Electrical, Computer and Communication Technologies.
[6] Rong Gu, Xiaoliang Yang, Jinshuang Yan, Yuanhao Sun, Bing Wang, Chunfeng Yuan, Yihua Huang, "SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters," in Journal of Parallel and Distributed Computing, Volume 74, Issue 3, March 2014.
[7] Garvit Bansal, Anshul Gupta, Utkarsh Pyne, Manish Singhal, Subhasis Banerjee, "A Framework for Performance Analysis and Tuning in Hadoop Based Clusters," Workshop on Smarter Planet and Big Data Analytics (SPBDA 2014), held in conjunction with ICDCN.
[8] Ji Wentian, Guo Qingju, Zhong Sheng, "Improved K-medoids Clustering Algorithm under Semantic Web," in Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013).
[9] Vasiliki Kalavri, Vladimir Vlassov, "MapReduce: Limitations, Optimizations and Open Issues," in 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), July.
[10] Swathi Prabhu, Anisha P Rodrigues, Guru Prasad M S, Nagesh H R, "Performance Enhancement of Hadoop MapReduce Framework for Analyzing Big Data," in IEEE International Conference on Electrical, Computer and Communication Technologies, 5-7 March.
[11] Joseph A. Issa, "Performance Evaluation and Estimation Model Using Regression Method for Hadoop Word Count," in IEEE Access, Volume 3, 18 December 2015.
[12] Subhashree, "Comparison of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques," IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 2, Issue 4, April.
[13] Gopi Gandhi, Rohit Srivastava, "Analysis and Implementation of Modified K-Medoids Algorithm to Increase Scalability and Efficiency for Large Dataset," in International Journal of Research in Engineering and Technology, Volume 03, Issue 06, Jun-2014.
[14] Yaobin Jiang, Jiongmin Zhang, "Parallel K-Medoids Clustering Algorithm Based on Hadoop," in 5th IEEE International Conference on Software Engineering and Service Science (ICSESS), June 2014.
[15] Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shining Li, Chen Wang, "MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs," in Proceedings of the VLDB Endowment, Vol. 7.
[16] Dili Wu, "A Self Tuning System Based on Application Profiling and Performance Analysis for Optimizing Hadoop MapReduce Cluster Configuration," IEEE.
[17] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," CACM, 53(1):72-77.
