SURVEY OF MAPREDUCE OPTIMIZATION METHODS


1 Parmeshwari P. Sabnis, 2 Chaitali A. Laulkar
Computer Department, Sinhgad College of Engineering, Pune, India
Email: 1 sabnis.parmeshwari@gmail.com, 2 calaulkar.scoe@sinhgad.edu

Abstract: MapReduce is a widely used data-parallel programming model for large-scale data analysis. The framework has been shown to scale to thousands of computing nodes and to run reliably on commodity clusters. MapReduce provides a simple programming interface with two functions: map and reduce. These functions are executed automatically in parallel on a cluster without any intervention from the programmer. Moreover, MapReduce offers other benefits, including load balancing, high scalability, and fault tolerance. The challenge escalates when data are produced dynamically and continuously from different geographical locations. For dynamically generated data, an efficient online algorithm is desired to guide the timely transfer of data into the cloud. For geo-dispersed data sets, the best data center on which to aggregate all the data must be selected, because a MapReduce-like framework is most efficient when the data to be processed are all in one place rather than spread across data centers, owing to the enormous overhead of inter-data-center data movement in the shuffle and reduce stages. Recently, many researchers have implemented and deployed data-intensive or computation-intensive algorithms on the MapReduce parallel computing framework for high processing efficiency.

Keywords: MapReduce, Hadoop, Optimization, MapReduce Framework, MapReduce Performance

I. INTRODUCTION

MapReduce is a simple but efficient solution for large-scale data processing and analysis. Apache Hadoop is an open-source implementation of GFS and MapReduce.
Hadoop's MapReduce framework consists of a job scheduler (JobTracker) running on the master node and a task manager (TaskTracker) running on each slave node. Although the MapReduce framework frees users from the labor of cluster management and job scheduling, it has several performance problems. Hadoop uses a single master server to control task execution on the worker servers, which leads to shortcomings such as a fatal single point of failure and limited space capacity, both of which seriously affect scalability. In HDFS, each file, directory, and block is represented as an object in the NameNode's memory, and each object occupies about 150 bytes. If a large number of small files are stored, the NameNode needs a lot of memory, which severely restricts the scalability of the cluster. The JobTracker may be overloaded, since it is responsible for both monitoring and dispatching. Like a database, Hadoop requires specialized tuning according to actual application needs. Many experiments show that there is still much room for improving processing performance.

To process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs that require quick responses. In a widely distributed environment with high network heterogeneity, Hadoop does not always perform well [3]. The main reason for this performance degradation is the interaction and heavy dependency across different MapReduce phases, which arises because data placement and task execution are tightly coupled in the MapReduce paradigm [4].

Users usually expect a short execution or quick response time from a short MapReduce job. To provide SQL-like queries or analysis, several query systems are available, such as Google's Sawzall [22], Facebook's Hive, and Yahoo!'s Pig. These systems execute users' requests by converting SQL-like queries into a series of MapReduce jobs, which are usually short. They are therefore very sensitive to the execution time of the underlying MapReduce jobs, and reducing the execution time of short jobs is very important for these types of applications. This motivates an optimized version of Hadoop designed to reduce the time consumed in executing a job. To improve MapReduce performance for the reasons above, many performance optimization methods have been introduced.

The model of remote computing is not new and has been very successful in supporting computationally intensive jobs. One of the challenges in applying this model to a Hadoop cluster is the cost of transferring data into and out of the dynamically constructed cluster. For an on-demand Hadoop cluster, the data to be processed must first be imported into the cluster. Because processing large amounts of data is central to the typical Hadoop program, the cost of importing those data into a cluster is extremely relevant to overall performance.

Map-Reduce has inherent limitations on its performance and efficiency, and many studies have endeavored to overcome them. When selecting a replica of the input file for a map task, the Map-Reduce framework does not take into account the distribution of the input data blocks in the distributed file system or the load on the computing nodes themselves, which increases the amount of network data transfer and the system load when running map tasks. In particular, when the framework uses the FIFO job scheduling strategy to handle a large number of small jobs, its performance is very low.

II. BACKGROUND AND WORKING OF MAPREDUCE

The MapReduce framework is scheduled by the JobTracker and the TaskTrackers.
[16] The relationship of task allocation is shown in Fig. 1.1. The JobTracker is the only master controller. It can run on any computer in the cluster and is responsible for scheduling and managing the TaskTrackers, allocating Map and Reduce tasks to free TaskTrackers for parallel execution, and monitoring the status of the tasks. There can be more than one TaskTracker. A TaskTracker is in charge of executing tasks [15]. It must run on a DataNode, which means that a DataNode is not only a data storage node but also a computing node. If a task on a TaskTracker fails, the JobTracker allocates the task to another free TaskTracker and reruns it. TABLE I summarizes the inputs and outputs of the two functions Hadoop uses to process large data sets.

The MapReduce model abstracts parallel computation on large clusters into two functions, Map and Reduce. The Map function accepts a set of key-value pairs as input and outputs one or more sets of intermediate key-value pairs.

Fig. 1.1 Working of MapReduce

When a job is submitted to the MapReduce framework, the framework divides it into several Map tasks and assigns them to different nodes. Each Map task processes only part of the input data. After Map processing, the results (the intermediate key-value pairs) are sent to the Reduce function. The Reduce function merges the pairs that share a key, then generates and outputs the key-value pairs that the client requires.

TABLE I MapReduce I/O

Function   Input            Output     Directions
Map        (K1, V1)         (K2, V2)   An input pair (K1, V1) is mapped to a collection of intermediate pairs (K2, V2).
Reduce     (K2, list(V2))   (K3, V3)   The group of intermediate values associated with K2 is reduced to a smaller set of values.

III. POSSIBLE OPTIMIZATIONS

Listed below are some optimization methods that increase the performance of MapReduce.

A. Application-level optimization [26]: Since MapReduce parses the data file iteratively, line by line, writing efficient application programs under this constraint is one route to optimization. The performance of MapReduce can be improved in the following ways:
Avoid unnecessary Reduce tasks.
Pull in external files.
Add a Combiner to the job.
Reuse Writable types.

Use StringBuffer instead of String to track program bottlenecks.

B. Hadoop system parameter optimization: There are over 190 configuration parameters in the current Hadoop system. Adjusting these parameters so that jobs run as fast as possible is another optimization approach. Hadoop system parameter optimization can start from the following three aspects:
Parameters of the Linux file system.
General parameters of Hadoop.
Hadoop job parameters.

C. Hadoop job scheduling algorithm optimization: It has been shown that configuring Hadoop according to cluster hardware information and the number of nodes can greatly improve the performance of a Hadoop cluster. However, this method optimizes performance only statically: the configuration files cannot be modified, reloaded, or put into effect dynamically at run time. Optimizing the job scheduling algorithm addresses this problem. The scheduler is a pluggable module in Hadoop, and users can design their own dispatchers according to their application requirements. Here are three task dispatchers.
1. The default dispatcher: This dispatcher adopts the FIFO algorithm, which is simple and clear and places little burden on the JobTracker.
2. Dispatcher with computational capability (capacity scheduler): This kind of dispatcher supports multiple queues. Each queue has a certain amount of resources and uses a FIFO scheduling policy. Jobs are scheduled according to job priority and submission time.
3. Fair share scheduling: This solution was proposed by Facebook. The design philosophy is to ensure that all jobs obtain as equal a share of resources as possible. When only one job is running in the system, it monopolizes all the resources of the cluster. When there is more than one job, TaskTracker slots are released and assigned to the newly submitted job so that all jobs obtain roughly the same computing resources.

D. Data Transfer Bottlenecks: A major challenge is how to minimize the cost of data transmission for cloud users. Map-Reduce-Merge [8] is a model that adds a Merge phase after the Reduce phase to combine two reduced outputs from two different MapReduce jobs into one. Map-Join-Reduce adds a Join stage before the Reduce stage. MRShare [29], proposed by T. Nykiel et al., is a sharing framework that transforms a batch of queries into a new batch that can be executed more efficiently by merging jobs into groups and evaluating each group as a single query. Data skew is also an important factor that affects data transfer cost. To overcome it, a method was proposed that divides a MapReduce job into two phases: a sampling MapReduce job and an expected MapReduce job. The first phase samples the input data, gathers the inherent distribution of key frequencies, and then derives a partition scheme. In the second phase, the expected MapReduce job applies the partition scheme to every mapper so that the intermediate keys are grouped quickly.

E. Iterative Optimization: For iterative problems, MapReduce requires many input-output operations and unnecessary computations. Twister, proposed by J. Ekanayake, is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. It adds an extra Combine stage after the Reduce stage, and the output of the Combine stage flows into the Map stage of the next iteration. It avoids instantiating workers repeatedly: previously instantiated workers are reused for the next iteration with different inputs. HaLoop is similar to Twister. It is a modified version of the MapReduce framework that supports iterative applications by adding loop control.

F. Online: MapReduce Online raises the issue that frequent checkpointing and shuffling of intermediate results limit pipelined processing.
The MapReduce framework was modified so that Mappers periodically push their data, temporarily stored in local storage, to the Reducers in the same MapReduce job. In addition, map-side pre-aggregation is used to reduce communication. The Hadoop Online Prototype (HOP), proposed by Tyson Condie, is similar to MapReduce Online. HOP is a modified version of the MapReduce framework that lets users see early results from a job while it is being computed. D. Jiang et al. [28] found that the merge sort in MapReduce costs many I/Os and seriously affects the performance of MapReduce.

G. Short Job Optimizations: The focus is on improving the execution performance of short jobs on Hadoop. After analyzing the shortcomings of the job execution mechanism in standard Hadoop, an optimized version of Hadoop MapReduce was implemented to reduce the time consumed in executing a job. The first optimization reduces the time cost of the initialization and termination stages of a job by removing the constant cost of 4 heartbeats for its setup and cleanup tasks. The second optimization replaces the heartbeat-based pull-model task assignment with a push-model task assignment mechanism. The third optimization introduces an instant message communication mechanism for event notification between the JobTracker and the TaskTrackers, separating message communication from heartbeats.
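To make the cost of heartbeat-driven scheduling in short jobs concrete, the following sketch estimates the fixed overhead of the setup and cleanup tasks and the waiting time that pull-model assignment adds. It is a rough model under stated assumptions, not the mechanism implemented in the cited work: the 3-second heartbeat interval is an assumed value chosen for illustration (the real interval depends on the Hadoop version and cluster size), heartbeats are assumed to arrive at exact multiples of the interval, and all function names are hypothetical.

```python
# Back-of-the-envelope model of heartbeat-related delay for a short job.
# All names and the interval value are illustrative assumptions, not Hadoop APIs.

HEARTBEAT_INTERVAL_S = 3.0    # assumption: seconds between TaskTracker heartbeats
SETUP_CLEANUP_HEARTBEATS = 4  # from the text: setup and cleanup tasks cost 4 heartbeats


def setup_cleanup_overhead(interval_s: float = HEARTBEAT_INTERVAL_S) -> float:
    """Fixed time spent on job setup and cleanup in the heartbeat-driven flow."""
    return SETUP_CLEANUP_HEARTBEATS * interval_s


def pull_assignment_delay(ready_at_s: float,
                          interval_s: float = HEARTBEAT_INTERVAL_S) -> float:
    """Pull model: a task that becomes ready at ready_at_s waits for the next heartbeat.

    Heartbeats are assumed to occur at t = 0, interval_s, 2 * interval_s, ...
    """
    elapsed = ready_at_s % interval_s
    return 0.0 if elapsed == 0 else interval_s - elapsed


def push_assignment_delay(ready_at_s: float) -> float:
    """Push model: the JobTracker sends the task at once (network latency ignored)."""
    return 0.0


if __name__ == "__main__":
    print("setup/cleanup overhead (s):", setup_cleanup_overhead())
    print("pull delay for a task ready at t=4 s:", pull_assignment_delay(4.0))
    print("push delay for a task ready at t=4 s:", push_assignment_delay(4.0))
```

Under these assumptions, a job whose useful work takes only a few seconds spends a comparable amount of time waiting on heartbeats, which is why the three optimizations target heartbeat dependence rather than the map and reduce computation itself.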

H. Optimization of Data Import: In the traditional HDFS architecture, the Client uses a single input stream and buffer to import a local file, making the transfer process sequential. That is, the Client passes the input stream to the Datanode to copy the first block of the file and then waits for a response from the Datanode indicating that the transfer succeeded before continuing to the second block. This creates a bottleneck because all data from the file must pass through the Client before being transferred to the Datanodes. The last block of every file must wait for all previous blocks to finish before it is copied. The sequential transfer process is a hindrance, particularly when extremely large files are transferred from the local file system to HDFS. If the data can be accessed directly by the Datanodes, an alternate approach is possible that maintains much of the traditional process while allowing the initial data transfer into HDFS to bypass the buffer in the Client. In the implementation, the initial part of the process, in which the Client and Namenode communicate to determine where the file will be stored on the Datanodes, occurs normally. But instead of opening an input stream to the local file and passing it to the first Datanode through a socket, the new Client sends the file path, along with the offset in the file and the number of bytes to be copied, in a byte array. The Datanode then parses the incoming byte array to determine the path to the file on the local file system, how far to offset within that file, and how many bytes of the file to transfer to itself.

I. Task Assignment: To overcome the shortcomings of the FIFO scheduling strategy, multiple FIFO queues are added to the Map-Reduce framework in Hadoop. With this, several jobs can run at the same time in the Map-Reduce framework. The optimized map task assignment strategy consists of two parts:
1. Data locality scheduling strategy: Several job queues are added to the Map-Reduce framework so that the JobTracker (JT) can schedule more than one job into the running state at the same time, taking full advantage of the computing capacity of the nodes and shortening the average execution time of the jobs. Ultimately, this improves the performance and efficiency of the system.
2. Replica selection strategy: On the premise that all map tasks are scheduled to execute locally, the load of the system should also be considered. Load balancing can further enhance the performance and efficiency of the entire system. All the nodes in the Hadoop cluster are homogeneous, so the overhead of the operating system and hardware on each node can be assumed to be a constant value, referred to as A. The parameters used to describe the load information include the number of tasks in the run queue, the speed of system calls, the CPU context switching rate, the percentage of CPU idle time, the idle memory size, and so on.

IV. PROPOSED METHOD

To process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs that require quick responses. To speed up the execution of short jobs, optimization methods are required that improve the execution performance of MapReduce jobs. This can be achieved by improving communication between the JobTracker and the TaskTrackers. To compare the previous working of MapReduce with the optimized one, the system needs to be tested on an application. The K-means clustering algorithm would be considered for this purpose.

V. COMPARATIVE STUDY

With the task assignment strategy, all the map tasks can be assigned to the TaskTracker containing the input data fragments for the tasks. Taking load balancing into account, the improved model can effectively increase the throughput and reduce the average response time of the system.
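The load-aware replica selection described under Task Assignment (Section III-I) can be sketched as follows: among the nodes that hold a replica of a map task's input block, pick the one that is currently least loaded. This is a minimal illustration only. The load score is a hypothetical weighted combination of three of the metrics listed in the text, and the weights, field names, and normalization are assumptions rather than the formula used in [19].

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    """Load snapshot for one node; field names are illustrative assumptions."""
    name: str
    run_queue_len: int    # number of tasks in the run queue
    cpu_idle_pct: float   # percentage of CPU idle time (0 to 100)
    free_mem_mb: float    # idle memory size
    total_mem_mb: float


def load_score(node: NodeLoad) -> float:
    """Return a score where higher means more loaded. Weights are assumed, not from [19]."""
    cpu_busy = 1.0 - node.cpu_idle_pct / 100.0
    mem_used = 1.0 - node.free_mem_mb / node.total_mem_mb
    queue_pressure = min(node.run_queue_len / 10.0, 1.0)  # cap at 10 queued tasks
    return 0.5 * cpu_busy + 0.3 * mem_used + 0.2 * queue_pressure


def select_replica(replica_nodes: list[str], loads: dict[str, NodeLoad]) -> str:
    """Among nodes holding a replica of the block, pick the least loaded one."""
    return min(replica_nodes, key=lambda name: load_score(loads[name]))


if __name__ == "__main__":
    loads = {
        "node-a": NodeLoad("node-a", 8, 20.0, 1000.0, 4000.0),
        "node-b": NodeLoad("node-b", 1, 90.0, 3000.0, 4000.0),
        "node-c": NodeLoad("node-c", 0, 99.0, 3900.0, 4000.0),
    }
    # node-c is idle but holds no replica, so it is not a candidate.
    print(select_replica(["node-a", "node-b"], loads))
```

Restricting the candidates to replica holders preserves data locality, and the load score only breaks the tie among them. This matches the premise in the text that map tasks are first scheduled to run locally and load is considered afterwards.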
The process of importing data into Hadoop, as discussed under optimization of data import, has drawn significant attention in the high performance computing industry. Short job optimization has successfully increased the execution speed of standard Hadoop. Data transfer optimization helps to reduce the data transmission cost. The Combine stage for iterative optimization has increased the data processing speed for iterative processes. Proper selection of the dispatcher and parameters for Hadoop can increase its execution speed and performance. Various optimizations are available for increasing the performance of Hadoop, and one can choose among them according to the needs of the application.

VI. CONCLUSIONS

As an open source implementation of a cloud computing system, Hadoop is receiving more and more attention from academia and industry, and its applications are increasingly widespread. Though Hadoop shows good performance in processing large data sets concurrently, it still has some shortcomings. This paper describes the working of MapReduce, analyzes existing problems of the Hadoop data processing platform, and gives some suggestions for Hadoop cluster optimization.

ACKNOWLEDGMENT

We take this opportunity to express our deep sense of gratitude towards those who have helped us in various ways in preparing this paper. We would like to thank the reviewers of this paper for their constructive comments.

REFERENCES

[1] Jinshuang Yan, Xiaoliang Yang, Rong Gu, Chunfeng Yuan, and Yihua Huang, "Performance Optimization for Short MapReduce Job Execution in Hadoop," 2012 Second International Conference on Cloud and Green Computing.
[2] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, and Keqiu Li, "Big Data Processing in Cloud Computing Environments," 2012 International Symposium on Pervasive Systems, Algorithms and Networks.
[3] Benjamin Heintz, Chenyu Wang, Abhishek Chandra, and Jon Weissman, "Cross-Phase Optimization in MapReduce," 2013 IEEE International Conference on Cloud Engineering.
[4] Xiaohong Zhang, Guowei Wang, Zijing Yang, and Yang Ding, "A Two-phase Execution Engine of Reduce Tasks in Hadoop MapReduce," 2012 International Conference on Systems and Informatics (ICSAI 2012).
[5] Congchong Liu and Shujia Zhou, "Local and Global Optimization of MapReduce Program Model," 2011 IEEE World Congress on Services.
[6] Huang Lu, Hu Ting-ting, and Chen Hai-shan, "Research on Hadoop Cloud Computing Model and its Applications," 2012 Third International Conference on Networking and Distributed Computing.
[7] Lijie Xu, "MapReduce Framework Optimization via Performance Modeling," 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[8] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007.
[9] D. Jiang, A. Tung, and G. Chen, "Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters," Knowledge and Data Engineering, IEEE Transactions on, vol. 23, no. 9, 2011.
[10] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, "Mapdupreducer: detecting near duplicates over massive datasets," in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[11] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, "Mars: a mapreduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008.
[12] C. Zhang, F. Li, and J. Jestes, "Efficient parallel knn joins for large data in mapreduce," in Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012.
[13] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears, "Online aggregation and continuous query support in mapreduce," in ACM SIGMOD, 2010.
[14] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson, "Ricardo: integrating r and hadoop," in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[15] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1.
[16] T. White, Hadoop: The Definitive Guide. O'Reilly Media.
[17] Weijia Xu, Wei Luo, and Nicholas Woodward, "Analysis and Optimization of Data Import with Hadoop," 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[18] K. Shvachko, K. Hairong, S. Radia et al., "The Hadoop Distributed File System."
[19] Songchang Jin, Shuqiang Yang, and Yan Jia, "Optimization of Task Assignment Strategy for Map-Reduce," International Conference on Computer Science and Network Technology.
[20] Guangbin Xu, "Load balancing principle and algorithm implementation on the Linux cluster" [BE,OL], 1.html.
[21] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative mapreduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010.
[22] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall," Scientific Programming Journal, vol. 13, no. 4, Oct. 2005.
[23] K. Morton, A. Friesen, M. Balazinska, and D. Grossman, "Estimating the progress of MapReduce pipelines," in ICDE.
[24] Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, and Seungryoul Maeng, "HPMR: Prefetching and Pre-Shuffling in Shared MapReduce Computation Environment," IEEE International Conference on Cluster Computing and Workshops, 2009.
[25] Shubin Zhang, Jizhong Han, Zhiyong Liu, Kai Wang, and Shengzhong Feng, "Accelerating MapReduce with Distributed Memory Cache," IEEE International Conference on Parallel and Distributed Systems.
[26] Xin Daxin and Liu Fei, "Research on optimization techniques for Hadoop cluster performance" [J], Computer Knowledge and Technology, 2011, 8(7): 5484-5486.
[27] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[28] D. Jiang, B. Ooi, L. Shi, and S. Wu, "The performance of mapreduce: An in-depth study," Proceedings of the VLDB Endowment, vol. 3, no. 1-2.
[29] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas, "Mrshare: Sharing across multiple queries in mapreduce," Proceedings of the VLDB Endowment, vol. 3, no. 1-2.
