Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster


International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017)

ZHIHAO TENG and ZHENGPING JIN

ABSTRACT

The original task scheduling algorithm of Hadoop cannot meet the performance requirements of heterogeneous clusters. It assumes that every node has the same performance, so it performs well in homogeneous clusters. In a heterogeneous Hadoop cluster, however, differences in CPU, disk, and memory performance between nodes cause load imbalance. To address this imbalance, this paper proposes a new algorithm, the Load Balancing Algorithm based on Heterogeneous Environment (LBAHE), which takes the performance differences between nodes into account. When measuring node performance, the number of slots is no longer the only criterion; CPU, disk, memory, and other factors are also considered. Experiments show that the new algorithm completes tasks faster in heterogeneous clusters than the original algorithm.

KEYWORDS

load balancing, heterogeneous cluster, task scheduling, Hadoop

INTRODUCTION

Hadoop is one of the most important processing frameworks for big data. Load balancing has always been an important factor affecting a Hadoop cluster's performance, and growing data volumes challenge the processing capabilities of Hadoop clusters [1]. An inappropriate load balancing strategy wastes computing resources, increases execution time, and can even lead to system downtime. An appropriate load balancing strategy, on the other hand, can satisfy user demand and make rational use of resources to complete tasks as soon as possible. The current load balancing strategy of Hadoop performs well in homogeneous clusters.
However, in heterogeneous environments, differences in data processing capacity, disk usage, and file read frequency between nodes cause load imbalance in Hadoop clusters. The current Hadoop load balancing strategy does not take these differences into account, so tasks are allocated unreasonably and resources are used inefficiently.

Zhihao Teng, @163.com, Zhengping Jin, zhpjin@bupt.edu.cn, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, , China

In addition, when allocating tasks in Hadoop's MapReduce stage, the current strategy only considers whether the node has enough memory. CPU, disk, and I/O capabilities are not used as reference factors for a node's load capacity, which can lead to unreasonable task allocation. Therefore, Hadoop load balancing strategies for heterogeneous environments need further study [2].

In this paper, a novel Hadoop load balancing strategy is proposed for heterogeneous environments. The main contributions are as follows:

1. We analyze the existing Hadoop load balancing strategy in depth and improve it for heterogeneous environments.
2. Through experimental evaluation, we run the same tasks under the new and the original load balancing strategies and show that they finish in less time under the new strategy.

The second section introduces related work on load balancing. The third section describes the existing load balancing problems in heterogeneous environments and the new strategy proposed in this paper. The fourth section evaluates the efficiency of the strategy experimentally. The fifth section is the summary.

RELATED WORK

Load balancing has always been an important factor affecting the performance of Hadoop clusters. In this section, we introduce the development of Hadoop load balancing strategies in heterogeneous environments. Matei Zaharia et al. [3] proposed the LATE scheduling algorithm to improve MapReduce performance in an unbalanced environment. Guo et al. [4] put forward a new resource scheduling algorithm in which tasks are assigned to nodes whose resources are not fully utilized. Current studies of load balancing mainly include the following: Smriti R. Ramakrishnan et al. [5] studied the load balancing of Reduce tasks. Venkata Swamy Martha et al. [6] proposed the h-MapReduce model, which evaluates overloaded Reduce nodes.
Yuanquan Fan et al. [7] proposed the LBVP algorithm, which ensures that each Reduce task obtains roughly the same amount of data and thereby achieves load balancing. In [8], load balancing and file response time in heterogeneous clusters are studied, but the heterogeneity of node capacity is neglected. In [9], a proportional data placement strategy is proposed that takes the heterogeneity of nodes into account but ignores the influence of heterogeneous node storage space on data storage. In [10], a method for improving the efficiency of Hadoop data load balancing is proposed, which can balance the data load of each frame in a short time but does not consider the heterogeneity of nodes.

LOAD BALANCING STRATEGY

Hadoop consists of two core components: HDFS and MapReduce. MapReduce is a parallel computing framework that is mainly responsible for data processing. The JobTracker is the master of MapReduce and is responsible for scheduling tasks, which are distributed to different TaskTracker nodes. The TaskTracker is the MapReduce slave and is only responsible for executing the tasks assigned by the master. The JobTracker assigns tasks to TaskTrackers according to a scheduling algorithm. The default scheduling algorithm is shown in Figure 1:

Figure 1. Task-obtaining process of the default Hadoop scheduling algorithm. [1]

1. The TaskTracker counts the number of currently executing tasks (RunningTasks).
2. The TaskTracker determines whether the number of currently executing tasks is less than the fixed number of task slots. This fixed number of slots limits how many tasks can run simultaneously on a TaskTracker and is denoted FixedTasksCapacity.
3. AskForNewTask is a flag that indicates whether to obtain a new task. If RunningTasks is less than FixedTasksCapacity, the TaskTracker can accept new tasks and sets the flag to true; otherwise, it sets the flag to false.

We can see that when the JobTracker allocates tasks, it first collects this per-node information through the heartbeat mechanism. The execution of a MapReduce job consists of three main phases: Map, Shuffle, and Reduce. In the Map phase, the Partition function is called to assign data to the Reduce tasks:

PartitionNum = Hash(key) % ReduceNum    (3-1)

PartitionNum is the partition number and ReduceNum is the number of Reduce tasks. As the formula shows, the data is distributed evenly across the Reduce tasks. In a homogeneous environment, it is reasonable to call the Partition function to assign the Map results to the Reduce tasks equally. In heterogeneous clusters, however, the network bandwidth, CPU frequency, memory size, and disk read and write speed differ from node to node. As a result, node performance varies widely and the load of each node changes dynamically, which may cause some problems. For the same task, the load on different nodes is different. Assume that nodes A and B have X and 2X memory, respectively, and that each is 50% utilized. The remaining memory differs (0.5X versus X), so the number of new tasks each node can accept also differs. This results in an irrational allocation of tasks in heterogeneous clusters.
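The default slot check, formula (3-1), and the memory example above can be sketched in a few lines of Python. This is only an illustration and not Hadoop's actual implementation: the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for the Hash function, chosen because it is deterministic.

```python
import zlib


def ask_for_new_task(running_tasks: int, fixed_tasks_capacity: int) -> bool:
    """Default rule from Figure 1: ask for a task only if a slot is free."""
    return running_tasks < fixed_tasks_capacity


def partition_num(key: str, reduce_num: int) -> int:
    """Formula (3-1): PartitionNum = Hash(key) % ReduceNum.

    zlib.crc32 is an assumed stand-in for Hash, not Hadoop's own hash.
    """
    return zlib.crc32(key.encode("utf-8")) % reduce_num


def free_memory(total_memory: float, used_fraction: float) -> float:
    """Memory still available on a node."""
    return total_memory * (1.0 - used_fraction)


if __name__ == "__main__":
    # Slot rule: the decision ignores CPU, disk, and memory entirely.
    print(ask_for_new_task(running_tasks=1, fixed_tasks_capacity=2))
    print(ask_for_new_task(running_tasks=2, fixed_tasks_capacity=2))

    # Hash partitioning spreads keys over reducers without regard to node speed.
    counts = [0] * 4
    for i in range(1000):
        counts[partition_num(f"key-{i}", 4)] += 1
    print(counts)

    # Nodes A and B with X and 2X memory, both 50% used (X = 1024 MB here).
    print(free_memory(1024, 0.5), free_memory(2048, 0.5))
```

Neither the slot check nor the partition function receives any information about node capacity, which is why both treat a fast node and a slow node identically.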
If a TaskTracker is a high-performance computing node but cannot get more tasks, it is in a "starvation" state. If a TaskTracker is a low-performance computing node that keeps receiving new tasks it cannot run, it is in a "saturation" state, which results in overloading. Moreover, different tasks have different resource requirements while they run. Some tasks are computation-heavy, i.e., CPU-intensive, while others require a large number of I/O operations. However, as shown above, the default Hadoop scheduling algorithm assigns tasks based only on the number of slots of each TaskTracker node, which can cause unreasonable assignment. In addition, the

following situation may occur: a node has enough slots to accept a new task but does not have enough CPU resources. Because the task needs CPU to run, assigning it aggravates CPU contention, which leads to node load imbalance and slows down the overall job.

A new algorithm is proposed to address these problems of the default task allocation. It takes the heterogeneity of the cluster into account, so that it can assign different numbers of tasks to different nodes. It also uses CPU, disk, memory, and other resources as evaluation criteria for a node's load capacity, which is more comprehensive than the default method. The algorithm steps are as follows:

1. The JobTracker accepts the job and starts the tasks.
2. Each TaskTracker detects its node's resource information (number of slots, CPU, disk, memory, etc.) and the execution status of its tasks.
3. The TaskTracker passes the detected information to the JobTracker via the heartbeat mechanism.
4. Taking heterogeneity into account, the JobTracker decides whether to assign new tasks to the TaskTracker according to the information transmitted, the actual load of each node, and its ability to run tasks.

When the new algorithm detects resource information, it checks not only whether there are enough slots but also CPU, disk, memory, and other node information. All of this information is combined to judge the load capacity of a node, which avoids the case where slots are sufficient but a shortage of other resources causes load imbalance. When the node information is delivered to the JobTracker, task assignment is decided according to that information and the actual situation of each node. This process takes into account the differences in performance and load capacity between nodes.
Therefore, nodes with different performance have different overload thresholds, and the new algorithm judges the state of each node against its own threshold. In homogeneous clusters, the hash function distributes data evenly across the slave nodes. In a heterogeneous environment, however, we must first evaluate the performance of each node. For CPU, because the number of cores differs between nodes, we first determine the number of cores, then measure the CPU load, and combine the core count and the load to obtain the node's overall CPU usage. Different thresholds are set for nodes with different core counts, and a node is considered overloaded when its threshold is exceeded. Similarly, for disk, a single uniform standard cannot be used to judge different nodes: a larger threshold is set for a node with a larger disk, and a smaller threshold is set for a node with a smaller disk. In this way, nodes with different disks are judged by different criteria.

The slave node periodically sends a heartbeat to the master node over the RPC protocol, which includes memory, CPU, disk, and other information. The master node assigns tasks to slave nodes according to the information of each node and a scheduling algorithm. After the master receives the node information from a slave, it examines the memory, CPU, disk, and other information of that node. If none of these resources is overloaded, a new task is assigned to the node. If any of the reported resources is overloaded, i.e., its threshold is exceeded, no new task is assigned to the node for the time being, but the master continues to monitor it. At the next heartbeat, if the load of the node has returned to its normal level and it has enough resources to run a new task, it gets a new task.
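The decision the master makes on each heartbeat can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Heartbeat` field names are hypothetical, and the threshold factors (0.8 load per core, 80% of memory, 90% of disk) are invented for the example, since the paper does not give concrete values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heartbeat:
    """Resource report a slave sends to the master (field names are hypothetical)."""
    node: str
    slots: int
    running_tasks: int
    cpu_cores: int
    cpu_load: float        # runnable-process load, e.g. a load average
    mem_used_gb: float
    mem_total_gb: float
    disk_used_gb: float
    disk_total_gb: float


def cpu_threshold(cpu_cores: int) -> float:
    """Allowed CPU load grows with the core count (0.8 per core is assumed)."""
    return 0.8 * cpu_cores


def mem_threshold_gb(mem_total_gb: float) -> float:
    """Allowed memory use scales with the node's memory (80% is assumed)."""
    return 0.8 * mem_total_gb


def disk_threshold_gb(disk_total_gb: float) -> float:
    """A larger disk gets a larger absolute threshold (90% is assumed)."""
    return 0.9 * disk_total_gb


def overloaded_resources(hb: Heartbeat) -> List[str]:
    """Return the resources whose node-specific threshold is exceeded."""
    over = []
    if hb.cpu_load > cpu_threshold(hb.cpu_cores):
        over.append("cpu")
    if hb.mem_used_gb > mem_threshold_gb(hb.mem_total_gb):
        over.append("memory")
    if hb.disk_used_gb > disk_threshold_gb(hb.disk_total_gb):
        over.append("disk")
    return over


def should_assign_task(hb: Heartbeat) -> bool:
    """Assign only if a slot is free and no resource is overloaded."""
    has_free_slot = hb.running_tasks < hb.slots
    return has_free_slot and not overloaded_resources(hb)


if __name__ == "__main__":
    strong = Heartbeat("Slave2", slots=2, running_tasks=1, cpu_cores=4, cpu_load=1.0,
                       mem_used_gb=1.0, mem_total_gb=2.0,
                       disk_used_gb=5.0, disk_total_gb=20.0)
    weak = Heartbeat("Slave4", slots=2, running_tasks=1, cpu_cores=1, cpu_load=1.5,
                     mem_used_gb=0.5, mem_total_gb=1.0,
                     disk_used_gb=9.5, disk_total_gb=10.0)
    for hb in (strong, weak):
        print(hb.node, should_assign_task(hb), overloaded_resources(hb))
```

Both example nodes report a free slot, so the default rule would hand each of them a task. Under the per-node thresholds, the weaker node is skipped until a later heartbeat shows that its CPU and disk load have dropped back below its own limits.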

EXPERIMENTAL EVALUATION

To verify the performance of the new algorithm, running time in the heterogeneous Hadoop system is used as the criterion. The running time of a task includes the response time and the actual execution time. Response time is the interval from the submission of a task to the start of its execution, and it reflects the system's ability to provide services. The shorter the execution time, the stronger the system's ability to handle the task. The shorter the overall running time, the more reasonable the task dispatching and the better the load condition.

Experimental environment: the experiments require sufficient tasks and a heterogeneous cluster environment. Because the experiments require an unbalanced cluster, a heterogeneous Hadoop cluster is built from virtual machines. Each VM is a Hadoop node, and the nodes' hardware differs: three nodes have 2G of memory and two nodes have 1G. One node is set up as the JobTracker, and the other four are set up as TaskTrackers. All hard disk space is 20 GB. The virtual machines run CentOS with JDK version jdk-7u79-linux-x64 and Hadoop. The configuration of the cluster environment is shown in Table 1. Different shell scripts are run on the slave nodes to consume their limited resources.

In this environment, different tasks are executed under each algorithm, and each task is executed many times. The following experimental results are obtained: Figure 2 shows the running time of the two algorithms for different input data sizes. The line labeled "new" represents the running time of the improved algorithm, and the line labeled "original" represents that of the original algorithm. As the figure shows, the two algorithms are run on 100M, 200M, 500M, 1024M, and 2048M of data, and the improved algorithm takes much less time.
This shows that, in the same heterogeneous environment, the new algorithm allocates tasks more reasonably, uses resources more rationally, and completes tasks more quickly. This demonstrates that the new algorithm has better load balancing capability and is effective.

Table 1. Hadoop cluster configuration information.

Host name   Memory   Disk
Master      2G       20G
Slave1      2G       10G
Slave2      2G       20G
Slave3      1G       20G
Slave4      1G       10G

Figure 2. Experimental results. Axes: execution time (s) versus size of task (M).

Figure 3. Experimental results of the original algorithm. Axes: task ratio/performance ratio versus size of task (MB).

Figure 4. Experimental results of the new algorithm (LBAHE). Axes: task ratio/performance ratio versus size of task (MB).

The task ratio is the ratio of the tasks a node actually runs to the total number of tasks run throughout the job. The performance ratio is the node's share of the total cluster performance, which reflects the share of tasks the node should be allocated. The quotient of the two is used to measure the load status of nodes in a cluster. In a reasonably loaded system, the ratio of the task ratio to the performance ratio should be close to 1. A value close to 1 means that a node of a given capability runs a correspondingly sized share of the tasks. Figure 3 shows the experimental results of the original algorithm for two selected nodes. As the data show, under the original algorithm the ratio values all lie clearly above or below 1, which indicates that the task assignment is unreasonable and the load is unbalanced. Under the new algorithm (LBAHE), the ratio value fluctuates at or slightly above 1, which indicates that the task assignment is reasonable and the load is better balanced. The experimental results show that the new algorithm (LBAHE) is effective for load balancing in heterogeneous clusters.

SUMMARY

In this paper, we address the Hadoop load balancing problem in heterogeneous environments. First, we analyze the causes of load imbalance in an unbalanced cluster: the performance of the nodes is inconsistent, and the criterion used to judge load is too simple. To address these two problems, a new algorithm is proposed that adapts to heterogeneous Hadoop clusters. It can be seen from the experiments

that, in a heterogeneous cluster with different load conditions on each node, the new algorithm shows good performance.

ACKNOWLEDGEMENTS

This work is supported by NSFC (Grant No ), and the Fundamental Research Funds for the Central Universities (Grant No. 2015RC23).

REFERENCES

1. Apache. (2012, Aug.). Hadoop, The Apache Software Foundation, Forest Hill, MD, USA. [Online]. Available:
2. Xiaolong Xu, Lingling Cao, and Xinheng Wang. Adaptive Task Scheduling Strategy Based on Dynamic Workload Adjustment for Heterogeneous Hadoop Clusters. IEEE Systems Journal, 2016, 10(2):
3. Zaharia M., Konwinski A., Joseph A.D., et al. Improving MapReduce Performance in Heterogeneous Environments [C]//OSDI. 2008, 8(4):
4. Guo Z., Fox G. Improving MapReduce performance in heterogeneous network environments and resource utilization [C]//Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012). IEEE Computer Society, 2012:
5. Ramakrishnan S.R., Swart G., Urmanov A. Balancing Reducer skew in MapReduce workloads using progressive sampling [C]//Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012:
6. Martha V.S., Zhao W., Xu X. h-MapReduce: A Framework for Workload Balancing in MapReduce [C]//Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on. IEEE, 2013:
7. Fan Y., Wu W., Cao H., et al. LBVP: A load balance algorithm based on Virtual Partition in Hadoop cluster [C]//Cloud Computing Congress (APCloudCC), 2012 IEEE Asia Pacific. IEEE, 2012:
8. Liu Kun, Niu Wenliang. An improved Hadoop data load balancing algorithm [J]. Journal of Henan Polytechnic University: Natural Science Edition, 2013, 32(3):
9. Xie J., Yin S., Ruan X., et al. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters [C]//Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010:
10. Liu Kun, Xiao Lin, Zhao Haiyan. Research and optimization of cloud data load balancing in Hadoop. Microelectronics & Computer, 2012, 29(9):
