Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 89 (2016) 203–208

Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016)

Tolhit - A Scheduling Algorithm for Hadoop Cluster

M. Brahmwar, M. Kumar and G. Sikka
Dr. B.R. Ambedkar National Institute of Technology, Jalandhar 144 011, India

Abstract

With the ever-growing use of the Internet, a prodigious influx of data is being observed. MapReduce has become the pervasive programming model for processing a wide range of Big Data applications in cloud computing environments. Apache Hadoop, the most prominent implementation of MapReduce, is used to process and analyze such large-scale, data-intensive applications in a highly scalable and fault-tolerant manner. Several scheduling algorithms have been proposed for Hadoop with various performance goals. In this work, a new scheme is introduced that aids the scheduler in identifying the nodes on which stragglers can be re-executed. The proposed scheme uses the resource utilization and network information of the cluster nodes to find the most suitable node for scheduling the speculative copy of a slow task. The scheme has been evaluated through a series of experiments, and the performance analysis shows a 27% improvement in overall execution time over the Hadoop Fair Scheduler (HFS).

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of the Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016).

Keywords: Hadoop; MapReduce; Heterogeneous Cluster; Job Scheduling; Big Data.

1. Introduction

In this era of data science, data is considered a key source for promoting the growth and wellbeing of society. Discovering the affinities in data helps to associate information and formulate strategies efficiently. Every day, more than two quintillion bytes of data are created in this info-centric, digitized world from sources such as scientific instruments, web authoring, the telecommunication industry and social media. The effective storage and analysis of such a tremendous amount of data has therefore become a great challenge for the computing industry. Computing paradigms such as grid computing and cloud computing came into existence to tackle this problem, but their debugging, load balancing and scheduling solutions proved inefficient when dealing with data at this scale. As a result, several systems [1]–[3] were introduced to handle Big Data applications efficiently.

MapReduce [1] is the most widely adopted computational framework that applies adaptive and scalable distributed-computing techniques to the processing of large data sets. Programs written in this functional style are parallelized implicitly and executed on large clusters built from commodity hardware. The system itself is responsible for partitioning the input data, scheduling jobs across the machines, handling machine failures, and managing the required inter-machine communication at run time, which allows programmers with no experience of parallel and distributed systems to easily exploit such large distributed resources.

Corresponding author. Tel.: +918284095097. E-mail address: manvibrahmwar@gmail.com

doi: 10.1016/j.procs.2016.06.043
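As a concrete illustration of the model, consider word counting, the same workload used for the evaluation in Section 4: the programmer supplies only a map function and a reduce function, and the framework supplies everything else. The sketch below is a minimal variant of the WordCount example that ships with Hadoop, written against the Hadoop 2.x Java API; it is illustrative rather than the code used in this paper.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in this task's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups values by key; sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Input splitting, shuffling, failure handling and speculative re-execution, the mechanism this paper improves, all happen behind this interface.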

Apache Hadoop [4], created by Doug Cutting, is the most widely recommended open-source implementation of the MapReduce framework. It stores, processes and analyzes large data sets across clusters of commodity hardware in a reliable and fault-tolerant manner. Hadoop version 2.x comprises three components: HDFS [5], YARN [6] and MapReduce. The Hadoop Distributed File System (HDFS), implemented by Yahoo, is used by Hadoop to store input and output data. It divides the input data into fixed-size blocks and allocates these blocks to different data nodes; by default, Hadoop maintains a replication factor of 3, i.e. it replicates each input block onto 3 data nodes. Yet Another Resource Negotiator (YARN), also known as MapReduce version 2, is the foundation of the new generation of Hadoop; its fundamental idea is to split the two prominent functionalities of the JobTracker, resource management and job scheduling/monitoring, into two separate daemons.

The overall performance of a Hadoop cluster depends on its scheduler. However, the performance of Hadoop's default scheduler is observed to degrade in a non-homogeneous environment, and the scheduler is not effective at identifying the slow tasks that prolong the overall execution time. Many scheduling algorithms [7]–[13] have been proposed as extensions of Hadoop's default scheduling algorithm. In this work, a new scheduling approach is introduced that assists the Hadoop scheduler in finding the most suitable nodes on which speculative copies of stragglers can be executed in a heterogeneous Hadoop cluster. The proposed scheme, named Tolhit, uses resource utilization and cluster network information to identify the most appropriate nodes for executing slow tasks, so that the delay in overall execution time is reduced. In the experiments, an improvement of approximately 27% in execution time over the Hadoop Fair Scheduler (HFS) is observed.

The rest of the paper is organized as follows. Section 2 reviews the literature on Hadoop MapReduce scheduling. The proposed approach is described in Section 3. Section 4 presents the performance analysis of the proposed method, followed by the conclusion in Section 5.

2. Related Work

By default, Hadoop ships with three configurable scheduling policies: the FIFO scheduler, the Capacity Scheduler [7] and the Hadoop Fair Scheduler (HFS) [8]. The First In First Out (FIFO) scheduler does not ensure fair sharing among users and performs poorly for small jobs in terms of response time. The Capacity Scheduler was proposed by Yahoo to make cluster sharing among organizations possible by setting a minimum guaranteed capacity for each queue. Facebook proposed HFS to share the cluster fairly among users and applications. Beyond these, several scheduling algorithms have been proposed to cater to the needs of different kinds of workloads in different environments. As an important improvement over Hadoop's default scheduler, Zaharia et al. proposed LATE [9] for launching speculative tasks efficiently; the Longest Approximate Time to End (LATE) scheduling scheme was the first to take cluster heterogeneity into account.
Delay scheduling [10] also succeeds in improving data locality by compromising fairness slightly. Constraints such as the deadline of a job are considered in the scheduling algorithm of [11]. The Self-Adaptive MapReduce (SAMR) [12] scheduling policy, a supplement to the LATE algorithm, also considers historical information besides hardware heterogeneity in order to tune the weights of the map and reduce stages dynamically. It falls short, however, in that it does not consider other factors, such as jobs of different types and sizes, that may also affect the stage weights. To overcome the shortcomings of SAMR, the Enhanced Self-Adaptive MapReduce (ESAMR) [13] scheduling algorithm was introduced by Sun et al. ESAMR records and re-classifies the historical information for each job on every node using the K-Means [14] clustering algorithm; it tunes the stage weights dynamically and identifies slow tasks accurately.

2.1 Motivation

Hadoop follows a master-slave architecture. The master and the slave nodes in a Hadoop cluster communicate by exchanging heartbeat messages at regular intervals. Each heartbeat message is a potential scheduling opportunity for an application to run a container: whenever a heartbeat message reports that a node has an empty slot, a job is selected according to the configured scheduling policy.
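A minimal sketch of this heartbeat-driven dispatch loop is given below; the class and method names are hypothetical, since the paper does not show scheduler internals, and a real YARN scheduler is considerably more involved.

```java
import java.util.List;

// Hypothetical skeleton of heartbeat-driven scheduling; the interfaces
// below are illustrative stand-ins, not Hadoop API types.
class HeartbeatScheduler {
  private final SchedulingPolicy policy; // e.g. FIFO, Capacity or Fair

  HeartbeatScheduler(SchedulingPolicy policy) {
    this.policy = policy;
  }

  // Invoked each time a slave node's heartbeat arrives at the master.
  void onHeartbeat(NodeStatus node, List<Job> pendingJobs) {
    while (node.freeSlots() > 0 && !pendingJobs.isEmpty()) {
      Job job = policy.selectJob(pendingJobs); // pick per configured policy
      Task task = job.nextRunnableTask(node);  // ideally a node-local task
      if (task == null) {
        break;                                 // nothing schedulable on this node
      }
      node.launch(task);                       // launching consumes one free slot
    }
  }
}

interface SchedulingPolicy { Job selectJob(List<Job> jobs); }
interface NodeStatus { int freeSlots(); void launch(Task task); }
interface Job { Task nextRunnableTask(NodeStatus node); }
interface Task { }
```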

Hadoop's default scheduler assumes the cluster environment to be homogeneous, i.e. that each node in the cluster has the same configuration and computing capacity. In most real-world deployments, however, clusters operate in a heterogeneous environment, and Hadoop's overall performance may deteriorate if it keeps using its default scheduling policy there, because the differing computing capacities of the nodes cause a task's execution time to vary from node to node. Another common scenario is that some tasks take far longer to execute than the other tasks on a node. Such tasks are known as stragglers, and they are responsible for prolonging the execution time of the job. To prevent slow tasks from prolonging a job's overall execution time, MapReduce runs a backup copy of a slow task on another node with faster computation; this mechanism of identifying and re-executing stragglers is called speculative execution. Many scheduling algorithms [9]–[13] address speculative execution, but none of them is effective at identifying the nodes on which these backup tasks should be scheduled.

3. Proposed Method

This section presents the main elements of the Tolhit algorithm. The proposed algorithm maintains historical information, in the form of a History record, on every node in the cluster. Each record holds 5 values: the map stage weights (M1 and M2) and the reduce stage weights (R1, R2 and R3). Tolhit takes M1 to be the weight of executing the map function and M2 the weight of reordering the intermediate results of a map task, while R1, R2 and R3 stand for the weights of the copy, sort and reduce stages of a reduce task. The history stored on every node is classified into k clusters using a genetic-algorithm-based clustering scheme [13]. If a running job satisfies the TFMD (Threshold for Map Task Detection) on a data node, the algorithm assigns the job a temporary map stage weight (M1), which is used as a search key to identify the cluster with the closest map stage weight among the k clusters in the historical information. The stage weights of the identified cluster are then used to calculate the TTE (Time to End) of the job's map tasks on the node, following the same procedure as in [9]. If a running job has not completed any map task on a node, or does not meet the threshold, the average of the stage weights over all k clusters is used to estimate the TTE. A similar procedure is carried out when a running job satisfies the TFRD (Threshold for Reduce Task Detection). Using the stage weights calculated in the previous step, the slow tasks (both map and reduce tasks) are identified on the basis of their PS (Progress Score) and TTE, as in [13]. The algorithm is named Tolhit because of its ability to use more precise stage weights when estimating the TTE of running tasks. Once the stragglers have been identified, the nodes on which their backup copies should be re-executed are chosen on the basis of two parameters: the resource utilization of a node and its minimum spanning distance from the scheduler.
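The TTE estimate itself follows LATE [9] and the stage-weight bookkeeping of SAMR/ESAMR [12, 13]: a task's progress score is the weighted sum of its per-stage progress, its progress rate is that score divided by its elapsed running time, and TTE = (1 - PS) / rate. The sketch below assumes the stage weights have already been retrieved from the matching history cluster; names and structure are illustrative, not Tolhit's actual code, which is given as Algorithm 1.

```java
// Sketch of stage-weighted progress scores (SAMR/ESAMR style) and a
// LATE-style time-to-end estimate. The stage weights m1, m2, r1, r2, r3
// are assumed to come from the history cluster whose map stage weight is
// closest to the running job's observed weight.
class TteEstimator {

  // Progress score of a map task: M1 weights the map-function stage,
  // M2 the reordering of intermediate results. Each *Done is in [0, 1].
  static double mapProgress(double m1, double m2,
                            double mapFuncDone, double reorderDone) {
    return m1 * mapFuncDone + m2 * reorderDone;
  }

  // Progress score of a reduce task over its copy, sort and reduce stages.
  static double reduceProgress(double r1, double r2, double r3,
                               double copyDone, double sortDone, double reduceDone) {
    return r1 * copyDone + r2 * sortDone + r3 * reduceDone;
  }

  // LATE estimate: progressRate = PS / elapsed; TTE = (1 - PS) / progressRate.
  static double timeToEnd(double progressScore, double elapsedSeconds) {
    if (progressScore <= 0.0) {
      return Double.POSITIVE_INFINITY; // no progress yet: cannot extrapolate
    }
    double progressRate = progressScore / elapsedSeconds;
    return (1.0 - progressScore) / progressRate;
  }
}
```

A task whose TTE is far above that of its peers is flagged as a straggler and becomes a candidate for speculative re-execution.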
The cluster network information is maintained in a data structure named NwGraph, which records the number of nodes in the cluster and their minimum spanning distance from the scheduler. The resource utilization of the cluster nodes, such as RAM and disk utilization, is maintained in ResourceInfo and updated periodically. The disk utilization of a node is calculated by considering only the input/output requests of the tasks running on that node; for simplicity, non-local tasks, i.e. tasks that run on other nodes and access the disk over the network, are not taken into account. The network information and the resource utilization together determine the most suitable nodes for scheduling the backup tasks: the node with the lowest resource utilization and the smallest spanning distance from the scheduler is the best choice. The algorithm continues to launch speculative copies of slow tasks until no more stragglers are identified. When a job completes its execution, the historical information is re-classified into k clusters using the genetic-algorithm-based clustering technique. The algorithm thereby makes the speculative execution mechanism resource aware, compromising data locality to some extent. The pseudo code of Tolhit is presented in Algorithm 1. Tolhit uses the Genetic Algorithm (GA)-based clustering technique [15] to re-classify the historical information into k clusters on every node; the use of a GA avoids getting trapped in local optima. The pseudo code of the GA is given in Algorithm 2. The procedures for identifying slow tasks and for calculating the stage weights of map and reduce tasks are kept the same as in [13].
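The paper ranks candidate nodes by low resource utilization and small spanning distance from the scheduler but does not state an explicit combining formula, so the sketch below normalizes the two criteria and weights them equally; that weighting, like all names here, is an assumption made for illustration.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of Tolhit-style backup-node selection. NwGraph distances and
// ResourceInfo utilizations are modelled as plain fields; the equal
// weighting of load and distance is an assumption, not from the paper.
class BackupNodeSelector {

  static class NodeInfo {
    final String name;
    final double ramUtil;   // fraction of RAM in use, in [0, 1]
    final double diskUtil;  // fraction of disk bandwidth in use, in [0, 1]
    final int distance;     // minimum spanning distance from the scheduler

    NodeInfo(String name, double ramUtil, double diskUtil, int distance) {
      this.name = name; this.ramUtil = ramUtil;
      this.diskUtil = diskUtil; this.distance = distance;
    }
  }

  // Lower score = less loaded and closer to the scheduler.
  static double score(NodeInfo n, int maxDistance) {
    double utilization = (n.ramUtil + n.diskUtil) / 2.0;
    double normDistance = maxDistance == 0 ? 0.0 : (double) n.distance / maxDistance;
    return utilization + normDistance; // equal weighting: an assumption
  }

  // Pick the node on which to launch the speculative copy of a straggler.
  static NodeInfo pickBackupNode(List<NodeInfo> candidates) {
    int maxDistance = candidates.stream().mapToInt(n -> n.distance).max().orElse(0);
    return candidates.stream()
        .min(Comparator.comparingDouble(n -> score(n, maxDistance)))
        .orElseThrow(() -> new IllegalStateException("no candidate nodes"));
  }
}
```

In practice the relative weighting of load versus distance would itself be a tunable parameter.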

Algorithm 1. Tolhit
Algorithm 2. GA-Clustering Algorithm

4. Performance Evaluation

4.1 Experimental Setup

The performance of the proposed scheduling algorithm for speculative task re-execution has been evaluated through a series of extensive experiments on a 5-node heterogeneous cluster. Table 1 summarizes the configuration of the cluster used as the test bed.

Table 1. Test Bed.

Nodes              Specifications
Node A (Master)    8 GB RAM, 500 GB disk, i7 quad core
Node B (Slave 1)   4 GB RAM, 500 GB disk, i7 quad core
Node C (Slave 2)   2 GB RAM, 500 GB disk, Pentium dual core
Node D (Slave 3)   4 GB RAM, 500 GB disk, i5 quad core
Node E (Slave 4)   2 GB RAM, 500 GB disk, i5 quad core
Hadoop version     Hadoop 2.6

Fig. 1. Execution time of the Wordcount job using HFS.
Fig. 2. Execution time of the Wordcount job using the Tolhit scheduler.

4.2 In-cluster Experiments

To evaluate the performance of the algorithm, a Wordcount job is run. Wordcount is a MapReduce application that counts the occurrences of words in an input file. The values of TFMD, TFRD and k are kept at 0.2, 0.2 and 10 respectively, the settings shown in [13] to give the best results. The replication factor is set to 3 and the block size is kept at 128 MB (the default) throughout the experiment. The experimental data (1 GB and 2 GB) was collected from Twitter live streaming using Apache Flume. To obtain the execution times, 10 Wordcount jobs were run simultaneously. Figures 1 and 2 show the execution times of the 1 GB and 2 GB Wordcount jobs using HFS and the Tolhit scheduler respectively. The average execution times of the 1 GB and 2 GB Wordcount jobs under HFS and under the Tolhit scheduler are listed in Table 2a and Table 2b; these averages are calculated from the data depicted in Fig. 1 and Fig. 2.

Table 2a. Average time of the Wordcount job using HFS.

Nodes           Time (s), 2 GB    Time (s), 1 GB
Slave Node 1    22                12
Slave Node 2    88                44
Slave Node 3    33                17
Slave Node 4    62                37

Table 2b. Average time of the Wordcount job using Tolhit.

Nodes           Time (s), 2 GB    Time (s), 1 GB
Slave Node 1    16                9
Slave Node 2    64                32
Slave Node 3    28                14
Slave Node 4    49                25

5. Conclusions

Motivated by the real-world problem of node heterogeneity, we have analyzed speculative execution in heterogeneous Hadoop clusters. In this work, a new approach is proposed to assist the scheduler in identifying the nodes on which stragglers can be re-executed, so that the overall delay is reduced. The performance of the Tolhit scheduling algorithm has been evaluated through a series of experiments, and an overall improvement of approximately 27% in execution time over the Hadoop Fair Scheduler is observed. In future work, other factors that influence the selection of the optimal node can be considered.

References

[1] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51(1), pp. 107–113, (2008).
[2] Y. Gu and R. L. Grossman, Sector and Sphere: The Design and Implementation of a High Performance Data Cloud, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 367(1897), pp. 2429–2445, (2009).
[3] M. Isard, M. Budiu, Y. Yu, A. Birrell and D. Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, ACM SIGOPS Operating Systems Review, vol. 41(3), pp. 59–72, (2007).
[4] Apache Hadoop, http://hadoop.apache.org/.
[5] HDFS, http://hadoop.apache.org/hdfs.
[6] YARN, https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/yarn.html.
[7] Hadoop's Capacity Scheduler: an overview, http://hadoop.apache.org/docs/stable/capacity_scheduler.html.
[8] Hadoop's Fair Scheduler: an overview, http://hadoop.apache.org/docs/stable/fair_scheduler.html.
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz and I. Stoica, Improving MapReduce Performance in Heterogeneous Environments, In 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), (2008).
[10] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica, Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, In 5th European Conference on Computer Systems (EuroSys), (2010).
[11] K. Kc and K. Anyanwu, Scheduling Hadoop Jobs to Meet Deadlines, In Proc. CloudCom, pp. 388–392, (2010).
[12] Q. Chen, D. Zhang, M. Guo, Q. Deng and S. Guo, SAMR: A Self-Adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment, In 10th IEEE International Conference on Computer and Information Technology (CIT '10), Washington, DC, USA, IEEE Computer Society, pp. 2736–2743, (2010).
[13] X. Sun, C. He and Y. Lu, ESAMR: An Enhanced Self-Adaptive MapReduce Scheduling Algorithm, In IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS), pp. 148–155, December (2012).
[14] K-Means clustering, http://en.wikipedia.org/wiki/K-means_clustering.
[15] U. Maulik and S. Bandyopadhyay, Genetic Algorithm-Based Clustering Technique, Pattern Recognition (The Journal of the Pattern Recognition Society), (2000).