An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud


Volume 118 No. , ISSN: (printed version); ISSN: (on-line version); url: ijpam.eu

An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud

D. M. Kalai Selvi 1 and Dr. P. Ezhumalai 2
1 M.E. (CSE), R.M.D Engineering College, Kavaraipettai, Chennai, Tamil Nadu, dmkalai@gmail.com
2 HOD of CSE Department, R.M.D Engineering College, Kavaraipettai, Chennai, Tamil Nadu

January 4, 2018

Abstract

Hadoop is an open-source implementation of the MapReduce framework that is widely used for parallel processing in big data analytics. Hadoop users accept the resource plans and offers provided by cloud service providers to lease the resources needed to process their Hadoop jobs and pay per use. However, there is no efficient resource provisioning mechanism that can complete a job within a specified deadline; it is the user's responsibility to manually adjust the resources required to process their Hadoop jobs. We therefore propose an improved HP (Hadoop Performance) model that uses the Hive bucketing and partitioning concepts to reduce the long duration of the shuffle sort and complete the job efficiently within the desired deadline. Furthermore, efficient resource provisioning is obtained by using a feedback system and a user profile to determine each user's exact needs, and we suggest an optimal plan for their resource requirements.

Key Words: Hadoop; MapReduce; Cloud Service Providers; Hive; Resource Provisioning; Bucket.

1 Introduction

Raw data is evolving rapidly from sources all around us, such as sensors, blogs, social networking sites, and aviation. This has led to the phenomenon called Big Data, which has moved IT solutions into another dimension, where not only computer science but also fields such as electronics, civil engineering, and transport need IT solutions for their business. For example, police departments in the U.S. install traffic cameras that read the license plates of automobiles to identify stolen vehicles in real time. The Hadoop framework can handle both structured and unstructured data; across whole big data sets, about 80% of the data is unstructured and only 20% is structured. [1] Based on the challenges it poses, Big Data can be characterized by four dimensions:

VOLUME: to manage large volumes of data.
VELOCITY: to manage rapidly arriving data.
VARIETY: to manage both structured and unstructured data.
VERACITY: to validate the correctness of large amounts of arriving data.

Cloud service providers such as Amazon EC2 enable users to configure the resources needed for their applications, but in its current form the EC2 cloud does not support Hadoop jobs with deadline constraints. It is therefore the sole responsibility of the user to assign the resources necessary to complete the job within the specified deadline, which is a highly challenging task, so it is necessary to focus on Hadoop performance modeling, which is itself a critical task. A Hadoop job has multiple processing phases, with three core phases: the map phase, the shuffle phase, and the reduce phase. The first wave of the shuffle phase is processed in parallel with the map phase and is called the overlapping stage. The other waves of the shuffle phase are processed

after the map phase completes; this is called the non-overlapping stage. The existing Hadoop performance model does not have the capability to automate resource provisioning and complete a Hadoop job within the specified constraints. Furthermore, the number of reduce slots is strictly constant. We therefore propose an improved Hadoop performance model that automatically configures the resources required by the end user and completes the job within the deadline. The contents of this paper are organized as follows. In Section 2 we review the related literature. Section 3 presents the problem analysis, Section 4 our proposed work, and Sections 5 and 6 our experimental results and conclusion.

2 Literature Review

[2] Starfish is a MADDER (MAD: Magnetism, Agility, Depth) self-tuning system for big data analytics. The main goal of Starfish is to enable Hadoop users and applications to get good performance automatically throughout the data lifecycle, without having to understand the users' needs manually and allocate resources by tuning various knobs. Starfish collects detailed information about a Hadoop job at a very fine granularity to automate job estimation, which is a burden. [3] The general problem in forming a cluster is to determine the resources and configure them for a MapReduce job so as to meet the desired constraints, such as cost, deadline, and execution time, for a given big data analytics workload; this is referred to as the cluster sizing problem. In this paper, Elastisizer, a system added on top of the Hadoop stack, automates the allocation of resources to meet the requirements of massive data analytics; Elastisizer provides reliable answers to cluster sizing queries in an automated fashion by using detailed Hadoop job profiles. However, Elastisizer needs detailed job profile information, which increases the overhead and leads to high execution times.
[4] In this paper, the proposed Hadoop performance model considers both overlapping and non-overlapping stages. It uses a scaling factor to automatically scale the resource allocation up or down within the specified timeframe. Although this Hadoop performance model increases the possibility of automatic resource allocation, it relies on simple linear regression, and the HP model is restricted to a constant number of reduce tasks. [5][6] CRESP provides automatic estimation of job execution time and resource provisioning in an optimal manner. In general, multiple waves give better performance than a single wave of the reduce phase, but in CRESP the number of reduce tasks must equal the number of reduce slots, and it considers only a single wave of the reduce phase. Traditionally, a Hadoop cluster environment is assumed to be homogeneous, but heterogeneous Hadoop clusters are now proliferating because many companies run ever more parallel data-intensive applications. Due to the heterogeneity of resources in such clusters, Hadoop job completion times become inefficient and resource bottlenecks arise. In [9] the authors propose bound-based performance modeling of MapReduce job completion times in heterogeneous Hadoop clusters, but they neither explore the properties of Hadoop job profiles nor analyze the breakdown of Hadoop job execution on the different types of heterogeneous nodes in the cluster. In [10], the authors propose a framework called ARIA (Automatic Resource Inference and Allocation) for MapReduce environments to solve the problem of automatic resource allocation. It contains three inter-related components: first, a job profile that summarizes the performance characteristics of the underlying application during the map and reduce stages; second, a MapReduce performance model for the given job and its SLO (Service Level Objective) that estimates the amount of resources required to complete the job within the deadline; and finally an SLO scheduler based on EDF (Earliest Deadline First) that determines job ordering and allocates resources. The major disadvantage of this framework is that it does not consider node failures.
In [11] the authors propose a framework called AROMA (Automated Resource Allocation and Configuration of MapReduce Environments). It was developed to automate resource provisioning in heterogeneous clouds and to tune Hadoop configuration parameters so as to achieve performance goals while minimizing the incurred cost. The major disadvantages of this system are its ineffective configuration and its lack of support for multi-stage job workflows. [12] introduces the Parallax progress estimator, which estimates performance by predicting the remaining time of MapReduce pipelines from the time already elapsed. The key strategy the authors use is to divide the MapReduce process into five key pipelines and estimate the remaining time of each from the time elapsed so far. However, this approach does not provide an explicit cost function that can be used in optimization problems. [13] also proposes optimal resource provisioning with minimal cost, but with some prediction error. To improve the efficiency and reliability of the Hadoop performance model, to provide a user-friendly automated resource provisioning scheme, and to complete jobs effectively within the predefined constraints, an improved Hadoop performance model is proposed. The major contributions of this paper are:

The improved Hadoop performance model covers all the core phases of the MapReduce framework, i.e., the map, shuffle, partitioner, combiner, and reduce phases.

The improved HP model uses the Hive bucketing and partitioning concepts to reduce the long duration of the shuffle sort and complete the job efficiently within the desired deadline.

Furthermore, efficient resource provisioning is obtained using a feedback system and a user profile to determine each user's exact needs, and an optimal plan for their resource requirements is suggested.

The performance of the improved HP model is evaluated on the Amazon EC2 cloud, and the evaluation results show that the improved HP model outperforms both the HP and Starfish models.

3 Problem Analysis

3.1 Modeling Hadoop Job Phases

In Hadoop, any job that needs parallel processing and high performance enters the MapReduce framework. The entered job passes through three core phases: the map, shuffle, and reduce phases. In the existing system, fine-grained information about the job is collected [4], which is useful to complete

the job within the deadline. Map tasks are executed in map slots and reduce tasks in reduce slots; each slot runs one task at a time, and slots are assigned in terms of CPU and RAM. The map and reduce phases can be executed in a single wave or in multiple waves. [7]

3.1.1 Map Phase

The map phase accepts an input dataset, in key-value pair format, from the Hadoop Distributed File System (HDFS). Datasets intended for parallel processing are mostly stored in NoSQL databases: key-value stores, column-family stores, graph databases, or document databases. The map phase first splits the input dataset into blocks (64 MB by default), where each block is treated as a map task and assigned a map slot. After the map phase is processed, the intermediate key-value results are stored in a buffer rather than on disk, to avoid the complications of replication. The total execution time for the map phase is

T_mtotal = (T_mavg * N_m) / N_mslot    (1)

3.1.2 Shuffle Phase

The process of transferring the map output, in sorted order, to the reduce phase in order to minimize the reducer's work is called the shuffle. In this phase, the Hadoop job fetches the intermediate map output and copies it to one or more reducers. If N_r <= N_rslot, the shuffle phase completes in a single wave, and its total execution time is

T_stotal = (T_savg * N_r) / N_rslot    (2)

Otherwise, the shuffle phase completes in multiple waves, and its total execution time is

T_stotal = ((T_W1avg * N_W1) + ... + (T_Wnavg * N_Wn)) / N_rslot    (3)

Fig. 1 describes the job execution flow of the existing Hadoop performance model.
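As an illustration of the phase equations above, the following is a minimal Python sketch that evaluates Eqs. (1)-(3). It is our illustration, not part of the paper's implementation, and all variable names and the sample numbers are assumptions.

```python
# Illustrative evaluation of the HP-model phase-time equations (1)-(3).
# Variable names mirror the paper's notation (T_mavg, N_m, N_mslot, ...).

def map_phase_time(t_m_avg, n_m, n_m_slot):
    """Eq. (1): average map-task time * number of map tasks / map slots."""
    return t_m_avg * n_m / n_m_slot

def shuffle_phase_time(wave_avgs, wave_counts, n_r_slot):
    """Eq. (2) when a single (avg, count) wave is given; Eq. (3) when
    several waves are given: sum over waves, divided by reduce slots."""
    return sum(t * n for t, n in zip(wave_avgs, wave_counts)) / n_r_slot

# Example: 64 map tasks averaging 12 s on 8 map slots -> 96 s.
print(map_phase_time(12.0, 64, 8))                   # 96.0
# Two shuffle waves (30 s x 8 tasks, 25 s x 4 tasks) on 4 reduce slots -> 85 s.
print(shuffle_phase_time([30.0, 25.0], [8, 4], 4))   # 85.0
```

A single-wave shuffle is just the special case of one (average, count) pair, so Eq. (2) needs no separate code path.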

Fig. 1. Hadoop job execution flow. Table 1.1 defines the variables used.

3.1.3 Reduce Phase

The reduce phase is a user-defined function in which users can customize their processing logic. The customized reduce function processes the intermediate map output and produces the final output, which is usually stored in HDFS. The Hadoop performance model supports both overlapping and non-overlapping stages. The total execution time for the reduce phase is

T_rtotal = (T_ravg * N_r) / N_rslot    (4)

4 Proposed Work

4.1 Improved Hadoop Performance Model: Job Estimation Model

The improved Hadoop performance model handles both overlapping and non-overlapping stages. Generally, when processing in multiple waves, the first wave of the shuffle phase starts immediately after the first wave of the map phase completes, but the shuffle can finish only after all waves of the map phase have run, which creates a long execution time for the shuffle phase; this can be overcome with Hive. When an existing data infrastructure based on a relational database is moved into Hadoop, this can be achieved with Hive, whose HQL (Hive Query Language), similar to SQL (Structured Query Language), is used to fetch and process data from a structured data infrastructure. [7][8] The two most important Hive concepts used to overcome the shuffle-phase latency are bucketing and partitioning. When the input datasets are bucketed on a column using a hash function, along with partitioning, the overhead of large datasets is reduced; the bucketing and partitioning commands are issued in HQL through the command-line interface or a web user interface according to user needs. Because the datasets are already bucketed, i.e., sorted on columns, the shuffling work is reduced, which improves performance and further helps to complete the job within the deadline. The estimated

execution time for a Hadoop job is finalized from the user profile, which collects detailed information about Hadoop jobs such as data size, map tasks, reduce tasks, and their durations. When a job processes an input dataset of increasing size, the number of map tasks increases proportionally, while the number of reduce tasks is specified by the user in the configuration file and can vary depending on the user's configuration. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the size of the input dataset, as assumed in the HP model. This is because the volume of an intermediate data block equals the total volume of the generated intermediate data divided by the number of reduce tasks; as a result, the volume of an intermediate data block also increases linearly with the size of the input dataset. However, when the number of reduce tasks varies, the execution durations of the shuffle and reduce tasks are no longer linear in the size of the input dataset. In the improved HP model, we consider a varied number of reduce slots based on the user profile, which records the input data sizes, execution times, and numbers of map and reduce slots from past runs and thus helps to estimate the varied number of reduce slots.

4.2 Resource Provisioning Model

Starfish and other Hadoop performance models [2][3][4][5] support manual or self-tuning of the resources needed for job execution. Resource provisioning plays a key role, since a job can be completed within a specified deadline, say t, only when sufficient resources have been provided for its processing. The improved HP model supports automatic resource provisioning, without manipulating knob parameters, by using a user feedback system.
The user feedback system collects fine-grained information about the history of already processed jobs and assigns a weight or priority to each job based on previous logs, so that highly critical tasks are provided resources immediately and automatically. Since even highly critical tasks are provisioned proactively, this helps complete Hadoop jobs within the specified constraints.
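To make the provisioning idea concrete, here is a hedged Python sketch, our illustration rather than the paper's code, that combines the phase estimates of Eqs. (1), (2), and (4) into a total job time and then picks the cheapest resource plan whose estimate meets the deadline. The plan and job figures are invented for the example.

```python
# Deadline-aware plan selection sketch based on the HP-model phase equations.

def total_job_time(t_m_avg, n_m, t_s_avg, t_r_avg, n_r, n_m_slot, n_r_slot):
    """Eqs. (1), (2), (4): map + shuffle + reduce time (single shuffle wave)."""
    return (t_m_avg * n_m / n_m_slot
            + t_s_avg * n_r / n_r_slot
            + t_r_avg * n_r / n_r_slot)

def pick_plan(plans, job, deadline):
    """Return the cheapest plan whose estimated completion time <= deadline,
    or None when no plan is feasible."""
    feasible = [p for p in plans
                if total_job_time(job["t_m_avg"], job["n_m"], job["t_s_avg"],
                                  job["t_r_avg"], job["n_r"],
                                  p["m_slots"], p["r_slots"]) <= deadline]
    return min(feasible, key=lambda p: p["price"]) if feasible else None

# Hypothetical job profile: 80 map tasks at 10 s, 8 reduce tasks,
# 20 s average shuffle and 15 s average reduce per task.
job = {"t_m_avg": 10.0, "n_m": 80, "t_s_avg": 20.0, "t_r_avg": 15.0, "n_r": 8}
plans = [{"m_slots": 4,  "r_slots": 2, "price": 20},
         {"m_slots": 8,  "r_slots": 4, "price": 40},
         {"m_slots": 16, "r_slots": 8, "price": 80}]
print(pick_plan(plans, job, deadline=200.0)["price"])  # 40
```

With a 200 s deadline the 4-slot plan (estimate 340 s) is infeasible, so the model settles on the 8-slot plan at price 40 rather than overprovisioning the 16-slot plan, which is the behavior the feedback-driven provisioning above aims for.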


5 Performance Evaluation

5.1 Experimental Setup

The performance of the improved Hadoop performance model was evaluated on two experimental setups, the Amazon EC2 cloud and the Google cloud. The experimental Hadoop cluster was set up on the Amazon EC2 cloud using 20 m1.large instances. The specifications of the m1.large. In this cluster, we used Hadoop and configured one instance as the NameNode and the other 19 instances as DataNodes; the NameNode was also used as a DataNode. The HDFS data block size was set to 64 MB and the replication level of each data block was set to 3. Each instance was configured with one map slot and one reduce slot. We ran the WordCount application on the Hadoop cluster and employed Starfish to collect the job profiles. For each application running on each cluster, we conducted 10 tests; each test was run 5 times and the average phase durations were taken.

Fig. 4. The performance of the WordCount application with a varied number of reduce tasks.

The following graph describes the performance of the improved HP model versus the HP model for job estimation time.

The following graph describes the estimated resource provisioning and compares the resource provisioning performance of the HP model versus the improved HP model.

The improved Hadoop performance model was also evaluated in the Google cloud; the project scenario used for that evaluation is described here. In the Google cloud, a file directory holding a list of files representing famous song details (the big data) was stored. A list of plans specifying the configuration of map and reduce slots was stored in a separate Hive table, the contents of the file were stored in a Hive table in comma-separated format, and the details of the user profile regarding the required resource configuration, together with the user feedback, were stored in a Hive database. The file details were analyzed, the big data was partitioned into buckets using a hash function on the major columns, and the list of optimal resource plans was displayed along with the optimal resource configuration. Using Compute Engine, a virtual machine was initialized, and Hadoop was installed in that VM along with Hive 1.2.x and Pig for manipulating the big data. The file directory was transferred from the local file system to the Hadoop Distributed File System (HDFS).

The file, stored in .csv format, contains rows such as:

9,98993,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
9,105156,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album

9,113452,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
9,119398,2,Bengali,Contemporary celtic,3,anjan Dutt,free,movie
9,130766,3,Bengali,Contemporary celtic,2,anjan Dutt,free,movie
9,133420,3,Kannada,Carnival songs,2,anu Malik,free,movie
10,11091,4,Kannada,Carnival songs,2,anu Malik,free,movie
10,19275,3,Kannada,Carnival songs,2,anu Malik,free,movie
10,21140,4,Kannada,Carnival songs,2,anu Malik,free,album
10,44487,3,Kannada,Carnival songs,1,anu Malik,free,album
10,49147,1,Kannada,Carnival songs,1,anu Malik,free,album
10,54923,5,Kannada,Contemporary celtic,1,anu Malik,free,album
10,63451,3,Kannada,Contemporary celtic,1,anu Malik,free,album
10,75206,3,Kannada,Contemporary celtic,1,anu Malik,paid,movie
10,104641,1,Kannada,Contemporary celtic,5,anu Malik,paid,movie
10,110054,3,Kannada,Contemporary celtic,5,anu Malik,paid,movie

Command to load file.csv (Pig) and count its rows:

LOGS = LOAD '/user/root/songs.csv';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
STORE LOG_COUNT INTO '/user/hive/warehouse/resource.db/filedetails/songs.txt';

use resource;
drop table if exists filedetails;
CREATE EXTERNAL TABLE filedetails (totalrow string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/resource.db/filedetails/songs.txt';

Command to analyze the plan:

set hive.cli.print.header=true;
use resource;
drop table if exists analysis;

create table analysis as
select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size
from plans a JOIN filedetails b
where a.size >= cast(b.totalrow as int);

select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size, b.feedback, b.name
from analysis a JOIN feedback b
where a.id = b.planid;

Sample contents stored in Hive table "analysis" (id | memory | processor | ssd | transfer | price):
3 | 2GB | 2 Core | 40 GB | 3 TB | 20$
4 | GB | 2 Core | 60 GB | 4 TB | 40$
5 | GB | 4 Core | 80 GB | 5 TB | 80$
6 | GB | 8 Core | 160 GB | 6 TB | 160$
7 | GB | 12 Core | 320 GB | 7 TB | 320$
8 | GB | 16 Core | 480 GB | 8 TB | 480$
9 | GB | 20 Core | 640 GB | 9 TB | 640$

Feedback (Timestamp | Name | Email | Feedback):
:00 | vaishali.m | mvaishali1996@gmail.com | It would be even more useful and clear if we had the option of working it live... and practical
:00 | yukti sharma | dana.pink@yahoo.co.in | it was good. need more time to elaborate. next time try for longer instances
:00 | subhash | subhashthaya@gmail.com | it is very nice and i have learnt many things about google design cloud during this session
:00 | akhil vemuri | micheal.akhil.king@gmail.com | good but expecting more
:01 | Devang P Patel | patell.deva@gmail.com | the cloud was good
:01 | H.VIGNESH | vickynarrayanan@gmail.com | awesome experience we had but time is not enough!!!
:01 | vidya varsshini | vidyavars1995@gmail.com | its very nice but can make it even more better
:01 | subathra.r | subaramachandran7@gmail.com | good... but not so effective

:01 | sruthi r | sruthi.rdsk.1996@gmail.com | nice.. but little bit lag
:01 | SUSHMITHA.U | sushmiudhay@gmail.com | could have been more effective if it was more storage
:01 | vijayashankar | vijayashankarenginer@outlook.com | good to use this pack. Great experience
:01 | venkatraman.d | vengatvenkie@gmail.com | its nice cloud and very good one for improving our knowledge

Plans (id | memory | processor | ssd | transfer | price):
1 | 512MB | 1 Core | 20 GB | 1 TB | 5$
2 | GB | 1 Core | 30 GB | 2 TB | 10$
3 | 2GB | 2 Core | 40 GB | 3 TB | 20$
4 | GB | 2 Core | 60 GB | 4 TB | 40$
5 | GB | 4 Core | 80 GB | 5 TB | 80$
6 | GB | 8 Core | 160 GB | 6 TB | 160$
7 | GB | 12 Core | 320 GB | 7 TB | 320$
8 | GB | 16 Core | 480 GB | 8 TB | 480$
9 | GB | 20 Core | 640 GB | 9 TB | 640$

Output
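The plan-analysis step above can be mimicked in plain Python. This is an illustrative sketch only: the sample plan values are invented, and the comparison between a plan's size and the counted row total (here >=) is our assumption, since the operator is garbled in the source query.

```python
# Illustrative Python analogue of the Hive 'analysis' query: keep the plans
# whose 'size' field can accommodate the row count produced by the Pig job.

plans = [
    {"id": 1, "price": 5,   "size": 10},   # too small for this file
    {"id": 3, "price": 20,  "size": 50},   # large enough
    {"id": 9, "price": 640, "size": 5},    # too small
]
filedetails = {"totalrow": "16"}  # row count written by the Pig script

def analyze(plans, filedetails):
    """Filter plans, mirroring: a.size >= cast(b.totalrow as int)."""
    rows = int(filedetails["totalrow"])
    return [p for p in plans if p["size"] >= rows]

print([p["id"] for p in analyze(plans, filedetails)])  # [3]
```

Joining the surviving plans with the feedback table, as the second HQL query does, would then just be a second filter on matching plan ids.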

6 Conclusion

Running a MapReduce Hadoop job on a public cloud such as Amazon EC2 necessitates a performance model to estimate the job execution time and, further, to provision a certain amount of resources for the job to complete within a given deadline. This paper has presented an improved HP model to achieve this goal, taking into account multiple waves of the shuffle phase of a Hadoop job. The experimental results showed that the improved HP model outperforms both Starfish and the HP model in job execution estimation and resource provisioning. One piece of future work would be to consider the dynamic overhead of the VMs involved in running the user jobs, to minimize resource overprovisioning.

References

[1]
[2] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A Self-tuning System for Big Data Analytics," in CIDR, 2011.
[3] H. Herodotou, F. Dong, and S. Babu, "No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics," in Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11), 2011.
[4] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, 2011.
[5] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6.
[6] H. Herodotou, "Hadoop Performance Models." [Online]. Available: [Accessed: 22-Oct-2013].
[7] T. White, Hadoop: The Definitive Guide, O'Reilly Press.

[8] E. Capriolo, D. Wampler, and J. Rutherglen, Programming Hive, O'Reilly, 2012.
[9] Z. Zhang, L. Cherkasova, and B. T. Loo, "Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments," in Proceedings of the 2013 IEEE Sixth



Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Copyright 2012, Oracle and/or its affiliates. All rights reserved. 1 Big Data Connectors: High Performance Integration for Hadoop and Oracle Database Melli Annamalai Sue Mavris Rob Abbott 2 Program Agenda Big Data Connectors: Brief Overview Connecting Hadoop with Oracle

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

Batch Inherence of Map Reduce Framework

Batch Inherence of Map Reduce Framework Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287

More information

Available online at ScienceDirect. Procedia Computer Science 98 (2016 )

Available online at   ScienceDirect. Procedia Computer Science 98 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 554 559 The 3rd International Symposium on Emerging Information, Communication and Networks Integration of Big

More information

Join Processing for Flash SSDs: Remembering Past Lessons

Join Processing for Flash SSDs: Remembering Past Lessons Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics Technical Report CS-2011-06, Computer Science, Duke University Herodotos Herodotou Duke University hero@cs.duke.edu

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

LITERATURE SURVEY (BIG DATA ANALYTICS)!

LITERATURE SURVEY (BIG DATA ANALYTICS)! LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Best Practices and Performance Tuning on Amazon Elastic MapReduce Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

MC 2 : Map Concurrency Characterization for MapReduce on the Cloud

MC 2 : Map Concurrency Characterization for MapReduce on the Cloud MC 2 : Map Concurrency Characterization for Map on the Cloud Mohammad Hammoud and Majd F. Sakr Carnegie Mellon University in Qatar Education City, Doha, State of Qatar Emails: {mhhammou, msakr}@qatar.cmu.edu

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

Performance Modeling and Optimization of Deadline-Driven Pig Programs

Performance Modeling and Optimization of Deadline-Driven Pig Programs Performance Modeling and Optimization of Deadline-Driven Pig Programs ZHUOYAO ZHANG, HP Labs and University of Pennsylvania LUDMILA CHERKASOVA, Hewlett-Packard Labs ABHISHEK VERMA, HP Labs and University

More information

Statistics Driven Workload Modeling for the Cloud

Statistics Driven Workload Modeling for the Cloud UC Berkeley Statistics Driven Workload Modeling for the Cloud Archana Ganapathi, Yanpei Chen Armando Fox, Randy Katz, David Patterson SMDB 2010 Data analytics are moving to the cloud Cloud computing economy

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

A Comparison of MapReduce Join Algorithms for RDF

A Comparison of MapReduce Join Algorithms for RDF A Comparison of MapReduce Join Algorithms for RDF Albert Haque 1 and David Alves 2 Research in Bioinformatics and Semantic Web Lab, University of Texas at Austin 1 Department of Computer Science, 2 Department

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Resource and Performance Distribution Prediction for Large Scale Analytics Queries Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia

More information

SoftNAS Cloud Performance Evaluation on AWS

SoftNAS Cloud Performance Evaluation on AWS SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

A Non-Relational Storage Analysis

A Non-Relational Storage Analysis A Non-Relational Storage Analysis Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Cloud Computing - 2nd semester 2012/2013 Universitat Politècnica de Catalunya Microblogging - big data?

More information

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861 International Conference on Emerging Trends in IOT & Machine Learning, 2018 TOOLS

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Enhanced Hadoop with Search and MapReduce Concurrency Optimization Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms , pp.289-295 http://dx.doi.org/10.14257/astl.2017.147.40 Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms Dr. E. Laxmi Lydia 1 Associate Professor, Department

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma

More information

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Ch.Srilakshmi Asst Professor,Department of Information Technology R.M.D Engineering College, Kavaraipettai,

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data 46 Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

More information

Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University

Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University D u k e S y s t e m s Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University Motivation We address challenges for controlling elastic applications, specifically storage.

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Comparative Analysis of Range Aggregate Queries In Big Data Environment Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information