PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

ISSN: 0976-2876 (Print); ISSN: 2250-0138 (Online)

T. S. NISHA a1 AND K. SATYANARAYAN REDDY b
a Department of CSE, Cambridge Institute of Technology, VTU Belgaum, Bangalore-560036, India
b Department of ISE, Cambridge Institute of Technology, VTU Belgaum, Bangalore-560036, India

ABSTRACT

The Hadoop MapReduce framework has become a manageable, scalable and fault-tolerant framework for processing big data. The numbers of Map and Reduce tasks run decide the performance of big data computing. Usually, the number of Map tasks is decided by the number of data blocks available for processing, but there is no mechanism to decide the number of Reduce tasks; currently it is set by user configuration. Many such challenges exist in Hadoop, and Hadoop needs to be optimized. In this survey paper, we propose a profiling-based technique to find the optimum number of Reduce slots and the amount of memory for Reduce tasks, so that when Hadoop is configured with these optimum settings the job is able to complete successfully.

KEYWORDS: Hadoop MapReduce, Big Data, Data Blocks

INTRODUCTION

Hadoop is an open-source implementation of the MapReduce framework for big data processing. In the Map phase, the input data, in the form of data blocks, is processed by Map tasks to generate intermediate data; each Map task processes one data block. The intermediate data is then handled by Reduce tasks in the Reduce phase to deliver the final results. Reduce tasks bring the intermediate data chunks into memory to process them. Unlike the Map phase, where the number of tasks is determined by the number of data blocks, in the Reduce phase the number of tasks is determined by the user or the cluster administrator. The ability to manage intermediate data, as well as the choice of the number of Reduce tasks, significantly affects performance: if the memory operations can be managed well, the performance of the big data computing environment can be increased.

Most previous work on Hadoop optimization has considered only the storage or network aspects of intermediate data, or has focused solely on the ratio between Map and Reduce slots while disregarding the slots' internal configuration. While saturation of storage I/O or network bandwidth can degrade the performance of the Reduce phase, the MapReduce application is not killed by the Hadoop framework in such cases and continues its execution, although slowly. In the case of an out-of-memory error, however, the job is killed, since the Reduce phase needs to bring large portions of intermediate data into memory for processing. If there is not enough space left in memory, the Reduce tasks, and consequently the Reduce phase, will fail, which leads to job termination by Hadoop. This is a major difference between a shortage of memory and a shortage of other resources such as disk/network I/O or CPU in MapReduce applications, and it makes memory a significant challenge to conquer. As with other resources, if memory becomes the bottleneck, performance degrades even when the job is not killed. Out of memory is not the only reason for failures of MapReduce jobs; other factors such as disk failures, running out of disk space, and socket timeouts can also lead to failure. But such factors are induced by external effects, for example faults (disk failures, network timeouts) or a lack of resources such as disk space.
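For context, a minimal sketch of how these Reduce-side settings are normally supplied by the user: the property names below are standard Hadoop 2.x (YARN) keys, while the values, jar name and paths are illustrative assumptions only.

```python
# Illustrative sketch: in Hadoop 2.x (YARN), Reduce parallelism and memory
# come from user/cluster configuration, not from the data itself.
# Property names are standard; all values here are assumed examples.
reduce_side_settings = {
    "mapreduce.job.reduces": 8,                 # number of Reduce tasks (user-chosen)
    "mapreduce.reduce.memory.mb": 3072,         # YARN container size per Reduce task
    "mapreduce.reduce.java.opts": "-Xmx2560m",  # JVM heap inside that container
}

# Equivalent submission-time overrides (the driver must parse generic options):
#   hadoop jar job.jar MainClass \
#       -D mapreduce.job.reduces=8 \
#       -D mapreduce.reduce.memory.mb=3072 \
#       input/ output/
```

If the intermediate data outgrows the heap implied by such static values, the Reduce tasks hit exactly the out-of-memory failure described above.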
The problem of deciding the number of Reduce tasks, however, is internal to the operation of the application. To optimize the execution of MapReduce applications in the presence of failures, while keeping the impact on job completion time to a minimum, the authors of (Moise et al., 2011) relied on a fault-tolerant, concurrency-optimized data storage layer based on the BlobSeer data management service; they tested their approach with two distributed file systems, HDFS and their BlobSeer-based BSFS. In (Ruan et al., 2013), the authors analyzed an HPC platform. Their study examines two types of applications, a 3D time-series caries lesion assessment over a large-scale medical image dataset and an HTRC word-counting task for large-scale text analysis, running on XSEDE resources; their results demonstrate significant improvement in storage space, data stage-in time, and job execution time. In (Yu et al., 2015), the authors proposed a novel virtual shuffling strategy to enable efficient data movement and reduce I/O for MapReduce shuffling, thereby reducing power consumption and conserving energy. Their experimental results show that virtual shuffling significantly speeds up data movement in MapReduce and achieves faster job execution.

In (Chen et al., 2010), the authors developed a decision algorithm that helps MapReduce users identify when and where to use compression. Their framework also facilitates detailed exploration of several compression factors not examined in that work, such as a range of data compressibilities, different compression codecs, and resource contention between compression and the compute function of maps and reduces. In (Crume et al., 2012), the authors proposed SciHadoop, a slightly modified version of Hadoop; in Hadoop, mappers send data to reducers in the form of key/value pairs. With preliminary designs of multiple lossless approaches to compressing intermediate data, they showed that one approach results in up to five orders of magnitude reduction in the original key/value ratio. In (Guo et al., 2012), the authors investigate data locality in depth. They build a mathematical model of scheduling in MapReduce and theoretically analyze the impact of configuration factors, such as the numbers of nodes and tasks, on data locality. They found the default Hadoop scheduling to be non-optimal and proposed an optimal scheduling algorithm based on LSAP to give the best data locality. Three scenarios, single-cluster, cross-cluster and HPC-style setups, were discussed, and real Hadoop experiments were conducted. In (Zhang, 2015), the author introduced PRISM, a fine-grained resource-aware MapReduce scheduler that divides tasks into phases, where each phase has a constant resource usage profile, and performs scheduling at the phase level. Through experiments on a real MapReduce cluster running a wide range of workloads, they show that PRISM delivers up to 18% improvement in resource utilization while allowing jobs to complete up to 1.3× faster than with current Hadoop schedulers. In (Zaharia et al., 2008), the authors analyzed the problem of speculative execution in MapReduce and designed a simple, robust scheduling algorithm, LATE, which uses estimated finish times to speculatively execute the tasks that hurt the response time the most. In (Verma et al., 2011), the authors introduced a novel framework and technique to address this problem and to offer a new resource sizing and provisioning service in MapReduce environments; they validated the accuracy of their models using a set of realistic applications, with the predicted completion times of the generated resource provisioning options within 10% of the times measured on their 66-node Hadoop cluster. In (Nicolae et al., 2010), the authors substituted the original HDFS layer of Hadoop with a new, concurrency-optimized data storage layer based on the BlobSeer data management service, significantly improving the efficiency of Hadoop for data-intensive MapReduce applications. This work demonstrates that the default Hadoop Distributed File System (HDFS) layer can be replaced by a layer built along different design principles: BlobSeer is specifically optimized for data accessed under heavy concurrency, offering efficient concurrent appends, fine-grained access to massive distributed data, concurrent writes at random offsets, and versioning.
In this paper, we analyze challenges like this, which are caused by internal operations, and explore different solutions to them. We identify the pros and cons of each solution and finally propose a Reduce memory optimization solution that allows the job to execute successfully. Our solution is based on linear-modeling-based profiling: the job is first executed with different sizes of input data and the amount of intermediate data generated is measured. A linear model is then built between the input data size and the intermediate data size; based on this model, the Reduce memory needed for a large input dataset is calculated beforehand and the number of Reduce tasks is provisioned accordingly. In this way, the number of Reduce tasks and the Reduce memory are provisioned in advance and the job executes successfully with better CPU utilization. We implement the proposed solution, measure its performance in terms of CPU utilization against the default Hadoop implementation, and show that our solution achieves better CPU utilization than default Hadoop.

PROPOSED SOLUTION

The architecture of the proposed solution is shown in Figure 1. The Profiler takes the job jar file as input, test-runs it for various input sizes on the Hadoop environment, and measures the size of the intermediate data generated. The Optimizer module takes the input file on which the job jar is to be executed, calculates the optimum number of Reduce slots and the Reduce memory that will be consumed, and sets these values in the Hadoop environment, so that when the job jar is executed on the input file it runs with the optimum configuration and completes successfully.
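A minimal sketch of what the Profiler's linear model could look like follows. All function names, counter names and values are our own illustration, not code from the paper; the intermediate sizes are assumed to be read from the job's "Map output materialized bytes" counter after each test run.

```python
# Sketch of the Profiler's linear model (illustrative names and values).
# profile_runs holds (input_size_MB, intermediate_size_MB) pairs measured by
# test-running the job jar on small samples.
profile_runs = [(128, 45.0), (256, 93.0), (512, 180.0), (1024, 362.0)]

def fit_least_squares(points):
    """Fit intermediate = a * input + b by ordinary least squares."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_least_squares(profile_runs)

# Predict the intermediate volume for the real input, then provision enough
# Reduce tasks that each task's share of it fits in one Reduce heap.
input_mb = 20 * 1024                  # actual input size: 20 GB (assumed)
intermediate_mb = a * input_mb + b
heap_per_reduce_mb = 2560             # usable heap per Reduce task (assumed)
num_reduces = max(1, -(-int(intermediate_mb) // heap_per_reduce_mb))  # ceiling
print(f"predicted intermediate: {intermediate_mb:.0f} MB -> {num_reduces} reducers")
```

The same fit also yields the total Reduce memory to budget before the job is ever launched, which is the dimensioning information the conclusion refers to.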

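The Optimizer can then inject the computed values at submission time. A sketch, under the assumption that jobs are launched from the command line and the driver parses generic options (ToolRunner); the jar name, class name and HDFS paths are placeholders:

```python
import subprocess

# Sketch: apply the settings computed by the Profiler as -D overrides.
num_reduces = 8          # taken from the Profiler's model above
container_mb = 3072      # YARN container size per Reduce task (assumed)
heap_mb = 2560           # JVM heap inside the container (assumed)

cmd = [
    "hadoop", "jar", "job.jar", "MainClass",
    "-D", f"mapreduce.job.reduces={num_reduces}",
    "-D", f"mapreduce.reduce.memory.mb={container_mb}",
    "-D", f"mapreduce.reduce.java.opts=-Xmx{heap_mb}m",
    "/input", "/output",
]
subprocess.run(cmd, check=True)
```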
The implementation steps are shown in Figure 2.

Figure 1: Architecture of the proposed solution
Figure 2: Implementation steps of the proposed solution

CLASS DIAGRAM

The class diagram is shown in Figure 3.

Figure 3: Class diagram of the proposed solution

RESULTS

We implemented the proposed solution in Hadoop and measured two parameters: (1) execution time and (2) CPU utilization. Varying the input file size, we measured these two parameters and compared them against Hadoop without optimization; the results are shown in Figures 4 and 5.

Figure 4: Execution time
Figure 5: CPU utilization

APPENDIX

A summary of the surveyed papers is given below.

Paper | Advantages | Disadvantages
[1] | Able to handle Mapper failures and supports fault tolerance | Not scalable
[2] | Job execution time is faster | Application specific: only works well for mathematical and scientific applications
[3] | Achieves speed-up through virtual shuffling and also saves power consumption | With a high volume of intermediate key/value pairs, the shuffling overhead is very high
[4] | Uses compression codecs to compress intermediate key/values and optimizes memory consumption | Workload characteristics must be known a priori in this approach
[5] | Applies compression to reduce the volume of key/value pairs | Execution time overhead due to decompression is high
[6] | Achieves optimization based on locality | Not scalable for a multi-cluster setup
[7] | Speed-up achieved by applying phase-wise scheduling | The heuristics needed to split jobs into phases are not well defined
[8] | Applies speculation to speed up execution | Resource overhead is high
[9] | The efficiency of the system is better | Resources are reserved and job execution starts only after a long duration of profiling

CONCLUSION AND FUTURE ENHANCEMENT

We have implemented the proposed Hadoop Reduce memory optimization based on profiling and sizing. We executed the proposed solution for different volumes of input data and showed that the proposed system reduces execution time compared to non-optimized Hadoop. Since the system can provide the dimensioning values well ahead of execution, the system administrator can also scale up the system if sufficient resources are available. In the base paper, the intermediate data size is estimated from one-time profiling, but this is not valid for tasks where the amount of intermediate data generated varies with the volume of input data. To solve this problem, linear modeling is done over different volumes of data, with the intermediate size measured for each input volume; least-squares estimation then finds the best fit for estimating the intermediate size. This kind of modeling yielded the best estimates of intermediate data size.

REFERENCES

Moise D., Trieu T.-T.-L., Bougé L. and Antoniu G., 2011. Optimizing intermediate data management in MapReduce computations. In Proceedings of the First International Workshop on Cloud Computing Platforms, pp. 1-7. ACM.

Ruan G., Zhang H. and Plale B., 2013. Exploiting MapReduce and data compression for data-intensive applications. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, pp. 1-8. ACM.

Yu W., Wang Y., Que X. and Xu C., 2015. Virtual shuffling for efficient data movement in MapReduce. IEEE Transactions on Computers, 64(2):556-568.

Chen Y., Ganapathi A. and Katz R.H., 2010. To compress or not to compress: compute vs. I/O tradeoffs for MapReduce energy efficiency. In Proceedings of the First ACM SIGCOMM Workshop on Green Networking, pp. 23-28. ACM.

Crume A., Buck J., Maltzahn C. and Brandt S., 2012. Compressing intermediate keys between mappers and reducers in SciHadoop. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 7-12. doi:10.1109/SC.Companion.2012.12.

Guo Z., Fox G. and Zhou M., 2012. Investigation of data locality in MapReduce. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pp. 419-426. doi:10.1109/CCGrid.2012.42.

Zhang Q., 2015. PRISM: fine-grained resource-aware scheduling for MapReduce. IEEE Transactions on Cloud Computing, pp. 4-8.

Zaharia M., Konwinski A., Joseph A.D., Katz R.H. and Stoica I., 2008. Improving MapReduce performance in heterogeneous environments. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), 8:7-8.

Verma A., Cherkasova L. and Campbell R., 2011. Resource provisioning framework for MapReduce jobs with performance goals. In ACM/IFIP/USENIX Middleware, pp. 165-186.

Nicolae B., Moise D., Antoniu G. et al., 2010. BlobSeer: bringing high throughput under heavy concurrency to Hadoop Map/Reduce applications. In Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010).

Nisha T.S. and Reddy K.S., 2017. A survey on optimization in Hadoop big data environment. IJRDT, 7(6).