
2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016)
ISBN: 978-1-60595-379-3

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform

Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c
1,2,3 Beijing University of Technology, China
a wanghc@emails.bjut.edu.cn, b chencai@bjut.edu.cn, c yliang@bjut.edu.cn
*Corresponding author

Keywords: Data Placement, MapReduce, HDFS, Replica, CloudSim.

Abstract. Data placement is one of the core technologies of the MapReduce-styled data processing platform and contributes much to its data processing efficiency. This paper proposes a dynamic data placement strategy, motivated by the fact that existing data placement techniques do not take the computing load on the data storage nodes into account and thereby reduce the ratio of localized processing of hot-spot data. The strategy takes the localized-access ratio of data and the remaining computing capacity of the nodes as new factors in the data placement decision. Performance evaluation results show that the proposed strategy outperforms the original strategy of the HDFS file system, reducing the average job execution time by up to 12%.

Introduction

The MapReduce-styled data processing platform (MapReduce platform for short) is one of the principal technologies in the field of massive data processing [1]. Data locality-aware processing on the MapReduce platform means that the massive volume of data is stored on the local disks of the computing nodes and that computing tasks are scheduled, as far as possible, to the nodes where the data they consume reside, so as to reduce the communication overhead caused by remote data access and improve data processing efficiency. Consequently, increasing the likelihood of local data processing is one of the main objectives of the MapReduce platform.
Data placement is one of the core technologies of the data processing platform: it determines how the data in the platform are distributed across all storage nodes in a reasonable and effective way. Existing data placement techniques mostly aim at improving the efficiency of data access and reducing the data I/O bottleneck. However, when applied to the MapReduce platform, they lead to poor data processing efficiency, because they neither consider the computing load on the data storage nodes nor preserve the ratio of localized processing of hot-spot data.

Aiming at this problem, a dynamic data placement strategy is proposed in this paper. The strategy takes the localized-access ratio of data and the remaining computing capacity of the nodes as new factors in the placement decision. It consists of two parts: the first evaluates nodes and data block replicas with the decision-making factors; the second builds migration tasks from the candidate set of replicas to be migrated and the candidate set of destination nodes. The strategy is implemented in a Master/Slave framework. Performance evaluation results show that the proposed strategy outperforms the original placement strategy of the HDFS file system (the most popular data management system for MapReduce platforms), reducing the average job execution time by up to 12%.

Background & Related Work

The Hadoop Distributed File System (HDFS) is one of the most widely used file systems in MapReduce platforms [2]. The research on dynamic data placement in this paper is based on HDFS.

HDFS is designed as a master-slave architecture with a single master: a typical HDFS cluster is composed of one Namenode and several Datanodes. Files stored in HDFS consist of several data blocks. For fault tolerance and concurrency, the blocks of a file are stored on different nodes in a multi-replica way. Under the default placement policy there are three replicas: one major replica is stored on the local node (or in the local rack) where the client operates, and the other two are stored on two randomly chosen nodes of a remote rack.

The data placement strategy is key to the reliability and performance of the file system in the MapReduce platform, and an optimized replica placement strategy effectively improves platform performance. The greedy data placement strategy proposed in [3] stores data according to storage capacity and data access frequency, while the strategy in [4] stores data according to the computing and processing ability of the nodes. The Cost-effective Dynamic Replication Management scheme (CDRM) proposed in [5] considers the availability of the data storage nodes and load balancing: according to a node's load and degree of availability, the number and distribution of replicas can be adjusted. Existing data placement techniques thus mostly aim at improving data access efficiency and reducing the data I/O bottleneck. In this paper, a dynamic data placement strategy is designed and implemented that adjusts the locations of data block replicas to raise the probability of localized data access, and thereby reduces MapReduce job execution time.
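The default three-replica policy described above can be sketched as follows. This is a simplified illustration: the rack and node names are hypothetical, and real HDFS applies further constraints (free space, node load) when choosing nodes.

```python
import random

def default_placement(client_rack, racks):
    """Simplified sketch of HDFS's default 3-replica placement:
    one replica on the client's rack, two on a single remote rack.
    `racks` maps a rack id to the list of node names on that rack."""
    # First (major) replica: a node in the rack where the client runs.
    first = random.choice(racks[client_rack])
    # Second and third replicas: two distinct nodes on one remote rack.
    remote_rack = random.choice([r for r in racks if r != client_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

racks = {
    "rack0": ["n00", "n01", "n02"],
    "rack1": ["n10", "n11", "n12"],
    "rack2": ["n20", "n21", "n22"],
}
placement = default_placement("rack0", racks)
```

Spreading the second and third replicas over a remote rack trades a little write bandwidth for resilience against the loss of an entire rack.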
Overview of Dynamic Data Placement

Figure 1. Architecture of Dynamic Data Placement in the MapReduce Platform. (Master node: decision-making information acquisition, node evaluation, data block replica evaluation, destination node selection, data block replica selection, migration task building, and the Namenode. Slave nodes: collection and prediction components beside the HDFS client, the NodeManager collecting CPU & RAM information, and the computing Datanode.)

Fig. 1 shows the dynamic data placement implemented in a Master/Slave framework. The light-colored components already exist in HDFS; the deep-colored components are newly added. The function of the slave nodes is to collect, compute and predict the decision-making information. During processing, the collection component obtains data block replica access information from the HDFS client and CPU and RAM information from the NodeManager; the collected information can be described as a 3-tuple (replica-access, CPU, RAM). The slave node then aggregates the collected information and predicts the statistics using grey prediction. The components on the master node are: decision-making information acquisition, which gathers the decision-making information from all slave nodes; data block replica evaluation, which calculates the migration value of each data block replica; node evaluation, which calculates the migration value of each node; destination node selection, which puts suitable nodes into the migration destination node candidate set; data block replica selection, which puts suitable data block replicas into the migration data block replica candidate set; and migration task building, which chooses the best destination node from the destination candidate set for each replica in the migration data block replica candidate set.
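The paper does not detail the grey prediction performed on the slave nodes; a minimal sketch, assuming the common GM(1,1) grey model applied to a short window of a collected metric, might look like this (the function name and the choice of model are assumptions):

```python
import math

def gm11_next(x0):
    """One-step-ahead forecast with the grey model GM(1,1), the kind of
    grey prediction a slave node could use to smooth collected metrics.
    x0: a short list of recent positive observations."""
    n = len(x0)
    x1 = [sum(x0[:i + 1]) for i in range(n)]           # accumulated series
    # Grey differential equation x0(k) + a*z1(k) = b, with background
    # values z1; here z holds -z1 so that y ~ a*z + b is a plain fit.
    z = [-(x1[k - 1] + x1[k]) / 2 for k in range(1, n)]
    y = x0[1:]
    # Solve the 2x2 least-squares normal equations for [a, b] by hand.
    szz = sum(v * v for v in z); sz = sum(z); sy = sum(y)
    szy = sum(v * w for v, w in zip(z, y)); m = n - 1
    det = szz * m - sz * sz
    a = (m * szy - sz * sy) / det
    b = (szz * sy - sz * szy) / det
    # Forecast x1(n+1) and difference back to the original series.
    c = x0[0] - b / a
    x1_pred = lambda k: c * math.exp(-a * k) + b / a
    return x1_pred(n) - x1_pred(n - 1)

# A series growing about 20% per step; the forecast lands near 20.6.
pred = gm11_next([10, 12, 14.4, 17.28])
```

GM(1,1) needs only a handful of samples, which suits per-window statistics collected on a slave node better than heavier time-series models would.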

Dynamic Data Placement Strategy

The basic idea of the proposed dynamic data placement strategy is to adjust the placement of data block replicas so as to improve the localized processing rate of hot-spot data. The strategy is composed of the definitions below and the data block replica migration algorithm. It is based on new decision factors that are collected as decision-making information from the slave nodes, so we first define the decision-making factors.

Definition 1. The accessing frequency of a data block replica is the number of times the replica is accessed in a time period. It can be expressed as:

R = Σ_i r_i.value, with r_i.value = 1 if t ≤ r_i.time ≤ t + T and r_i.value = 0 otherwise;

where r_i is a data-block-replica access event, r_i.value is the value assigned to the replica for event r_i, r_i.time is the time at which event r_i happens, t is the beginning of the time period, and T is the length of the time period.

Definition 2. The localized-access ratio of a data block replica, L, can be expressed as:

L = R_local / R;

where R_local is the number of localized accesses to the replica in the time period, and R is the total number of accesses to the replica in the same period.

Definition 3. The node remaining computing capacity consists of the average number of remaining CPU cores, C, and the average remaining RAM capacity, M, in the time period:

C = avg{ e_i.cvalue : t ≤ e_i.time ≤ t + T },  M = avg{ e_i.mvalue : t ≤ e_i.time ≤ t + T };

where e_i is a collection event for the remaining computing capacity, e_i.time is the time of the event, e_i.cvalue is the number of remaining CPU cores it reports, e_i.mvalue is the remaining RAM capacity it reports, and t and T are as in Definition 1.
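The decision factors of Definitions 1 to 3 can be computed directly from the events collected on the slave nodes. A minimal sketch follows; the event record shapes and the form of the normalized access value B are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class AccessEvent:
    """Replica-access record collected on a slave node (hypothetical shape)."""
    time: float
    local: bool  # True if the access was served on the node storing the replica

def access_stats(events, t, T):
    """Definitions 1 and 2: accessing frequency R and localized-access
    ratio L over the window [t, t + T]."""
    window = [r for r in events if t <= r.time <= t + T]
    R = len(window)                     # each in-window access counts as 1
    L = (sum(r.local for r in window) / R) if R else 0.0
    return R, L

def remaining_capacity(samples, t, T):
    """Definition 3: average remaining CPU cores C and RAM M over [t, t + T].
    `samples` is a list of (time, cores, ram_gb) collection events."""
    window = [(c, m) for (ts, c, m) in samples if t <= ts <= t + T]
    C = sum(c for c, _ in window) / len(window)
    M = sum(m for _, m in window) / len(window)
    return C, M

def replica_access_value(L, R_norm, k=1.0):
    """B = (L + k * R_norm) / (1 + k), with R_norm the normalized frequency."""
    return (L + k * R_norm) / (1 + k)

accesses = [AccessEvent(1.0, True), AccessEvent(2.0, False),
            AccessEvent(3.0, True), AccessEvent(99.0, True)]  # last one outside window
R, L = access_stats(accesses, t=0.0, T=10.0)                  # R = 3, L = 2/3
C, M = remaining_capacity([(1.0, 4, 8.0), (2.0, 2, 4.0)], t=0.0, T=10.0)
B = replica_access_value(L, R_norm=1.0)
```

Keeping the factors as plain per-window aggregates keeps the per-heartbeat cost on each slave node negligible.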
Evaluation of the Migration Value of the Data Block Replica and the Node

The migration value of a node can be expressed as:

V_node = α·C̄ + β·M̄, with α + β = 1;

where C̄ is the normalized value of C, M̄ is the normalized value of M, and α and β are the weights of C̄ and M̄ when evaluating the migration value of a node.

The migration value of a replica can be expressed as:

V_replica = α′·C̄ + β′·M̄ + γ·B, with α′ + β′ + γ = 1;

where C̄ and M̄ are as described above but computed for the node that stores the current data block replica; α′, β′ and γ are the weights of C̄, M̄ and B when evaluating the migration value of a replica; and B is the normalized access value of the data block replica, which can be defined as:

B = (L + k·R̄) / (1 + k);

where L is the localized-access ratio of the current data block replica, R̄ is the normalized accessing frequency of the current replica, and k is the weight of the access frequency when calculating B.

The Data Block Replica Migration Algorithm

The data block replica migration algorithm mainly includes three steps. The first step selects appropriate data block replicas into the migration data block replica candidate set: once every replica has been assigned a migration value by the evaluation, the replicas are sorted by migration value in ascending order, and each replica whose migration value is lower than a threshold is put into the candidate set, whose size is capped at MAX, a limit configured by the administrator. The second step builds the migration destination node candidate set: once every node has been assigned a migration value, each node whose migration value is higher than a threshold is put into the destination candidate set. The third step builds a migration task for every replica in the migration data block replica candidate set, as follows.

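The task-building step can be sketched in Python, as an illustrative reconstruction of the prose description that follows; the data shapes and the `skew` cutoff separating the two remote-access situations are assumptions:

```python
import random

def build_migration_tasks(candidates, dest_nodes, placement, skew=0.5):
    """For each replica in the migration candidate set, choose a destination
    from the destination candidate set. Each candidate carries its remote
    access history as a mapping datanode -> access count."""
    tasks = []
    for replica in candidates:
        hist = replica["remote_hist"]
        total = sum(hist.values())
        top_node, top_count = max(hist.items(), key=lambda kv: kv[1])
        # Situation 1: remote accesses concentrated on a few nodes,
        # so migrate toward the dominant requester if it is a candidate.
        if total and top_count / total >= skew and top_node in dest_nodes:
            dest = top_node
        else:
            # Situation 2: remote accesses spread evenly, so pick a
            # random node from the destination candidate set.
            dest = random.choice(list(dest_nodes))
        # Placement constraint (simplified to node distinctness here;
        # the strategy also separates racks): skip destinations already
        # holding a replica of this block.
        if dest not in placement[replica["block"]]:
            tasks.append((replica["id"], dest))
    return tasks

# Toy example: replica r1 of block b1 is mostly requested from n2.
candidates = [{"id": "r1", "block": "b1", "remote_hist": {"n2": 8, "n3": 1}}]
tasks = build_migration_tasks(candidates, {"n2", "n4"}, {"b1": {"n1"}})
```

Migrating toward the dominant remote requester is what converts a hot-spot replica's remote accesses into local ones.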
In the dynamic data placement strategy, a migration task is built for each replica in the migration data block replica candidate set. A replica's access history records its remote accesses as a Set<Datanode, count>. Two situations arise from this history. In the first, most remote accesses are requested by a small subset of remote nodes; in this case, the replica should be migrated to one of those nodes whenever possible. In the second, remote accesses are requested evenly across the remote nodes; in this case, a random node is chosen from the destination node candidate set. In addition, every migration must ensure that the replicas of one data block remain placed on different nodes and different racks.

Performance Analysis

Extension of CloudSim. CloudSim [6] is a simulation toolkit that, thanks to its natural flexibility, has grown steadily in popularity. This paper simulates a distributed file system on top of the CloudSim 3.0.3 project by adding an HDFSFile class, a Block class, a BlockReplica class, a Namenode class, a Datanode class, and so on. It further realizes a simulation of the MapReduce platform based on [7], adding an HDFSFile dependence to the Job object, a Block dependence to the Map task, and so on.

Simulation Experiment and Results Analysis. On the extended CloudSim simulation platform, this paper runs experiments with the proposed dynamic data placement strategy and compares the simulation results with those of the HDFS default data placement strategy. The extended CloudSim project's configuration comprises cluster, job and data configuration files. According to the cluster configuration files, 144 hosts of different performance are distributed over 10 racks, with the data transmission speed between racks set to 1 MB/s, the transmission speed between nodes in the same rack to 10 MB/s, and

the disk read speed to 133 MB/s. This paper examines the performance of the optimized strategy using two evaluation indices: average job execution time and local Map task ratio.

Table 1. Result Comparison of the Proposed Data Placement Strategy to the Original Strategy of HDFS.

Job submissions   Default data placement strategy      Dynamic data placement strategy added
per hour          Avg exec time [s]   Local Map [%]    Avg exec time [s]   Local Map [%]
640               96.50875            90.5982188       92.51971875         95.7611743
780               97.82455128         84.5551336       92.92219231         93.4136296
920               100.539163          76.6660861       93.21406522         89.5662046
1060              106.6621415         68.8642076       93.86072642         86.1862398
1200              111.170225          60.7274257       98.99741667         75.1564778

Figure 2. Performance Comparison of the Proposed Data Placement Strategy to the Original Strategy of HDFS (average job execution time in seconds and local Map task ratio in percent, plotted against job submissions per hour).

Jobs in the MapReduce platform are submitted by users while it runs. Job submission is a random event with a fixed average rate, meaning that the interval between two successive submissions obeys an exponential distribution. To evaluate the performance of our strategy properly, this paper compares the simulation results of the HDFS default data placement with those of the dynamic data placement extension under different numbers of job submissions per hour: 640, 780, 920, 1060 and 1200. Custom programs create the corresponding configuration files for each job count, with submissions satisfying a Poisson distribution. The results, listed in Table 1 and plotted in Fig. 2, demonstrate that for the same number of submitted jobs, the MapReduce platform with the dynamic data placement strategy outperforms the platform without it in both average job execution time and local Map task ratio. The optimization effect is best when 1060 jobs are submitted per hour; overall, applying the dynamic data placement strategy decreases the execution time by 12% at best and by 7.88% on average.
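The headline figures can be recomputed from the execution-time columns of Table 1:

```python
# Average job execution times [s] from Table 1, by submission rate
# (640, 780, 920, 1060, 1200 jobs per hour).
default = [96.50875, 97.82455128, 100.539163, 106.6621415, 111.170225]
dynamic = [92.51971875, 92.92219231, 93.21406522, 93.86072642, 98.99741667]

# Relative reduction in execution time at each submission rate.
gains = [(d - o) / d for d, o in zip(default, dynamic)]
best = max(gains)                  # largest gain occurs at 1060 jobs/hour
mean = sum(gains) / len(gains)
print(f"best {best:.1%}, mean {mean:.2%}")  # best 12.0%, mean 7.88%
```

The per-rate gains grow with load up to 1060 jobs/hour and then shrink slightly at 1200, consistent with the discussion above.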

Summary

This paper proposes a dynamic data placement strategy that increases the local execution ratio of data processing tasks and reduces the average job execution time in the MapReduce platform, improving the platform's overall performance. In future work, we will consider adding network conditions to the dynamic placement decision.

References

[1] Menon S P, Hegde N P. A survey of tools and applications in big data. Intelligent Systems and Control (ISCO), 2015 IEEE 9th International Conference on. IEEE, 2015.
[2] Shvachko K, Kuang H, Radia S, et al. The Hadoop distributed file system. Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010: 1-10.
[3] Zaman S, Grosu D. A distributed algorithm for the replica placement problem. IEEE Transactions on Parallel and Distributed Systems, 2011: 1455-1468.
[4] Xie J, Yin S, Ruan X, et al. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010: 1-9.
[5] Wei Q S, Veeravalli B, Gong B Z, et al. CDRM: A cost-effective dynamic replication management scheme for Cloud storage cluster. The 2010 IEEE International Conference on Cluster Computing, 2010: 188-196.
[6] Goyal T, Singh A, Agrawal A. CloudSim: simulator for cloud computing infrastructure and modeling. Procedia Engineering, 2012, 38(4): 3566-3572.
[7] Jung J, Kim H. MR-CloudSim: Designing and implementing MapReduce computing model on CloudSim. ICT Convergence (ICTC), 2012 International Conference on. IEEE, 2012: 504-509.