Mitigating Data Skew Using Map Reduce Application


Ms. Archana P.M, 4th sem, M.Tech (C.S.E), C.S.E Dept., M.S.E.C, V.T.U, Bangalore, India, archanaanil062@gmail.com
Mr. Malathesh S.H, Associate Professor, C.S.E Dept., M.S.E.C, V.T.U, Bangalore, India, mhavanur@gmail.com

Abstract- There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. MapReduce is an effective tool for parallel data processing, but one significant issue in practical MapReduce applications is data skew: an imbalance in the amount of data assigned to each task causes some tasks to take much longer to finish than others.

Keywords: Map Reduce, Data Skew, Sampling, Partitioning.

1 INTRODUCTION

Hadoop is an open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. An application built on the Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. MapReduce is an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open source implementation of MapReduce that enjoys wide adoption and is often used for short jobs. The completion time of a job in Hadoop depends on its slowest running task: if one task takes significantly longer to finish than the others, it can delay the progress of the entire job. Such stragglers can arise for various reasons, among which data skew is an important one. Data skew refers to the imbalance in the amount of data assigned to each task, and hence in the amount of work required to process that data. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and that tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers [2].

1.1 The main objectives of this paper are summarized as follows:
1) Implement the sampling method for general user-defined MapReduce programs; the method achieves a better approximation to the distribution of the intermediate data.
2) LIBRA can adjust the workload allocation and deliver improved performance even in the absence of data skew, when the performance of the underlying computing platform is not uniform.
3) Implement an innovative approach to balance the load among the reduce tasks, which supports the split of large keys when application semantics permit.
4) Implement LIBRA in Hadoop and evaluate its performance for some popular applications. LIBRA outperforms the alternatives; the results show that it can improve job execution time by up to a factor of 4.

2 LITERATURE SURVEY

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day [1]. The standard approach to handling skew in parallel systems is to assign an equal number of data values to each partition via hash partitioning or clever range partitioning. These strategies effectively handle data skew, which occurs when some nodes are assigned more data than others. Computation skew, more generally, results when some nodes take longer to process their input than other nodes, and can occur even in the absence of data skew, because the runtime of many scientific tasks depends on the data values themselves rather than simply on the data size. Histograms of the data distribution were also shown earlier to be optimal for join selectivity estimation, thus establishing their universality.

This paper presents a new strategy called LIBRA (Lightweight Implementation of Balanced Range Assignment) to solve the data skew problem for reduce-side applications in MapReduce. Compared to the previous work, our contributions include the following: We propose a new sampling method for general user-defined MapReduce programs. The method has a high degree of parallelism and very little overhead, and achieves a much better approximation to the distribution of the intermediate data. We use an innovative approach to balance the load among the reduce tasks, which supports the split of large keys when application semantics permit.

Figure 1 shows that with our LIBRA method, each reducer processes roughly the same amount of data.

Figure 1: Scenario of the proposed system. During the shuffle phase, the intermediate (key, value) pairs produced by the map tasks pass through the partitioners, which assign them to the reducers.

Data skew mitigation in LIBRA consists of the following steps:
1) A small percentage of the original map tasks are selected as sample tasks. They are issued first whenever the system has free slots; other ordinary map tasks are issued only when there is no pending sample task to issue.
2) Sample tasks collect statistics on the intermediate data during normal map processing and transmit a digest of that information to the master after they complete.
3) The master collects all the sample information to derive an estimate of the data distribution, makes the partition decision, and notifies the worker nodes.
4) Upon receipt of the partition decision, the worker nodes partition the intermediate data generated by the sample tasks and the already issued ordinary map tasks accordingly. Subsequently issued map tasks can partition their intermediate data directly without any extra overhead.
5) Reduce tasks can be issued as soon as the partition decision is ready. They do not need to wait for all map tasks to finish.

In a MapReduce system, a typical job execution consists of the following steps:
1) After the job is submitted to the MapReduce system, the input files are divided into multiple parts and assigned to a group of map tasks for parallel processing.
2) Each map task transforms its input (K1, V1) tuples into intermediate (K2, V2) tuples according to some user-defined map and combine functions, and outputs them to the local disk.
3) Each reduce task copies its input pieces from all map tasks, sorts them into a single stream by a multi-way merge, and generates the final (K3, V3) results according to some user-defined reduce function.
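
To make the (K1, V1) -> (K2, V2) -> (K3, V3) flow concrete, the following is a minimal word-count sketch written against the standard Hadoop Java API. It is an illustration only and is not taken from the paper; the class and method names are our own. The mapper turns (byte offset, line) pairs into (word, 1) pairs, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: (K1, V1) = (byte offset, line of text) -> (K2, V2) = (word, 1)
  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);          // emit an intermediate (K2, V2) tuple
      }
    }
  }

  // Reduce: (K2, list of V2) -> (K3, V3) = (word, total count)
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum)); // final (K3, V3) result
    }
  }
}
```

For word count, the same reducer class can also be registered as the combine function mentioned in step 2, since the summation is associative.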

2.1 Algorithm:

In this section we present the sampling and partitioning algorithm used in LIBRA. Our goal is to balance the load across the reduce tasks. The algorithm consists of three steps: 1) sample a subset of the map tasks; 2) estimate the intermediate data distribution; 3) apply the range partition. In the following, we describe the details of these steps.

Problem Statement

We first give a formulation of the problem. The intermediate data between the map and the reduce phases can be represented as a set of tuples (K_1, C_1), (K_2, C_2), ..., (K_n, C_n), where K_i represents a distinct key in the map output and C_i represents the number of tuples in the cluster of K_i. Without loss of generality, we assume that K_i < K_{i+1} in the above list. Our goal is then to come up with a range partition on the keys which minimizes the load of the largest reduce task. Let r be the number of reduce tasks. The range partition can be expressed as 0 = pt_0 < pt_1 < ... < pt_r = n, with reduce task i taking responsibility for the keys K_{pt_{i-1}+1}, ..., K_{pt_i}. Following the cost model proposed by previous work [3], [4], [5], we define the function Cost(C_i) as the computational complexity of processing the cluster K_i in the reduce tasks, which must be specified by the user. For example, the cost function of the sort application can be estimated as Cost(C_i) = C_i (for each cluster K_i, the reducers only need to output C_i tuples directly). For a reduce-side self-join application, the cost function should be Cost(C_i) = C_i^2, since the reducers need to output C_i tuples for each tuple in cluster K_i. By specifying the exact cost function, we can balance the execution time of each reducer one step further. The objective function can then be expressed as follows:

minimize  max over i = 1, ..., r of  ( sum of Cost(C_k) for k = pt_{i-1}+1, ..., pt_i )

Since the number of unique keys can be large, calculating the optimal solution to the above problem is unrealistic. Therefore, we present a distributed approximation algorithm based on sampling and estimation.
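
As a quick illustration of this cost model, the plain-Java sketch below evaluates the objective for a given partition: it takes the cluster sizes C_i, a user-specified cost function (identity for sort, square for a reduce-side self-join), and a candidate list of cut points pt, and reports the load of each reduce task and the maximum load. The cluster sizes and cut points are made-up example values, not data from the paper.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

public class PartitionCost {

  // Load of reduce task i: sum of Cost(C_k) over the clusters assigned to it,
  // i.e. array indices pt[i-1] .. pt[i]-1 (keys K_{pt_{i-1}+1} .. K_{pt_i}).
  static long[] reducerLoads(long[] clusterSizes, int[] pt, LongUnaryOperator cost) {
    long[] loads = new long[pt.length - 1];
    for (int i = 1; i < pt.length; i++) {
      long sum = 0;
      for (int k = pt[i - 1]; k < pt[i]; k++) {
        sum += cost.applyAsLong(clusterSizes[k]);
      }
      loads[i - 1] = sum;
    }
    return loads;
  }

  public static void main(String[] args) {
    // Hypothetical cluster sizes C_1..C_8, sorted by key.
    long[] c = {30, 8, 3, 3, 2, 1, 1, 1};
    // Cut points pt_0 < pt_1 < pt_2 < pt_3 for r = 3 reduce tasks (pt_0 = 0, pt_r = n).
    int[] pt = {0, 1, 4, 8};

    LongUnaryOperator sortCost = x -> x;         // Cost(C_i) = C_i   (sort)
    LongUnaryOperator selfJoinCost = x -> x * x; // Cost(C_i) = C_i^2 (self-join)

    for (LongUnaryOperator cost : new LongUnaryOperator[] {sortCost, selfJoinCost}) {
      long[] loads = reducerLoads(c, pt, cost);
      long max = 0;
      for (long load : loads) max = Math.max(max, load);
      System.out.println(Arrays.toString(loads) + "  max = " + max);
    }
  }
}
```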

Sampling Strategy

After a specific map task j is chosen for sampling, its normal execution is augmented with a lightweight sampling procedure. Along with the map execution, this procedure collects a statistic (K_i^j, C_i^j) for each key K_i^j in the output of this task, where C_i^j is the frequency (i.e., the number of records) of key K_i^j. Since the number of such (K_i^j, C_i^j) tuples can be on the same order of magnitude as the input data size, we keep only a sample set S_sample containing the following two parts:
- S_largest: the p tuples with the largest C_i^j;
- S_normal: q tuples randomly selected from the rest according to a uniform distribution (excluding tuples in S_largest).
The sampling task then transmits the following statistics to the master: the sample set S_sample = S_largest U S_normal, the total number of records (TR_j), and the total number of distinct clusters (TC_j) generated by this task. The size of the sample set, p + q, is constrained by the amount of memory and the network bandwidth at the master. The larger p + q is, the more accurate an approximation to the real data distribution we achieve. In practice, we find that a small p + q value already reaches a good approximation and brings negligible overhead.

Estimate Intermediate Data Distribution

After the completion of all sample map tasks, the master aggregates the sampling information from the above step to estimate the distribution of the data. It first combines all the sample tuples with the same key into one tuple (K_i, C_i) by adding up their frequencies. It then sorts these combined tuples by key to generate an aggregated list L. Suppose there are m maps chosen for sampling and S_sample^j is the sample set of map j. Then the aggregated list L is the sorted, key-wise combination of S_sample^1 U S_sample^2 U ... U S_sample^m. To calculate the total number of records TR, we simply sum up the record counts of all sample map tasks. However, calculating the total number of distinct clusters TC is harder, because clusters processed by different map tasks may share the same key and hence should not be counted twice. For example, assume that there are two sample map tasks and their sample sets are {(A, 10), (B, 5), (C, 3), (D, 2), (E, 2)} and {(A, 20), (B, 3), (D, 1), (F, 1), (H, 1)}, in which p = 2 and q = 3. By summing up the frequencies of the same key, the merged sample set S_sample is {(A, 30), (B, 8), (C, 3), (D, 3), (E, 2), (F, 1), (H, 1)}.

Range Partition

We adopt the above approximation to the data distribution to obtain an approximate solution to the range partition. We need to generate a list of partition points in the aggregated list L that minimizes the largest partition sum, i.e., the maximum over the reduce tasks of the total cost of the clusters assigned to that task. We use dynamic programming to solve this optimization problem: let F(i, j) represent the minimum value of the largest partition sum when cutting the first i items of L into j partitions. The recursive formulation of F(i, j) is then

F(i, j) = min over 0 <= k < i of max( F(k, j-1), sum of Cost(C_t) for t = k+1, ..., i ),

with the base case F(i, 1) = sum of Cost(C_t) for t = 1, ..., i. The partition decision can be derived from the optimal choices made while computing F(i, j).
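
The following plain-Java sketch walks through these three steps on the two example sample sets above: it builds a per-task digest (the p largest clusters plus q randomly chosen ones), merges the digests at the master into the aggregated list L, and runs the F(i, j) dynamic program to choose cut points. It is a simplified single-process illustration, not the LIBRA implementation; the cost function is the identity (the sort application) and r = 3 reducers is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

public class LibraSketch {

  // Per-task digest: the p clusters with the largest counts plus q clusters
  // sampled uniformly from the rest (S_largest U S_normal).
  static Map<String, Long> digest(Map<String, Long> clusters, int p, int q, Random rnd) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(clusters.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // largest first
    Map<String, Long> sample = new TreeMap<>();
    int top = Math.min(p, entries.size());
    for (int i = 0; i < top; i++) {
      sample.put(entries.get(i).getKey(), entries.get(i).getValue());
    }
    List<Map.Entry<String, Long>> rest = new ArrayList<>(entries.subList(top, entries.size()));
    Collections.shuffle(rest, rnd);
    for (int i = 0; i < Math.min(q, rest.size()); i++) {
      sample.put(rest.get(i).getKey(), rest.get(i).getValue());
    }
    return sample;
  }

  // Master side: combine tuples with the same key and sort by key (the list L).
  static TreeMap<String, Long> aggregate(List<Map<String, Long>> digests) {
    TreeMap<String, Long> merged = new TreeMap<>();
    for (Map<String, Long> d : digests) {
      d.forEach((k, c) -> merged.merge(k, c, Long::sum));
    }
    return merged;
  }

  // F(i, j) = min over k < i of max(F(k, j-1), cost of items k+1..i).
  // Returns the r-1 interior cut points (indices into L) minimizing the largest partition sum.
  static int[] rangePartition(long[] cost, int r) {
    int n = cost.length;
    long[] prefix = new long[n + 1];
    for (int i = 1; i <= n; i++) prefix[i] = prefix[i - 1] + cost[i - 1];

    long[][] f = new long[n + 1][r + 1];
    int[][] cut = new int[n + 1][r + 1];
    for (int i = 1; i <= n; i++) f[i][1] = prefix[i];            // base case F(i, 1)
    for (int j = 2; j <= r; j++) {
      for (int i = 1; i <= n; i++) {
        f[i][j] = Long.MAX_VALUE;
        for (int k = 1; k < i; k++) {
          long candidate = Math.max(f[k][j - 1], prefix[i] - prefix[k]);
          if (candidate < f[i][j]) { f[i][j] = candidate; cut[i][j] = k; }
        }
      }
    }
    int[] pts = new int[r - 1];
    for (int i = n, j = r; j > 1; j--) { pts[j - 2] = cut[i][j]; i = cut[i][j]; }
    return pts;
  }

  public static void main(String[] args) {
    // The two sample map tasks from the worked example (p = 2, q = 3).
    Map<String, Long> task1 = Map.of("A", 10L, "B", 5L, "C", 3L, "D", 2L, "E", 2L);
    Map<String, Long> task2 = Map.of("A", 20L, "B", 3L, "D", 1L, "F", 1L, "H", 1L);
    Random rnd = new Random(42);
    List<Map<String, Long>> digests =
        List.of(digest(task1, 2, 3, rnd), digest(task2, 2, 3, rnd));

    TreeMap<String, Long> aggregated = aggregate(digests);
    System.out.println("Aggregated list L: " + aggregated); // {A=30, B=8, C=3, D=3, E=2, F=1, H=1}

    long[] cost = aggregated.values().stream().mapToLong(Long::longValue).toArray(); // Cost(C_i) = C_i
    int[] cuts = rangePartition(cost, 3);
    System.out.println("Cut points (indices into L): " + Arrays.toString(cuts));
  }
}
```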

3. PERFORMANCE EVALUATION

A LIBRA program was run on the Grep application; we analyzed how the data was split across the various partitions and also checked the job execution time. The histogram of the number of keys processed at each reducer and the execution times for the different partitioners are shown below:

(Chart: number of keys processed at each reducer, reducers 1 to 15, for the Grep application with the LIBRA implementation versus the HashPartitioner, together with the time taken in seconds by the BinaryPartitioner and the KeyFieldBasedPartitioner.)

4. RESULTS

In this work we ran the program with the different partitioners and found that the new sampling strategy improves performance significantly: the data gets evenly distributed across all the reducers, thereby reducing the data skew. The job execution times also improve significantly.
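
The cut points chosen by the range partition ultimately have to be enforced during the shuffle phase. One way to do this in Hadoop is with a custom Partitioner; the sketch below is a minimal illustration (not the paper's code) that routes each key to the first range whose upper bound is not smaller than the key. The hard-coded cut keys are placeholder values; a real job would distribute the cut points computed by the master, for example through the job configuration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to a reducer according to precomputed range boundaries.
public class RangeCutPartitioner extends Partitioner<Text, IntWritable> {

  // Upper bounds of the first r-1 key ranges (placeholder values for illustration;
  // in practice these would come from the partition decision made by the master).
  private static final String[] CUT_KEYS = {"g", "p"};

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    int partition = 0;
    while (partition < CUT_KEYS.length && k.compareTo(CUT_KEYS[partition]) > 0) {
      partition++;
    }
    // Guard against jobs configured with fewer reduce tasks than ranges.
    return Math.min(partition, numPartitions - 1);
  }
}
```

Such a class would be registered on the job with job.setPartitionerClass(RangeCutPartitioner.class), with the number of reduce tasks set to the number of ranges.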

CONCLUSION

The MapReduce programming model has been successfully used at Google for many different purposes. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. Data skew mitigation is important in improving MapReduce performance. This paper has presented LIBRA, a system that implements a set of innovative skew mitigation strategies in an existing MapReduce system. One unique feature of LIBRA is its support for splitting large clusters and its adjustment for heterogeneous environments. In this sense, we can handle not only the data skew but also the reducer skew.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, January 2008.
[2] Y. Kwon, M. Balazinska, and B. Howe, "A study of skew in MapReduce applications," in Proc. of the Open Cirrus Summit, 2011.
[3] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "Skew-resistant parallel processing of feature-extracting scientific user-defined functions," in Proc. of the ACM Symposium on Cloud Computing (SoCC), 2010.
[4] S. Ibrahim, J. Hai, L. Lu, W. Song, H. Bingsheng, and Q. Li, "LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud," in Proc. of the IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2010.
[5] G. Benjamin, A. Nikolaus, R. Angelika, and K. Alfons, "Handling data skew in MapReduce," in Proc. of the International Conference on Cloud Computing and Services Science (CLOSER), 2011.