LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud

LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud
Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li#
Huazhong University of Science and Technology; *Nanyang Technological University; #China Development Bank

Motivation
MapReduce is becoming very popular: Hadoop is widely used by enterprise and academia (Yahoo!, Facebook, Baidu, ...; Cornell, Maryland, HUST, ...). Today's data-intensive applications are widely diverse: search engines, social networks, scientific applications.

Motivation
Some applications experience data skew in the shuffle phase [1,2], yet current MapReduce implementations have overlooked the skew issue. Result: hash partitioning is inadequate in the presence of data skew. Our design: LEEN, locality- and fairness-aware key partitioning.
1. X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon, "Cloud technologies for bioinformatics applications," Proc. ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS 2009), ACM Press, Nov. 2009.
2. J. Lin, "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce," Proc. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR '09), Jul. 2009.

Outline
Motivation; hash partitioning in MapReduce; LEEN: locality- and fairness-aware key partitioning; evaluation; conclusion.

Hash Partitioning in Hadoop
Hadoop's current hash partitioning works well when keys appear equally often and are uniformly stored across the data nodes. In the presence of partitioning skew (variation in intermediate keys' frequencies, and variation in each intermediate key's distribution among the data nodes), the native blind hash partitioning is inadequate and leads to: network congestion, unfairness in reducers' inputs, reduce-computation skew, and performance degradation.

The Problem (Motivational Example)
Three data nodes each hold 18 intermediate records over keys K1-K6 (54 records in total). Keys are partitioned by the hash code of the intermediate key modulo the number of reducers:

                Data Node1   Data Node2   Data Node3
Data transfer       11           15           18       Total: 44/54
Reduce input        29           17            8       cv = 58%
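The slide's totals can be checked with a short sketch. The per-node key counts below are taken from the LEEN example table later in the deck, and the key-to-reducer mapping (K1, K4 to node 1; K2, K5 to node 2; K3, K6 to node 3) is an assumption, chosen because it reproduces the reported numbers:

```python
from statistics import mean, stdev

# Frequency of each key on each data node (18 records per node, 54 total).
counts = {
    "Node1": {"K1": 3, "K2": 5, "K3": 4, "K4": 4, "K5": 1, "K6": 1},
    "Node2": {"K1": 9, "K2": 1, "K3": 0, "K4": 3, "K5": 2, "K6": 3},
    "Node3": {"K1": 4, "K2": 3, "K3": 0, "K4": 6, "K5": 5, "K6": 0},
}
nodes = list(counts)

# Hash partitioning: reducer = hash(key) mod 3; assumed to send
# K1,K4 -> Node1, K2,K5 -> Node2, K3,K6 -> Node3.
assign = {f"K{i}": nodes[(i - 1) % 3] for i in range(1, 7)}

# Shuffle traffic: every record whose key hashes to another node is shipped.
transfer = {n: sum(c for k, c in kc.items() if assign[k] != n)
            for n, kc in counts.items()}

# Each reducer's input: the total frequency of the keys hashed to it.
reduce_input = {n: sum(counts[m][k] for m in nodes for k in assign
                       if assign[k] == n) for n in nodes}

cv = stdev(reduce_input.values()) / mean(reduce_input.values())
print(transfer)                # {'Node1': 11, 'Node2': 15, 'Node3': 18}
print(sum(transfer.values()))  # 44 (of 54)
print(reduce_input)            # {'Node1': 29, 'Node2': 17, 'Node3': 8}
print(f"cv = {cv:.1%}")        # cv = 58.5%
```

The reducer inputs diverge sharply (29 vs. 8), which is the roughly 58% coefficient of variation the slide reports.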

Example: WordCount
6 nodes, 2 GB data set, combine function disabled. Transferred data is relatively large: 83% of the map output is shuffled across the network. Data distribution is imbalanced: max-min ratio 20%, cv = 42%. (Figure: map-output and reduce-input sizes per data node, DataNode01-DataNode06, broken down into transferred data, local data, and data during failed reduce.)

Our Work
Asynchronous map and reduce execution, plus locality-aware and fairness-aware key partitioning: LEEN.

Asynchronous Map and Reduce Execution
Default Hadoop: several maps and reduces run concurrently on each data node, overlapping computation and data transfer. Our approach: keep track of all the intermediate keys' frequencies and distributions (using a DataNode-Keys Frequency Table). This could bring a little overhead, since the network is unutilized during the map phase, but it can speed up map execution because the complete disk-I/O resources are reserved for the map tasks (for example, average map-task execution time: 32 in default Hadoop vs. 26 with our approach).

LEEN Partitioning Algorithm
Extends the locality-aware concept to the reduce tasks and considers fair distribution of reducers' inputs. Results: balanced distribution of reducers' input; minimized data transfer during the shuffle phase; improved response time; a close-to-optimal tradeoff between data locality and reducers' input fairness (fairness measured in [0,1], locality in [0,100]).

LEEN Partitioning Algorithm (details)
Keys are sorted according to their fairness-locality value: FLK_i measures the fairness of K_i's distribution among the data nodes relative to its best locality. For each key, nodes are sorted in descending order of that key's local frequency, so the first candidate is the node with the best locality. The key is then partitioned using the fairness-score value: for a key K_i, if Fairness-Score(N_j) > Fairness-Score(N_j+1), move to the next node; else partition K_i to N_j.
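The slides state the rule but not full pseudocode. Below is a minimal sketch of the greedy loop, under two assumptions that reproduce the worked example on the next slide: FLK_i is the standard deviation of K_i's per-node frequencies divided by the share held by its most local node, and the fairness-score of a candidate placement is the standard deviation of the hypothetical per-node reducer inputs. Function and variable names are illustrative, not from the LEEN source:

```python
from statistics import pstdev

def fairness_locality(counts, key):
    """Assumed FLK formula: spread of a key's per-node frequencies,
    divided by the share held by its most local node."""
    freqs = [counts[n][key] for n in counts]
    return pstdev(freqs) / (max(freqs) / sum(freqs))

def leen_partition(counts):
    """counts[node][key] -> frequency of that key's records on that node.
    Returns a {key: node} assignment following the greedy rule above."""
    nodes = list(counts)
    keys = list(next(iter(counts.values())))
    # Running per-node reducer-input estimate; starts at each node's own data.
    load = {n: sum(counts[n].values()) for n in nodes}
    assign = {}
    # Keys with the highest fairness-locality value are partitioned first.
    for k in sorted(keys, key=lambda k: fairness_locality(counts, k),
                    reverse=True):
        total = sum(counts[n][k] for n in nodes)

        def score(cand):
            # Fairness-score: std of the per-node inputs if k goes to cand.
            hypo = {n: load[n] - counts[n][k] for n in nodes}
            hypo[cand] += total
            return pstdev(hypo.values())

        # Candidates in descending order of this key's local frequency;
        # keep moving down while the fairness-score improves.
        order = sorted(nodes, key=lambda n: counts[n][k], reverse=True)
        best = order[0]
        for nxt in order[1:]:
            if score(nxt) < score(best):
                best = nxt
            else:
                break
        assign[k] = best
        for n in nodes:          # update the running loads
            load[n] -= counts[n][k]
        load[best] += total
    return assign
```

On the three-node example, this sketch assigns K1 to Node2 and ships 24 of the 54 records overall, matching the slide's 24/54, versus 44/54 under hash partitioning.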

LEEN Details (Example)

          K1     K2     K3     K4     K5     K6    Total
Node1      3      5      4      4      1      1     18
Node2      9      1      0      3      2      3     18
Node3      4      3      0      6      5      0     18
Total     16      9      4     13      8      4
FLK     4.66   2.93   1.88   2.70   2.71   1.66

For K1, the candidate order by local frequency is N2, N3, N1. Assigning K1 to N2 yields per-node reducer inputs of 15, 25, 14 (Fairness-Score = 4.9); assigning it to N3 yields 15, 9, 30 (Fairness-Score = 8.8). Since the score does not improve, K1 is partitioned to N2. Final result: data transfer = 24/54, cv = 14%.
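The table's FLK values and the two fairness-scores can be reproduced numerically, assuming FLK is the population standard deviation of a key's per-node frequencies divided by its best node's share, and the fairness-score is the standard deviation of the hypothetical per-node reducer inputs (the slide appears to truncate rather than round, e.g. 4.66 vs. 4.667):

```python
from statistics import pstdev

# Frequency of each key on each data node, from the table above.
counts = {
    "Node1": {"K1": 3, "K2": 5, "K3": 4, "K4": 4, "K5": 1, "K6": 1},
    "Node2": {"K1": 9, "K2": 1, "K3": 0, "K4": 3, "K5": 2, "K6": 3},
    "Node3": {"K1": 4, "K2": 3, "K3": 0, "K4": 6, "K5": 5, "K6": 0},
}

# Assumed FLK formula: spread of the key's frequencies across nodes,
# divided by the share held by its most local node.
def flk(key):
    freqs = [counts[n][key] for n in counts]
    return pstdev(freqs) / (max(freqs) / sum(freqs))

print({k: round(flk(k), 2) for k in counts["Node1"]})
# -> {'K1': 4.67, 'K2': 2.94, 'K3': 1.89, 'K4': 2.7, 'K5': 2.72, 'K6': 1.66}

# Fairness-score of sending a key to a candidate node: std of the
# hypothetical per-node reducer inputs (each node starts with its own 18).
def fairness_score(key, cand):
    total = sum(counts[n][key] for n in counts)
    hypo = {n: 18 - counts[n][key] for n in counts}
    hypo[cand] += total
    return pstdev(hypo.values())

print(round(fairness_score("K1", "Node2"), 2))  # 4.97 -> K1 goes to Node2
print(round(fairness_score("K1", "Node3"), 2))  # 8.83
```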

Evaluation
Cluster of 7 nodes: each with two quad-core 2.33 GHz Intel Xeon CPUs, 8 GB of memory, and a 1 TB disk, running RHEL5 (kernel 2.6.22), Xen 3.2, and Hadoop 0.18.0. We designed 6 test sets that manipulate the partitioning-skew degree by modifying the existing textwriter code in Hadoop that generates the input data into HDFS.

Test Sets

                                   1        2        3        4        5        6
Nodes                            6 PMs    6 PMs    6 PMs    6 PMs    24 VMs   24 VMs
Data size                        14 GB    8 GB     4.6 GB   12.8 GB  6 GB     10.5 GB
Keys-frequency variation         230%     1%       117%     230%     25%      85%
Key-distribution variation (avg) 1%       195%     150%     20%      180%     170%
Locality range                   24-26%   1-97.5%  1-85%    15-35%   1-50%    1-30%

Set 1 exhibits keys-frequency variation, set 2 a non-uniform key distribution among the data nodes, and sets 3-6 partitioning skew (both).

Keys Frequencies Variation
Each key is uniformly distributed among the data nodes, while keys' frequencies vary significantly; locality range 24-26%. (Results figure omitted.)

Non-Uniform Key Distribution
Each key is non-uniformly distributed among the data nodes, while keys' frequencies are nearly equal; locality range 1-97.5%. (Results figure omitted.)

Partitioning Skew

Test set          3        4        5        6
Locality range    1-85%    15-35%   1-50%    1-30%

(Results figures omitted.)

Conclusion
Partitioning skew is a challenge for MapReduce-based applications, given today's diversity of data-intensive workloads (social networks, search engines, scientific analysis, etc.). It stems from two factors: significant variance in intermediate keys' frequencies, and significant variance in each intermediate key's distribution among the data nodes. Our solution extends the locality concept to the reduce phase: partition each key to the node holding it with the highest frequency, while keeping the distribution of reducers' inputs among the data nodes fair. Up to 40% improvement on a simple application example! Future work: apply LEEN to varying key and value sizes.

Thank you! Questions? shadi@hust.edu.cn http://grid.hust.edu.cn/shadi