Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments


Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments
Nikos Zacheilas, Vana Kalogeraki
Department of Informatics, Athens University of Economics and Business

Introduction
The Big Data era has arrived!
- Facebook processes more than 500 TB of data daily
- Twitter users generate 500M tweets per day
- Dublin's city operational center receives over 100 bus GPS traces per minute
Wide range of domains: traffic monitoring, inventory management, healthcare infrastructures.
More data than we can handle with traditional approaches (e.g., relational databases), so novel frameworks were proposed:
- Batch processing: Google's MapReduce, IBM's BigInsights, Microsoft's Dryad
- Stream processing: Storm, IBM's InfoSphere Streams

The MapReduce Model
MapReduce [Dean@OSDI2004] was proposed as a powerful and cost-effective approach for massive-scale batch processing. Popularized via its open-source implementation, Hadoop, it is used by some of the major computer companies: Yahoo!, Twitter, Facebook.
Intense processing jobs are broken into smaller tasks, with two stages of processing, map and reduce:
- map(k1, v1) -> [(k2, v2)]
- reduce(k2, [v2]) -> [(k3, v3)]
All intermediate (k2, v2) pairs assigned to the same reduce task are called that reduce task's partition.
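The two-stage model above can be illustrated with a minimal in-memory word count; `map_fn`, `reduce_fn`, and the driver are our own illustrative names, and a real framework runs these stages in parallel across machines rather than in one process:

```python
from collections import defaultdict

def map_fn(_key, line):       # map(k1, v1) -> [(k2, v2)]
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):  # reduce(k2, [v2]) -> [(k3, v3)]
    return [(word, sum(counts))]

def run_mapreduce(inputs):
    # Shuffle: group all (k2, v2) pairs by key; the pairs sharing a key
    # form that reduce task's partition.
    partitions = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            partitions[k2].append(v2)
    out = []
    for k2, values in sorted(partitions.items()):
        out.extend(reduce_fn(k2, values))
    return out

print(run_mapreduce([(0, "a b a"), (1, "b c")]))
# [('a', 2), ('b', 2), ('c', 1)]
```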

Processing Big Data with MapReduce: Challenges
- Load imbalances due to skewed data
- Heterogeneous environments with heterogeneous processing capabilities
- Real-time response requirements: 95% of Facebook's MapReduce jobs have an average execution time of 30 seconds [Chen@MASCOTS2011]
Example: a YouTube social graph application.

Problem
Question: How can we efficiently schedule the execution of multiple MapReduce jobs with real-time response requirements?
Challenges:
- Maximize the probability of meeting end-to-end real-time response requirements
- Effectively handle skewed data
- Identify overloaded nodes
- Deal with heterogeneous environments

DynamicShare System
We propose DynamicShare, a novel MapReduce framework for heterogeneous environments. Our approach makes the following contributions:
- A new job execution time estimation model based on non-parametric regression
- Distributed least-laxity-first scheduling of jobs' tasks to meet end-to-end demands
- Early identification of overloaded nodes through the Local Outlier Factor algorithm
- Handling data skewness with two approaches: Simple Partitions Assignment and Count-Min Sketch Assignment

The MapReduce Model (diagram)
Split files feed the map tasks (M) in the map phase; the intermediate (k, v) pairs they emit are grouped into partitions (P.1-P.5), and each partition is assigned to a reduce task (R.1-R.3) in the reduce phase, which produces the outputs.

DynamicShare Architecture
DynamicShare comprises a single Master and multiple Worker nodes.
- The Master node is responsible for assigning map and reduce tasks to Workers under skewness and real-time criteria, and for monitoring jobs' performance.
- Worker nodes execute map/reduce tasks in their map and reduce task slots and report task progress.

System Model
Each submitted job j comprises a sequence of invocations of map and reduce tasks. Each job j is characterized by:
- Deadline_j: the time interval, starting at job initialization, within which job j must be completed
- Proj_exec_time_j: the estimated amount of time required for the job to complete, given by:
  Proj_exec_time_j = max{m_{i,t}, ..., m_{k,t}} + max{r_{z,t}, ..., r_{l,t}}
- Laxity_j: the difference between Deadline_j and Proj_exec_time_j, considered a metric of the job's urgency
- split_size_j: the size of a split file
Each task t of job j has the following parameters:
- m_{i,t}, r_{i,t}: estimated execution times of map and reduce task t on Worker i
- cpu_{i,t}, memory_{i,t}: average CPU and memory usage of task t on Worker i

DynamicShare: How it works?
(Diagram) Jobs j1, j2, j3 are submitted to the Master, which estimates their execution times, computes their laxity values l1, l2, l3, and assigns partitions to map and reduce task slots on the Workers based on the reported partition sizes and laxity values; Workers execute the tasks on the split files and report progress back to the Master.

Task Scheduling
Given Deadline_j and Proj_exec_time_j for job j, we compute the laxity value with the following formula:
  Laxity_j = Deadline_j - Proj_exec_time_j
Least-laxity scheduling is a dynamic algorithm that allows us to compensate for queueing delays experienced by the tasks executing at the nodes. The TaskScheduler sorts jobs' tasks by their Laxity_j values, so tasks of jobs with smaller laxity values sit closer to the head of the queue. Scheduling decisions are made when:
1. New tasks are assigned to the TaskScheduler
2. Tasks finish or miss their deadlines
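The ordering above can be sketched with a priority queue keyed on laxity; the class and field names here are our own minimal illustration, not the actual DynamicShare structures:

```python
import heapq
import itertools

class LaxityScheduler:
    """Minimal least-laxity-first task queue (illustrative sketch)."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # stable tie-break for equal laxities

    def submit(self, task_id, deadline, proj_exec_time):
        # Laxity_j = Deadline_j - Proj_exec_time_j (both relative to job start)
        laxity = deadline - proj_exec_time
        heapq.heappush(self._heap, (laxity, next(self._tie), task_id))

    def next_task(self):
        # Called when a slot frees up: a task finished or missed its deadline.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

sched = LaxityScheduler()
sched.submit("j1-map-0", deadline=30, proj_exec_time=10)  # laxity 20
sched.submit("j2-map-0", deadline=15, proj_exec_time=12)  # laxity 3
print(sched.next_task())  # j2-map-0: the most urgent (smallest-laxity) task
```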

Estimating Task Execution Time
Current solutions such as building job profiles or using debug runs are not adequate: they work well for homogeneous environments, but what happens in heterogeneous environments where multiple applications may share the same resources? We need to take into account the resource requirements (e.g., CPU, memory usage) of newly submitted tasks.
The Execution Times Estimator approximates a function m(x) over the feature vector x = (split_size_j, cpu_{i,t}, memory_{i,t}) and outputs the estimated execution time m_{i,t}. Parametric regression considers the functional form known; non-parametric regression makes no such assumption (a data-driven technique).

Estimating Task Execution Time (cont.)
Non-parametric regression over all n past runs (x_1, y_1), ..., (x_n, y_n):
  m̂(x) = (1/n) * sum_{i=1}^{n} W_i(x) * y_i
k-nearest neighbor (k-NN) smoothing restricts the average to the k past runs closest to x in Euclidean distance:
  m̂(x) = (1/k) * sum_{i in N_k(x)} W_i(x) * y_i
The estimator's output m̂(x) is used as the task's estimated execution time m_{i,t}.
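A minimal sketch of the k-NN smoother: average the observed times of the k past runs closest (in Euclidean distance) to the new task's feature vector x = (split_size, cpu, memory). Uniform weights W_i = 1 are our simplifying assumption; the slide does not pin down the weighting kernel:

```python
import math

def knn_estimate(x, past_runs, k=3):
    # past_runs: list of (feature_vector, observed_execution_time) pairs
    by_dist = sorted(past_runs, key=lambda run: math.dist(x, run[0]))
    nearest = by_dist[:k]                       # k closest past runs
    return sum(y for _, y in nearest) / len(nearest)  # uniform-weight average

past = [((64, 0.5, 512), 10.0), ((64, 0.6, 512), 12.0),
        ((128, 0.9, 1024), 30.0), ((64, 0.4, 256), 11.0)]
print(knn_estimate((64, 0.5, 500), past, k=3))  # 11.0: mean of the 3 closest
```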

Identifying Overloaded Nodes
Due to the dynamic behavior of the jobs, Workers' performance may change rapidly, so we need to quickly detect overloaded Workers. We consider overloaded those nodes that are assigned more tasks than their processing capabilities allow.
Key observation: the laxity values of these tasks will lag behind those of the tasks running on other nodes.
Solution: apply the Local Outlier Factor (LOF) algorithm to the laxity values of the tasks of the same job that run on different Workers:
  LOF_l(lax_A) = (1 / |N_l(lax_A)|) * sum_{lax_B in N_l(lax_A)} lrd_l(lax_B) / lrd_l(lax_A)
LOF compares the local reachability density of a point with that of its neighbors.
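A self-contained sketch of LOF on one-dimensional laxity values, following the standard definitions (k-distance, reachability distance, local reachability density). The neighborhood size k is not specified on the slide, so k=2 here is an assumption:

```python
def lof_scores(values, k=2):
    n = len(values)
    def dist(a, b):
        return abs(values[a] - values[b])
    # k-distance and k-nearest neighborhood of each point
    neigh, kdist = {}, {}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(i, j))
        neigh[i] = others[:k]
        kdist[i] = dist(i, neigh[i][-1])
    # reachability distance and local reachability density (lrd)
    def reach(i, j):
        return max(kdist[j], dist(i, j))
    lrd = {i: k / sum(reach(i, j) for j in neigh[i]) for i in range(n)}
    # LOF: mean ratio of the neighbors' lrd to the point's own lrd
    return [sum(lrd[j] for j in neigh[i]) / (k * lrd[i]) for i in range(n)]

laxities = [10.0, 11.0, 10.5, 9.8, -25.0]  # the last task lags far behind
scores = lof_scores(laxities)
print(scores[-1] > max(scores[:-1]))  # True: the straggler has the highest LOF
```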

Handling Skewed Data
In our system two types of skew frequently occur: skewed key frequencies and skewed tuple sizes.
Idea: use more partitions than the original MapReduce.
Problem: how to assign partitions to the reduce tasks in order to minimize the reduce phase execution time?
We exploit two approaches:
- Simple Partitions Assignment
- Count-Min Sketch Assignment

Simple Partitions Assignment
1. Calculate the partition sizes (P_i)
2. Sort the partitions by size
3. Estimate the execution time (r.t_i) of assigning each partition to each of the available reduce tasks (estimated via k-NN smoothing)
4. Pick the reduce task (R_i) that requires the minimum execution time
(Diagram: the Master's Dynamic Partitioning algorithm assigns partitions P_1, P_2 from the map tasks to reduce tasks R_1, R_2 according to the estimated times r.t_1, r.t_2.)
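The four steps above can be sketched as a greedy assignment. We stand in a simple linear per-reduce-task cost model for the k-NN-smoothed estimates, so the costs and names here are illustrative only:

```python
def assign_partitions(partition_sizes, cost_per_unit):
    # cost_per_unit[r]: estimated seconds per unit of data on reduce task r
    # (a stand-in for the k-NN execution-time estimate on that Worker)
    loads = [0.0] * len(cost_per_unit)  # projected finish time per reduce task
    assignment = {}
    # Step 2: consider partitions largest first
    order = sorted(range(len(partition_sizes)),
                   key=lambda p: partition_sizes[p], reverse=True)
    for p in order:
        # Steps 3-4: pick the reduce task that finishes earliest with p added
        r = min(range(len(loads)),
                key=lambda r: loads[r] + partition_sizes[p] * cost_per_unit[r])
        loads[r] += partition_sizes[p] * cost_per_unit[r]
        assignment[p] = r
    return assignment, loads

sizes = [100, 80, 30, 10]  # partition sizes (step 1, assumed precomputed)
speeds = [0.1, 0.3]        # R1 runs on a 3x faster Worker than R2
assignment, loads = assign_partitions(sizes, speeds)
print(assignment)  # {0: 0, 1: 0, 2: 1, 3: 1}: fast node takes the big partitions
```

Note how, unlike a balance-oriented scheme, the fast Worker opportunistically absorbs the large partitions, which is exactly the behavior the evaluation slides attribute to DP.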

Count-Min Sketch Assignment
1. Calculate the partition sizes (P_i) for each hash function (h_i)
2. For each hash function, apply the Simple Partitions Assignment algorithm
3. Pick the hash function (h_i) that minimizes the reduce phase execution time
(Diagram: the Master's Dynamic Partitioning algorithm evaluates the partition sizes induced by hash functions h_1 and h_2 and assigns the resulting partitions to reduce tasks R_1, R_2.)
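The hash-selection idea can be sketched as follows: each candidate hash function induces a different grouping of keys into partitions, and we keep the hash whose partitioning is cheapest to reduce. As a simplification, the size of the largest partition (the makespan) stands in for the full Simple Partitions Assignment estimate in step 2, and the toy hash functions are our own:

```python
from collections import Counter

def partition_sizes(keys, hash_fn, num_partitions):
    # Step 1: sizes of the partitions induced by one candidate hash function
    sizes = Counter(hash_fn(k) % num_partitions for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

def best_hash(keys, hash_fns, num_partitions):
    # Steps 2-3, with the largest partition as a proxy for reduce-phase time
    best, best_makespan = None, float("inf")
    for h in hash_fns:
        makespan = max(partition_sizes(keys, h, num_partitions))
        if makespan < best_makespan:
            best, best_makespan = h, makespan
    return best, best_makespan

keys = ["a"] * 6 + ["b"] * 5 + ["c", "d", "e"]
h1 = lambda k: ord(k[0])       # collides the hot key "a" with "e"
h2 = lambda k: ord(k[0]) >> 1  # happens to keep the hot key "a" on its own
chosen, makespan = best_hash(keys, [h1, h2], num_partitions=4)
print(makespan)  # 6: the better hash avoids piling extra keys on the hot key
```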

Implementation
We implemented and evaluated DynamicShare on PlanetLab. Fourteen nodes with 82 processing cores in total were used: one dedicated node served as the Master and the others as Workers. Two MapReduce jobs were issued:
- A Twitter friendship-request query on 2 GB of available tweets (59 map and 23 reduce tasks)
- A YouTube friends-counting application on a 39 MB YouTube social graph (again 59 map and 23 reduce tasks)
We compared our scheduling proposal with:
- Earliest Deadline First (EDF)
- FIFO
- FAIR
Our partitioning algorithms were compared to:
- Load Balance [Gufler@CLOSER2011]
- Hadoop
- SkewTune [Kwon@SIGMOD2012]

Experiments k-nn Smoothing Performance Initially when not enough data are available, the estimated value is larger than the actual Better prediction when more past runs are used LOF Execution time LOF depends on the number of tasks used by a job Even for great number of tasks the algorithm is capable of detecting outliers in respectable amount of time Deadline misses comparison LLF maintains the percentage of deadline misses at the lowest possible level Takes into account the current system conditions for the assignment Nikos Zacheilas 20

Experiments (cont.)
Comparing LB with DP with respect to achieved balance:
- LB achieves better balance because it aims for a fair distribution of the partitions across the available reduce tasks
- DP does not consider balance in the assignment
Comparing DP with LB with respect to achieved execution time:
- Balance is not the right objective for heterogeneous environments
- DP's opportunistic assignment exploits high-performance nodes by assigning them extra partitions

Experiments (cont.)
Comparing DP with SkewTune and Hadoop partitioning:
- Hadoop's partitioning sends large partitions to slow nodes
- SkewTune's repartitioning cost is prohibitive for short jobs
- DP makes an appropriate one-time assignment
- Similar results were observed for the YouTube job
Comparing DP with and without sketches:
- DP with sketches achieves better results than DP without sketches, because more partition assignments are possible
- However, the overhead of the algorithm is not negligible: with sketches DP requires approximately 200 ms, while without sketches only 80 ms

Conclusions and Future Work
We proposed a new framework for handling MapReduce jobs with real-time constraints in highly heterogeneous environments, using:
- Non-parametric regression for estimating tasks' execution times
- Least Laxity First scheduling of jobs' tasks in the available slots
- Local Outlier Factor for detecting overloaded nodes
- Dynamic Partitioning algorithms for handling skewed data
We evaluated our proposal on PlanetLab, and the results show that our system achieves its goals.
Future work: dynamically decide the number of partitions and examine the trade-off between the reduce phase execution time and the two partitioning algorithms.

Thank you. Questions?