RESTORE: REUSING RESULTS OF MAPREDUCE JOBS Presented by: Ahmed Elbagoury

Outline Background & Motivation What is ReStore? Types of Result Reuse System Architecture Experiments Conclusion Discussion

Background
- MapReduce facilitates large-scale data analysis.
- Users often have complex tasks that cannot be conveniently expressed as a single MapReduce job.
- Such tasks are instead expressed in high-level query languages such as Pig, Hive, or Jaql.
(Figure from the "Comparing High Level MapReduce Query Languages" paper.)

Background: The compilers of these high-level query languages translate queries into workflows of MapReduce jobs.

Workflow of MapReduce Jobs
- Each job produces output that is stored in the distributed file system (DFS) used by the MapReduce system.
- These intermediate results are used as inputs by subsequent jobs in the workflow.
- These intermediate results are deleted from the DFS after the workflow finishes.

Reusing Intermediate Results
- Save the intermediate results so that future jobs can use them.
- Similar to materialized views in an RDBMS.
(Figure from I. Elghandour's presentation.)

Outline Background & Motivation What is ReStore? Types of Result Reuse System Architecture Experiments Conclusion Discussion

ReStore
- ReStore improves the performance of workflows of MapReduce jobs by storing the intermediate results of executed workflows and reusing them in future workflows.
- The system is not limited to queries that are executed concurrently, nor to sharing a single operator between multiple queries.
- Is sharing results important? Facebook stores the result of every query in its MapReduce cluster for seven days for sharing purposes.

Example of Two Queries
- Q1: Return the estimated revenue for each user viewing web pages.
- Q2: Return the total estimated revenue for each user viewing web pages, grouped by user name.

Example of Two Queries: The MapReduce workflow for query Q2 after rewriting it to reuse the output of query Q1.

Outline Background & Motivation What is ReStore? Types of Result Reuse System Architecture Experiments Conclusion Discussion

Types of Result Reuse
The total time to execute Job_n is
T_total(Job_n) = ET(Job_n) + max_{i ∈ Y} T_total(Job_i)
where ET(Job_n) is the execution time of Job_n itself and Y is the set of jobs that Job_n depends on.
Two types of reuse opportunities:
1. Whole job: reduces max_{i ∈ Y} T_total(Job_i).
2. Operators in jobs (sub-jobs): reduces ET(Job_n) in future jobs.

Reusing the Whole Job
T_total(Job_n) = ET(Job_n) + max_{i ∈ Y} T_total(Job_i)
- If the results of all dependent jobs of Job_n are stored in the system, then T_total(Job_n) = ET(Job_n).
- If only the results of a subset X ⊆ Y of dependent jobs are not stored, then T_total(Job_n) is reduced only if max_{i ∈ X} T_total(Job_i) is less than max_{i ∈ Y} T_total(Job_i).

Reusing Sub-jobs
ET(Job_n) = T_load + Σ_i ET(OP_i) + T_sort + T_store
A job's execution time is the time to load its input, execute its operators, sort the map output, and store the job output; reusing a sub-job's stored output avoids re-executing the operators that precede it.
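To make the cost model above concrete, here is a minimal sketch in Python (not ReStore's actual implementation; the Job, exec_time, and total_time names and the sample numbers are hypothetical) of how the two formulas could be evaluated, treating a dependency whose output is already stored as contributing nothing to the waiting time:

# Illustrative sketch of the cost model described above.
class Job:
    def __init__(self, name, t_load, op_times, t_sort, t_store, deps=()):
        self.name = name
        self.t_load = t_load          # time to load input from the DFS
        self.op_times = op_times      # ET(OP_i) for each operator in the job
        self.t_sort = t_sort          # time to sort the map output
        self.t_store = t_store        # time to store the job output
        self.deps = list(deps)        # jobs this job depends on (the set Y)

def exec_time(job):
    # ET(Job_n) = T_load + sum_i ET(OP_i) + T_sort + T_store
    return job.t_load + sum(job.op_times) + job.t_sort + job.t_store

def total_time(job, stored):
    # T_total(Job_n) = ET(Job_n) + max_{i in Y} T_total(Job_i);
    # a dependency whose output is already in the repository costs nothing.
    pending = [d for d in job.deps if d.name not in stored]
    waiting = max((total_time(d, stored) for d in pending), default=0)
    return exec_time(job) + waiting

# Example: if the slowest dependency's output is stored, only the
# remaining dependencies contribute to the max term.
j1 = Job("j1", 10, [5, 5], 3, 2)                 # ET = 25
j2 = Job("j2", 4, [2], 1, 1)                     # ET = 8
j3 = Job("j3", 6, [3, 3], 2, 2, deps=[j1, j2])   # ET = 16
print(total_time(j3, stored=set()))              # 16 + max(25, 8) = 41
print(total_time(j3, stored={"j1"}))             # 16 + 8 = 24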

We need to answer two questions:
1. How to rewrite a MapReduce job using stored intermediate results?
2. How to populate the repository with intermediate results?

Outline Background & Motivation What is ReStore? Types of Result Reuse System Architecture Experiments Conclusion Discussion

System Architecture
ReStore has three main components:
1. Plan Matcher and Rewriter
2. Sub-job Enumerator
3. Enumerated Sub-job Selector
The input is a workflow of MapReduce jobs. The outputs are the modified MapReduce jobs and the job outputs to store in the DFS.

1. Plan Matcher and Rewriter

1. Plan Matcher and Rewriter
The ReStore repository contains:
- Outputs of previous MapReduce jobs
- Physical query execution plans of these jobs
- Statistics about these MapReduce jobs
The goal is to find physical plans in the repository that can be used to rewrite the jobs of the input workflow.

Matching Algorithm (1)
- Matching and rewriting are performed on the physical plan, which keeps matching simple and robust and makes it easy to adapt ReStore to any dataflow system, regardless of the input language.
- A physical plan in the repository is considered a match if it is contained in the input MapReduce job.
- Matching is based on operator equivalence; two operators are equivalent if:
  - their inputs are pipelined from equivalent operators or read the same data sets, and
  - they perform functions that produce the same output data.

Matching Algorithm (2)
- Both plans are traversed simultaneously, starting from the load operators, until either mismatching operators are found or all operators of the repository plan have equivalent matches in the input MapReduce plan.
- The matched part of the input physical plan is replaced by a load operator that reads the output of the matched plan from the DFS.
- More than one plan in the repository can be used to rewrite the input job.

Matching Algorithm (3)
- The first match that ReStore finds is used to rewrite the input MapReduce job, so for this to be efficient and effective the physical plans in the repository must be ordered.
- Ordering physical plans in the repository:
  - Plan A comes before plan B if A subsumes B (all operators in plan B have equivalent operators in A).
  - If neither of A and B subsumes the other, they are ordered by the ratio between the size of the input data and the output data, and by the execution time of the MapReduce job.
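The following is a minimal sketch of the matching and rewriting idea, under the simplifying assumption that a physical plan is a linear chain of operators (ReStore works on general plan DAGs); the Op, rewrite_with, and rewrite names are illustrative, not ReStore's actual code:

# Illustrative sketch of prefix matching on physical plans, assuming each
# plan is a simple chain of operators.
class Op:
    def __init__(self, kind, args):
        self.kind = kind      # e.g. "LOAD", "FILTER", "JOIN", "GROUP"
        self.args = args      # parameters: data set name, predicate, keys, ...

    def equivalent(self, other):
        # Two operators are equivalent if they perform the same function on
        # the same (or equivalent) inputs; here: same kind and parameters.
        return self.kind == other.kind and self.args == other.args

def rewrite_with(input_plan, repo_plan, stored_output_path):
    """If repo_plan is contained in input_plan (here: is a prefix of it),
    replace the matched part with a LOAD of the stored output."""
    if len(repo_plan) > len(input_plan):
        return None
    for repo_op, in_op in zip(repo_plan, input_plan):
        if not repo_op.equivalent(in_op):
            return None   # mismatching operators found
    # All repository operators matched: read their result from the DFS
    # instead of recomputing the matched prefix.
    load = Op("LOAD", (stored_output_path,))
    return [load] + input_plan[len(repo_plan):]

def rewrite(input_plan, ordered_repo):
    # Repository plans are tried in their stored order; the first plan that
    # matches is used to rewrite the input job.
    for repo_plan, path in ordered_repo:
        new_plan = rewrite_with(input_plan, repo_plan, path)
        if new_plan is not None:
            return new_plan
    return input_plan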

We need to answer two questions:
1. How to rewrite a MapReduce job using stored intermediate results?
2. How to populate the repository with intermediate results?

Generating Candidate Sub-jobs

Generating Candidate Sub-jobs
- The outputs of whole MapReduce jobs and of some sub-jobs are saved.
- Materializing the outputs of all sub-jobs is infeasible: it would consume a substantial amount of storage in the DFS and slow down execution.
- Which sub-jobs should we choose?

Choosing Candidate Sub-jobs
ET(Job_n) = T_load + Σ_i ET(OP_i) + T_sort + T_store
Good sub-job candidates end with:
- Operators that reduce the size of their inputs, such as Filter and Project (their outputs are cheap to store and to load later).
- Expensive operators, such as Join and Group (their results are costly to recompute).

Heuristics for Choosing Candidate Sub-jobs
1. Conservative heuristic: store after operators that reduce their input size (Project and Filter).
2. Aggressive heuristic: store after operators that reduce their input size and after expensive operators (Project, Filter, Join, and Group).
The conservative heuristic imposes less overhead but creates fewer reuse opportunities.

Choosing Candidate Sub-jobs
For each operator in the physical plan:
- Check whether the chosen heuristic selects it.
- Inject a store operator after it (if it is not already a store operator).
- Split the flow into two sub-flows.
Generated sub-jobs are stored in the order that makes matching efficient. A sketch of this enumeration step follows below.
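A minimal sketch of this enumeration step, reusing the hypothetical Op representation from the matching sketch above; the heuristic sets mirror the conservative and aggressive choices described earlier, and the injected STORE operator is a simplified placeholder:

# Illustrative sketch of candidate sub-job enumeration on a chain-shaped plan.
CONSERVATIVE = {"FILTER", "PROJECT"}              # operators that reduce input size
AGGRESSIVE = CONSERVATIVE | {"JOIN", "GROUP"}     # plus expensive operators

def enumerate_sub_jobs(plan, heuristic=CONSERVATIVE):
    """Return candidate sub-jobs: for each operator selected by the
    heuristic, inject a STORE after it and cut the plan there."""
    candidates = []
    for i, op in enumerate(plan):
        if op.kind in heuristic and op.kind != "STORE":
            store = Op("STORE", ("candidate_%d" % i,))
            sub_job = plan[: i + 1] + [store]     # sub-flow ending at this operator
            candidates.append(sub_job)
    return candidates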

Enumerated Sub-job Selector
- Keeping all generated jobs and sub-jobs is expensive: it consumes storage space and leaves too many plans to match against in future workflows.
- ReStore therefore decides which outputs to keep; the decision is made after executing the workflow, based on the collected statistics.

Enumerated Sub-job Selector

Enumerated Sub-job Selector
Keep the output of a job if:
- It reduces the execution time when it is used: the size of its output is less than the size of its input.
- It reduces the execution time of the workflows that use it, per T_total(Job_n) = ET(Job_n) + max_{i ∈ Y} T_total(Job_i).
- It is actually used: frequency of usage within a time window.
An output is discarded when its input data set is deleted or modified. A sketch of this keep/evict decision follows below.
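A minimal sketch of such a keep/evict decision, using hypothetical per-output statistics of the kind described above (sizes, usage frequency within a time window, measured time savings, and whether the input changed); field and function names are assumptions, not ReStore's API:

# Illustrative keep/evict decision for a stored (sub-)job output, based on
# statistics collected after the workflow has run.
from dataclasses import dataclass

@dataclass
class OutputStats:
    output_size: int        # bytes written to the DFS
    input_size: int         # bytes read by the (sub-)job
    uses_in_window: int     # how often it was reused in the recent time window
    time_saved: float       # measured reduction in T_total when it was reused
    input_changed: bool     # input data set was deleted or modified

def keep_output(stats, min_uses=1):
    if stats.input_changed:
        return False                          # stale: input no longer matches
    if stats.output_size >= stats.input_size:
        return False                          # reading it would not be cheaper
    if stats.uses_in_window < min_uses:
        return False                          # not actually being used
    return stats.time_saved > 0               # it must shorten workflows that use it

# Example: a small, recently reused output that saved time is kept.
print(keep_output(OutputStats(10**8, 10**9, 3, 120.0, False)))   # True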

Outline Background & Motivation What is ReStore? Types of Result Reuse System Architecture Experiments Conclusion Discussion

Experiments
- ReStore is implemented as an extension to Pig 0.8.
- Experiments were run on a cluster of 15 nodes, each with four dual-core AMD Opteron CPUs, 8 GB of memory, and a 65 GB SCSI disk.
- PigMix benchmark with 150 GB and 15 GB data sizes.
- Synthetic data with 40 GB data size (200 million rows).

Reusing the Output of Whole Jobs
- Assumes all outputs needed for reuse are available, so this is the best result that can be achieved.
- Speedup is 9.2 with 0% overhead (no extra store operators are inserted).

Reusing the Output of Sub-jobs
- Average speedup is 24.4.
- Average overhead for injecting store operators is 1.6.

Reusing the Output of Sub-jobs
- 15 GB data set: overhead 2.5, speedup 3.0
- 150 GB data set: overhead 1.6, speedup 24.4
Reusing outputs of sub-jobs is more beneficial for larger data sizes.

Comparing the Heuristics (figures): execution time when reusing sub-jobs, and execution time with the extra store operators inserted.

Effect of Data Reduction: As the amount of data reduction due to projection or filtering decreases, the overhead increases and the speedup decreases.

Outline Background & Motivation What is ReStore? Types of Result Reuse System Architecture Experiments Conclusion Discussion

Conclusion
- ReStore: a system that reuses intermediate outputs of MapReduce jobs in a workflow to speed up future workflows.
- It creates additional reuse opportunities by storing the results of sub-jobs, chosen by an aggressive or a conservative heuristic.
- Implemented as part of the Pig system, with significant speedups on the PigMix benchmark.

Questions?

Discussion
- Will we need load balancing after injecting store operators?
- Could we share between concurrent workflows and keep the output in memory?
- Can we make better decisions if we know the workload?
- How can this be integrated with the pay-as-you-go paradigm?
  - Virtually infinite storage: a trade-off between storage cost and computing cost.
  - Decisions cannot be made only after storing; running time and storage needs would have to be predicted.
  - Use two-level storage: instead of removing a record, move it to the second-level storage.

Thanks

References
- Elghandour, Iman, and Ashraf Aboulnaga. "ReStore: Reusing Results of MapReduce Jobs." Proceedings of the VLDB Endowment 5.6 (2012): 586-597.
- Gates, Alan F., et al. "Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience." Proceedings of the VLDB Endowment 2.2 (2009): 1414-1425.
- Nguyen, Thi-Van-Anh, et al. "Cost Models for View Materialization in the Cloud." Proceedings of the 2012 Joint EDBT/ICDT Workshops. ACM, 2012.
- Stewart, Robert J., Phil W. Trinder, and Hans-Wolfgang Loidl. "Comparing High Level MapReduce Query Languages." Advanced Parallel Processing Technologies. Springer Berlin Heidelberg, 2011. 58-72.