On the Performance of MapReduce: A Stochastic Approach

Similar documents
Around the Web in Six Weeks: Documenting a Large-Scale Crawl

Topology-Based Spam Avoidance in Large-Scale Web Crawls

Chapter 18 Indexing Structures for Files

On Asymptotic Cost of Triangle Listing in Random Graphs

Background: disk access vs. main memory access (1/2)

Processing of big data with Apache Spark

Database Applications (15-415)

Clustering Billions of Images with Large Scale Nearest Neighbor Search

EXTERNAL SORTING. Sorting

Chapter 12: Query Processing

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 373 OCTOBER 25 TH B-TREES

Stochastic Models of Pull-Based Data Replication in P2P Systems

Lower Bound on Comparison-based Sorting

CSE 326: Data Structures B-Trees and B+ Trees

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

C SCI 335 Software Analysis & Design III Lecture Notes Prof. Stewart Weiss Chapter 4: B Trees

Building an Inverted Index

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor

Database Applications (15-415)

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 14-1

Main Memory and the CPU Cache

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Indexes as Access Paths

IMR-Pathload: Robust Available Bandwidth Estimation under End-Host Interrupt Delay

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing:

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Algorithms for MapReduce. Combiners Partition and Sort Pairs vs Stripes

Chapter 18 Indexing Structures for Files. Indexes as Access Paths

External Sorting Implementing Relational Operators

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

VIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei

Big Data Analytics CSCI 4030

The Fusion Distributed File System

Chapter 18. Indexing Structures for Files. Chapter Outline. Indexes as Access Paths. Primary Indexes Clustering Indexes Secondary Indexes

Database files Organizations Indexing B-tree and B+ tree. Copyright 2011 Ramez Elmasri and Shamkant Navathe

B-Trees and External Memory

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

B-Trees and External Memory

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

External Sorting Sorting Tables Larger Than Main Memory

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

Recall from Tuesday. Our solution to fragmentation is to split up a process s address space into smaller chunks. Physical Memory OS.

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

L02 : 08/21/2015 L03 : 08/24/2015.

Lab Determining Data Storage Capacity

CSIT5300: Advanced Database Systems

Tree-Structured Indexes (Brass Tacks)

Chapter 12: Query Processing

Analytical Modeling of Parallel Programs

Extreme Computing. Introduction to MapReduce. Cluster Outline Map Reduce

Writing Parallel Programs; Cost Model.

Motivation for Sorting. External Sorting: Overview. Outline. CSE 190D Database System Implementation. Topic 3: Sorting. Chapter 13 of Cow Book

Advanced Database Systems

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

The Cost of Address Translation

R16 SET - 1 '' ''' '' ''' Code No: R

Why Study Multimedia? Operating Systems. Multimedia Resource Requirements. Continuous Media. Influences on Quality. An End-To-End Problem

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS 310: Memory Hierarchy and B-Trees

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

Link Lifetimes and Randomized Neighbor Selection in DHTs

Sorting & Aggregations

CSE 373 OCTOBER 23 RD MEMORY AND HARDWARE

MapReduce Design Patterns

On Sample-Path Staleness in Lazy Data Replication

Demystifying Service Discovery: Implementing an Internet-Wide Scanner

Physical Level of Databases: B+-Trees

Operating System Support for Multimedia. Slides courtesy of Tay Vaughan Making Multimedia Work

CS 245 Midterm Exam Solution Winter 2015

Hashing IV and Course Overview

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Lecture 13. Lecture 13: B+ Tree


Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

How t-digest works and why

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

A Comparison of Three Methods for Join View Maintenance in Parallel RDBMS

Firebird in 2011/2012: Development Review

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Tree-Structured Indexes

CSE 444: Database Internals. Lectures 5-6 Indexing

Compressing Integers for Fast File Access

Chapter 7 Multimedia Operating Systems

Project. CIS611 Spring 2014 SS Chung Due by April 15. Performance Evaluation Experiment on Query Rewrite Optimization

PARALLEL ID3. Jeremy Dominijanni CSE633, Dr. Russ Miller

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

EXTERNAL SORTING. CS 564- Spring ACKs: Dan Suciu, Jignesh Patel, AnHai Doan

Treelogy: A Benchmark Suite for Tree Traversals

Streaming Massive Environments From Zero to 200MPH

Transcription:

On the Performance of MapReduce: A Stochastic Approach Sarker Tanzir Ahmed and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M University October 28, 2014 1

Agenda Introduction Background Disk I/O Merge Overhead Runtime 2

Introduction MapReduce is a popular programming model for cluster computing, big data processing Resource constraints and data properties dictate MapReduce performance Specifically, RAM size and distribution of key frequency impact both the run-time and volume of disk I/O Existing literature is missing an accurate performance model for external-memory sorting/merging Common to assume a linear relationship between input size and processing overhead This includes disk spill, number of comparison in sort/merge 3

Introduction (2) Open questions: Is this dependency indeed linear with some constant factor converting input size to various metrics of interest? Does the constant stay the same in the full design space? Is the constant easy to obtain/estimate? Our objective: analyze a shared-memory MapReduce system with a single host for all computation Multiple CPU cores provide parallelism Data distribution is only through disks, not network Due to limited space, we analyze only the external merge-sort as the underlying algorithm 4

Agenda Introduction Background Disk I/O Merge Overhead Runtime 5

Background MapReduce platform (key, value) pairs sorted runs on disk Read input Sort Merge User functions Map Reduce Reduce output 6

Agenda Introduction Background Disk I/O Merge Overhead Conclusion 7

MapReduce Disk I/O Input is a stream of length T Entries are key-value pairs, each K+D bytes At time step t, one pair is processed by MapReduce Keys belong to a finite set V of size n Each key v is repeated I(v) times Disk I/O consists of: Input with T pairs (some duplicate) Output with n unique pairs Sorted runs of size L Total disk overhead is W = (K +D)(T + n + 2L) Our first goal is to derive L 8

MapReduce Disk I/O (2) Suppose RAM can hold m key-value pairs Let S t denote the seen set at time t and ² t = t/t be the fraction of input processed by t Then, k = dt/me is the number of sorted runs, where each contains E[jS m j] pairs on average Theorem 1: The expected size of the seen set at t is: Theorem 2: Disk spill L of a merge-sort MapReduce is: Total I/O overhead is thus: 9

MapReduce Disk I/O (3) Quite a complex function of m and I Verification on real graphs IRLbot host graph (640M nodes, 6.8B edges, 55 GB) WebBase web graph (667M nodes, 4.2B edges, 33 GB) 10

Agenda Introduction Background Disk I/O Merge Overhead Runtime 11

MapReduce Merge Overhead Selection tree for merging sorted runs, where each internal node Executes binary comparison Applies reduce operation De-duplication makes upper nodes perform less work than lower merged data depth d = log 2 k Theorem 3: The number of comparisons in a binary selection tree with k leaf nodes is: sorted runs 12

Merge Rate Evaluation Comparison with naïve model (no de-duplication): Comparison overhead and merge rate m dictated by k Higher k implies more de-duplication 13

Agenda Introduction Background Disk I/O Merge Overhead Runtime 14

MapReduce Runtime Total runtime is: disk speed slight discrepancy due to disk seeking during merge sort rate merge rate 15

Thank you! Questions? 16