SpongeFiles: Mitigating Data Skew in MapReduce Using Distributed Memory. Khaled Elmeleegy Turn Inc.


SpongeFiles: Mitigating Data Skew in MapReduce Using Distributed Memory. Khaled Elmeleegy, Turn Inc. (kelmeleegy@turn.com); Christopher Olston, Google Inc. (olston@google.com); Benjamin Reed, Facebook Inc. (br33d@fb.com) 1

Background. MapReduce is the primary platform for processing web and social-networking data sets. These data sets tend to be heavily skewed: natural skew such as hot news and hot people (e.g., holistic aggregation), and machine-learning data with catch-all values like an "unknown" topic or "unknown" city. Skew can overwhelm a node's memory capacity. 2

Data Skew: Harm and Solutions. Harm: skew is a major cause of slowdown for MapReduce jobs. Solutions: provide sufficient memory, or use data-skew avoidance techniques. 3

Solution 1: Provide Sufficient Memory. Method: provision every task with enough memory. Shortcomings: a task's memory needs are only known at run time, and the required memory may not be available on the node executing the task. Conclusion: this can mitigate spilling, but it is very wasteful. 4

Solution 2: Data-Skew Avoidance Techniques. Methods: skew-resistant partitioning schemes; skew detection and work-migration techniques. Shortcoming: UDFs may still be vulnerable to data skew. Conclusion: these alleviate some of the skew, but not all of it. 5

Sometimes we still have to resort to spilling. 6

Hadoop's Map Phase Spill. [Diagram of the map-side spill buffer, 100 MB by default: kvindices[] (3.75%) and kvoffsets[] (1.25%) index the partition/key/value records stored in kvbuffer[] (95%).] 7

Hadoop's Reduce Phase Spill. [Diagram: MapOutputCopier threads fetch map outputs into an in-memory buffer; the InMemFSMergerThread merges and spills them to local disk, and the LocalFSMerger merges the on-disk files.] 8

How can we share memory across tasks and nodes? 9

Here comes SpongeFiles: share memory within the same node, and share memory between peer nodes. 10

SpongeFiles: utilize remote idle memory; operate at the application level (unlike remote-memory paging); a single spilled object is stored in a single sponge file; a sponge file is composed of large chunks; it is used to complement a process's memory pool; and it is much simpler than a regular file, enabling fast reads and writes. 11

Design. No concurrent access: a single writer and a single reader. A sponge file does not persist after it is read, so its lifetime is well defined. No naming service is needed. Each chunk can reside in the machine's local memory, a remote machine's memory, a local file system, or a distributed file system. 12
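To make this design concrete, here is a minimal sketch of a sponge file as a chunked, write-once/read-once container. All names (SpongeFile, ChunkHandle, ChunkAllocator, CHUNK_SIZE) are illustrative assumptions, not the authors' actual API; the allocator behind the handles may place chunks in local memory, remote memory, or on disk.

```java
// Illustrative sketch only; not the SpongeFiles implementation.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

interface ChunkHandle {                  // opaque metadata locating one chunk
    void write(ByteBuffer data);         // store this chunk's bytes
    ByteBuffer read();                   // read them back
    void free();                         // return the chunk to its pool
}

interface ChunkAllocator {
    ChunkHandle allocate();              // null if this tier has no free chunk
}

final class SpongeFile {
    static final int CHUNK_SIZE = 8 * 1024 * 1024;   // large chunks (assumed size)
    private final List<ChunkHandle> chunks = new ArrayList<>();
    private final ChunkAllocator allocator;

    SpongeFile(ChunkAllocator allocator) { this.allocator = allocator; }

    // Single writer: append one spilled object, chunk by chunk.
    void write(ByteBuffer spilled) {
        while (spilled.hasRemaining()) {
            int n = Math.min(CHUNK_SIZE, spilled.remaining());
            ByteBuffer slice = spilled.duplicate();
            slice.limit(slice.position() + n);
            spilled.position(spilled.position() + n);
            ChunkHandle h = allocator.allocate();
            if (h == null) throw new IllegalStateException("no spill space");
            h.write(slice.slice());
            chunks.add(h);               // the handle list is the only "name" the file has
        }
    }

    // Single reader: the file does not persist after it is read.
    ByteBuffer readAndDelete() {
        List<ByteBuffer> parts = new ArrayList<>();
        for (ChunkHandle h : chunks) parts.add(h.read());
        int total = 0;
        for (ByteBuffer p : parts) total += p.remaining();
        ByteBuffer out = ByteBuffer.allocate(total);
        for (ByteBuffer p : parts) out.put(p);
        for (ChunkHandle h : chunks) h.free();   // lifetime ends here
        chunks.clear();
        out.flip();
        return out;
    }
}
```

A map or reduce task would spill an oversized in-memory buffer into such a file and read it back later within the same task, which is why a single writer and a single reader suffice.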

Local Memory Chunk Allocator. Effect: shares memory between tasks on the same node. Steps (see the sketch below): 1. Acquire the shared pool's lock. 2. Try to find a free chunk. 3. Release the lock and return the chunk handle (metadata). 4. Return an error if no free chunk exists. 13
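Continuing the hypothetical sketch above, a local-memory allocator following these four steps might look like the following. As a simplifying assumption, the shared pool is modeled here as one large in-process buffer guarded by a lock; the real system shares memory between separate task processes on the node.

```java
// Illustrative local shared-memory allocator; builds on the ChunkHandle /
// ChunkAllocator interfaces from the previous sketch.
import java.nio.ByteBuffer;
import java.util.BitSet;
import java.util.concurrent.locks.ReentrantLock;

final class LocalMemoryChunkAllocator implements ChunkAllocator {
    private final ByteBuffer pool;                       // the node's shared spill pool
    private final BitSet inUse;                          // one bit per chunk slot
    private final ReentrantLock lock = new ReentrantLock();
    private final int chunkSize;                         // should match SpongeFile.CHUNK_SIZE
    private final int numChunks;

    LocalMemoryChunkAllocator(int numChunks, int chunkSize) {
        this.pool = ByteBuffer.allocateDirect(numChunks * chunkSize);
        this.inUse = new BitSet(numChunks);
        this.chunkSize = chunkSize;
        this.numChunks = numChunks;
    }

    @Override
    public ChunkHandle allocate() {
        lock.lock();                                     // step 1: acquire the pool's lock
        try {
            int slot = inUse.nextClearBit(0);            // step 2: find a free chunk
            if (slot >= numChunks) {
                return null;                             // step 4: no free chunk, report failure
            }
            inUse.set(slot);
            return new LocalChunk(slot);                 // step 3: hand back the chunk's metadata
        } finally {
            lock.unlock();                               // step 3: release the lock
        }
    }

    private final class LocalChunk implements ChunkHandle {
        private final int slot;
        private int written;

        LocalChunk(int slot) { this.slot = slot; }

        public void write(ByteBuffer data) {
            written = data.remaining();
            view().put(data);
        }
        public ByteBuffer read() {
            ByteBuffer v = view();
            v.limit(written);
            return v;
        }
        public void free() {
            lock.lock();
            try { inUse.clear(slot); } finally { lock.unlock(); }
        }
        private ByteBuffer view() {                      // zero-copy window onto this chunk's slot
            ByteBuffer dup = pool.duplicate();
            dup.position(slot * chunkSize);
            dup.limit(slot * chunkSize + chunkSize);
            return dup.slice();
        }
    }
}
```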

Remote Memory Chunk Allocator. Effect: shares memory between tasks on peer nodes. Steps: 1. Get the list of sponge servers with free memory. 2. Find a server with free space (preferably on the same rack). 3. Write the data and get back a handle. 14
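A sketch of those three steps, again building on the illustrative interfaces above. SpongeServer and its methods (rack(), tryReserve(), send(), fetch(), release(), nextChunkId()) are assumed names standing in for whatever RPCs a sponge server would expose; they are not the paper's actual interfaces.

```java
// Illustrative remote-memory allocator, not the authors' implementation.
import java.nio.ByteBuffer;
import java.util.List;

interface SpongeServer {
    String rack();                              // which rack the server sits on
    boolean tryReserve(int bytes);              // reserve space for one chunk
    void send(long chunkId, ByteBuffer data);   // push the chunk's bytes over the network
    ByteBuffer fetch(long chunkId);             // pull them back later
    void release(long chunkId);                 // free the reservation
    long nextChunkId();
}

final class RemoteMemoryChunkAllocator implements ChunkAllocator {
    private final List<SpongeServer> servers;   // step 1: servers advertising free memory
    private final String localRack;
    private final int chunkSize;

    RemoteMemoryChunkAllocator(List<SpongeServer> servers, String localRack, int chunkSize) {
        this.servers = servers;
        this.localRack = localRack;
        this.chunkSize = chunkSize;
    }

    @Override
    public ChunkHandle allocate() {
        // Step 2: prefer a sponge server on the same rack that has free space.
        // (A fuller version would also consider off-rack servers before giving up.)
        for (SpongeServer s : servers) {
            if (s.rack().equals(localRack) && s.tryReserve(chunkSize)) {
                long id = s.nextChunkId();
                // Step 3: the handle writes the data remotely and remembers
                // where to read it back from.
                return new ChunkHandle() {
                    public void write(ByteBuffer data) { s.send(id, data); }
                    public ByteBuffer read()           { return s.fetch(id); }
                    public void free()                 { s.release(id); }
                };
            }
        }
        return null;   // caller falls back to the disk chunk allocator
    }
}
```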

Disk Chunk Allocator. Effect: the last resort, similar to spilling to disk. Steps: 1. Try the underlying local file system. 2. If the local disks have no free space, try the distributed file system. (A sketch of the full fallback chain follows.) 15
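Taken together, the three allocators form a fallback chain: local shared memory first, then a rack-local sponge server, then local disk, and finally the distributed file system. A small sketch of that chain, reusing the illustrative ChunkAllocator interface from the earlier sketches (the disk allocators are assumed to be further ChunkAllocator implementations):

```java
// Illustrative fallback chain over the tiers described on the last three slides.
import java.util.Arrays;
import java.util.List;

final class FallbackChunkAllocator implements ChunkAllocator {
    private final List<ChunkAllocator> tiers;

    FallbackChunkAllocator(ChunkAllocator localMemory,
                           ChunkAllocator remoteMemory,
                           ChunkAllocator localDisk,
                           ChunkAllocator distributedFs) {
        // Cheapest tier first; the disk tiers are the last resort.
        this.tiers = Arrays.asList(localMemory, remoteMemory, localDisk, distributedFs);
    }

    @Override
    public ChunkHandle allocate() {
        for (ChunkAllocator tier : tiers) {
            ChunkHandle h = tier.allocate();    // null means this tier is full
            if (h != null) return h;
        }
        throw new IllegalStateException("no spill space available in any tier");
    }
}
```

A task would then build one sponge file over this chain, e.g. new SpongeFile(new FallbackChunkAllocator(local, remote, localDisk, dfs)), so each spilled chunk lands in the cheapest tier that still has room.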

Garbage Collection. Tasks that stay alive delete their sponge files before they exit. For tasks that fail, sponge servers perform periodic garbage collection to reclaim their chunks. 16
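A sketch of the periodic sweep a sponge server might run for failed tasks. The taskIsAlive predicate stands in for whatever liveness check the framework offers (e.g. asking the job tracker); it is an assumption, not the paper's mechanism.

```java
// Illustrative garbage-collection sweep on a sponge server; reclaims chunks
// whose owning task is no longer alive. Uses the ChunkHandle interface from
// the earlier sketch.
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

final class SpongeServerGc {
    private final Map<String, List<ChunkHandle>> chunksByTask = new ConcurrentHashMap<>();
    private final Predicate<String> taskIsAlive;   // assumed liveness check

    SpongeServerGc(Predicate<String> taskIsAlive) { this.taskIsAlive = taskIsAlive; }

    // Called when a task allocates a remote chunk on this server.
    void register(String taskId, ChunkHandle chunk) {
        chunksByTask.computeIfAbsent(taskId, t -> new CopyOnWriteArrayList<>()).add(chunk);
    }

    // Tasks that finish normally delete their own sponge files; this periodic
    // sweep only covers tasks that died without cleaning up.
    void startSweeper(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            Iterator<Map.Entry<String, List<ChunkHandle>>> it = chunksByTask.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, List<ChunkHandle>> e = it.next();
                if (!taskIsAlive.test(e.getKey())) {
                    e.getValue().forEach(ChunkHandle::free);   // reclaim orphaned chunks
                    it.remove();
                }
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```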

Potential Weakness Analysis. Spilling to remote memory may increase the probability of task failure, since a task now also depends on the remote machines holding its spilled data. But the added failure probability is roughly N x t / MTTF, where N is the number of machines holding the task's spilled data, t is the task's running time, and MTTF is the per-machine mean time to failure (machine failure rates are on the order of 1% per month), so the increase is negligible. 17
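As a rough illustration (the specific numbers are assumptions, not from the slides): a 1% per-month machine failure rate corresponds to an MTTF of about 100 machine-months, roughly 73,000 hours. Even if a task's spilled data were scattered across all N = 30 machines of the evaluation cluster and the task ran for t = 1 hour, the added failure probability would be about 30 x 1 / 73,000, or roughly 0.04%, which is negligible next to other causes of task failure.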

Evaluation Setup. Microbenchmarks: 2.5 GHz quad-core Xeon CPUs, 16 GB of memory, 7200 RPM 300 GB ATA drives, 1 Gb Ethernet, Red Hat Enterprise Linux Server release 5.3, ext4 file system. Macrobenchmarks: Hadoop 0.20.2 on a 30-node cluster (2 map task slots and 1 reduce slot per node), Pig 0.7, on the hardware above. 18

Microbenchmarks. Spill a 1 MB buffer 10,000 times to memory and to disk. In memory (time in ms): local shared memory, 1; local memory through the sponge server, 7; remote memory over the network, 9. On disk (time in ms): disk, 25; disk with background I/O, 174; disk with background I/O and memory pressure, 499. 19

Microbenchmark conclusions: 1. Spilling to local shared memory is the least expensive. 2. Next comes spilling locally via the sponge server (more processing and multiple message exchanges). 3. Spilling to disk can be two orders of magnitude slower than spilling to memory. 20

Macrobenchmarks. The jobs and their data sets are run on two versions of Hadoop (the original, and one extended with SpongeFiles) and under two memory configurations: 4 GB and 16 GB. 21

Test 1: When memory is scarce, spilling to SpongeFiles performs better than spilling to disk. When memory is abundant, performance depends on the amount of data spilled and on the time difference between when the data is spilled and when it is read back. 22

Test 2: Using SpongeFiles reduces the first job's runtime by over 85% in the presence of disk contention and memory pressure (similar behavior is seen for the spam-quantiles job). For the frequent-anchor-text job, when memory is abundant, spilling to disk performs slightly better than spilling to SpongeFiles, even with disk contention. 23

Test 3: Not spilling at all performs best, and spilling to local sponge memory comes second, but spilling to SpongeFiles is the only practical option. 24

Related work: cooperative caching (for sharing), network memory (for small objects), remote paging systems (which operate at a different level). Conclusion: SpongeFiles is complementary to skew avoidance; it reduces job runtimes by up to 55% in the absence of disk contention and by up to 85% in its presence. 25