Sinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley


Communication is Crucial for Analytics at Scale
Facebook analytics jobs spend 33% of their runtime in communication.[1] As in-memory systems proliferate, the network is likely to become the primary bottleneck.
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011

Network Usage is Imbalanced
[CDFs of the imbalance of load across core-rack links, measured as the coefficient of variation of link utilization, for the Facebook and Bing clusters.[1,2]] More than 50% of the time, links have high imbalance (C_v > 1).
1. Imbalance considering all cross-rack bytes, calculated in 10 s bins.
2. Coefficient of variation, C_v = stdev/mean.
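
For concreteness, a minimal sketch (illustrative only, not from the paper; names are made up) of how the per-bin coefficient of variation of link load could be computed:

```java
// Illustrative sketch: coefficient of variation (C_v = stdev/mean) of the
// cross-rack bytes seen on each core-rack link within one measurement bin.
import java.util.Arrays;

public class ImbalanceMetric {
    // bytesPerLink: cross-rack bytes observed on each core-rack link in one 10 s bin
    static double coefficientOfVariation(long[] bytesPerLink) {
        double mean = Arrays.stream(bytesPerLink).average().orElse(0.0);
        if (mean == 0.0) return 0.0;
        double variance = Arrays.stream(bytesPerLink)
                                .mapToDouble(b -> (b - mean) * (b - mean))
                                .average().orElse(0.0);
        return Math.sqrt(variance) / mean;   // C_v > 1 indicates high imbalance
    }

    public static void main(String[] args) {
        long[] bin = {120L << 20, 30L << 20, 500L << 20, 80L << 20};  // hypothetical bin
        System.out.printf("C_v = %.2f%n", coefficientOfVariation(bin));
    }
}
```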

What Are the Sources of Cross-Rack Traffic?
Facebook: DFS writes 40%, DFS reads 14%, shuffle 46%.
Bing: DFS writes 54%, DFS reads 31%, shuffle 15%.
Write sources: 1. ingestion, 2. pre-processing, 3. job outputs.
1. DFS = Distributed File System

Distributed File Systems (DFS)
Pervasive in big data clusters (e.g., GFS, HDFS, Cosmos); many frameworks interact with the same DFS.
[Figure: blocks of a file spread across machines in Racks 1-3 under the core.]
Files are divided into blocks, 64 MB to 1 GB in size.
Each block is replicated: to 3 machines for fault tolerance, in 2 fault domains for partition tolerance, uniformly placed for balanced storage, with synchronous operations.

How to Handle DFS Flows?
DFS flows are only a few seconds long. Prior systems (Hedera, VLB, Orchestra, Coflow, MicroTE, DevoFlow, ...) treat the sources and destinations of these flows as fixed and exploit only the flexibility in paths and rates.

Distributed File Systems (DFS)
The key observation: replica destinations are also flexible. Replica locations do not matter as long as the constraints are met (3 replicas for fault tolerance, 2 fault domains for partition tolerance, uniform placement for balanced storage); only the source of each write is fixed.

Sinbad
Steers flexible replication traffic away from hotspots.
1. Faster writes, by avoiding contention during replication.
2. Faster transfers, due to more balanced network usage closer to the edges.

The Distributed Writing Problem is NP-Hard
Given blocks of different sizes and links of different capacities, place blocks to minimize the average block write time and the average file write time.
This mirrors job shop scheduling: given jobs of different lengths and machines of different speeds, schedule jobs to minimize the average job completion time.
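
One rough way to formalize the offline version (an illustrative assumption, not the paper's exact formulation): choose one destination link per block so as to minimize the average write time, where a file completes only when its last block does.

```latex
% Illustrative formalization (assumed, not verbatim from the paper).
% B: blocks, F: files, L: links; x_{b,l} = 1 iff block b is written through link l;
% t_b: finish time of block b, determined by the sizes of the blocks sharing
% its link and by that link's available capacity.
\begin{aligned}
\text{minimize}\quad & \frac{1}{|B|}\sum_{b \in B} t_b
  && \text{(average block write time), or} \\
                     & \frac{1}{|F|}\sum_{f \in F} \max_{b \in f} t_b
  && \text{(average file write time)} \\
\text{subject to}\quad & \sum_{l \in L} x_{b,l} = 1 \quad \forall b \in B,
  \qquad x_{b,l} \in \{0,1\}.
\end{aligned}
```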

The Online Distributed Writing Problem is NP-Hard
There is a lack of future knowledge about the locations and durations of network hotspots, and about the sizes and arrival times of new replica placement requests.

How to Make it Easy?
Assumptions: 1. link utilizations are stable; 2. all blocks have the same size.
Theorem: under these assumptions, greedy placement minimizes the average block/file write times.

How to Make it Easy? In Practice
Reality: 1. average link utilizations are temporarily stable;[1,2] 2. fixed-size large blocks write 93% of all bytes. These closely match the assumptions above.
1. Utilization is considered stable if its average over the next x seconds remains within ±5% of the initial value.
2. Typically, x ranges from 5 to 10 seconds; the time to write a 256 MB block at 50 MBps write throughput is 5 seconds.

Sinbad
Performs two-step greedy replica placement:
1. Pick the least-loaded link.
2. Send a block from the file with the fewest remaining blocks through the selected link.
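
A minimal sketch of this two-step heuristic (illustrative types and names, not Sinbad's actual code):

```java
// Illustrative sketch of the two-step greedy heuristic; class and field
// names are hypothetical, not taken from Sinbad's implementation.
import java.util.*;

class GreedyPlacer {
    static class Link { String id; double estimatedUtilization; }
    static class PendingFile { String name; Deque<String> remainingBlocks = new ArrayDeque<>(); }

    // Step 1: pick the least-loaded link among candidates that already
    //         satisfy the replication constraints (fault domains, balance).
    Link pickLink(List<Link> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(l -> l.estimatedUtilization))
                .orElseThrow();
    }

    // Step 2: among files with blocks waiting to be written, serve the file
    //         with the fewest remaining blocks first.
    PendingFile pickFile(List<PendingFile> files) {
        return files.stream()
                .min(Comparator.comparingInt(f -> f.remainingBlocks.size()))
                .orElseThrow();
    }

    // One placement decision: route the next block of the chosen file
    // through the chosen link.
    String place(List<Link> candidateLinks, List<PendingFile> pendingFiles) {
        Link link = pickLink(candidateLinks);
        PendingFile file = pickFile(pendingFiles);
        String block = file.remainingBlocks.poll();
        return "block " + block + " of " + file.name + " -> link " + link.id;
    }
}
```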

Sinbad Overview
Centralized master-slave architecture: the Sinbad master is collocated with the DFS master, Sinbad slaves are collocated with the DFS slaves, and the slaves periodically report information to the master.

Sinbad Master
Performs network-aware replica placement for large blocks: periodically estimates network hotspots, takes greedy online decisions, and adds hysteresis until the next measurement.
Query: "Where to put block B?", with constraints and hints such as at least r replicas, in f fault domains, collocate with a given block. Answer: a set of locations.
Information from slaves: static (network topology; link and disk capacities) and dynamic (distribution of load across links; popularity of files).
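
For illustration, a hypothetical shape of such a placement query and its answer (these types are invented for this sketch, not Sinbad's or HDFS's API):

```java
// Hypothetical representation of a placement query; not Sinbad's actual API.
import java.util.List;
import java.util.Optional;

class PlacementRequest {
    final String blockId;
    final long blockSizeBytes;
    final int minReplicas;                      // "at least r replicas"
    final int minFaultDomains;                  // "in f fault domains"
    final Optional<String> collocateWithBlock;  // optional collocation hint

    PlacementRequest(String blockId, long blockSizeBytes, int minReplicas,
                     int minFaultDomains, Optional<String> collocateWithBlock) {
        this.blockId = blockId;
        this.blockSizeBytes = blockSizeBytes;
        this.minReplicas = minReplicas;
        this.minFaultDomains = minFaultDomains;
        this.collocateWithBlock = collocateWithBlock;
    }
}

interface PlacementService {
    // Returns the machines chosen for the replicas of the requested block,
    // picked greedily among candidates that satisfy the constraints.
    List<String> chooseLocations(PlacementRequest request);
}
```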

Evaluation
A 3000-node trace-driven simulation matched against a 100-node EC2 deployment.
1. Does it improve performance? YES.
2. Does it balance the network?
3. Does the storage remain balanced?

Faster
              Job Improv.   DFS Improv.   DFS Improv. (in-memory storage)
Simulation    1.39X         1.58X         1.79X
Experiment    1.26X         1.30X         1.60X

More Balanced
[CDFs comparing Default vs. Network-Aware placement: imbalance of load (coefficient of variation of link utilization) across rack-to-host links in the EC2 deployment, and across core-to-rack links in the Facebook-trace simulation.]

What About Storage Balance? Network is imbalanced in the short term; but, in the long term, hotspots are uniformly distributed

Three Approaches Toward Contention Mitigation
#1 Increase capacity: fatter links/interfaces, increased bisection bandwidth (fat tree, VL2, DCell, BCube, F10, ...).
#2 Decrease load: data locality, static optimization (fair scheduling, delay scheduling, Mantri, Quincy, PeriSCOPE, RoPE, Rhea, ...).
#3 Balance usage: manage elephant flows, optimize intermediate communication (Valiant load balancing (VLB), Hedera, Orchestra, Coflow, MicroTE, DevoFlow, ...).

Sinbad
Greedily steers replication traffic away from hotspots.
Improves job performance by making the network more balanced.
Improves DFS write performance while keeping the storage balanced.
Sinbad will become increasingly important as storage becomes faster.
We are planning to deploy Sinbad at
Mosharaf Chowdhury - @mosharaf

Trace Details
                               Facebook            Microsoft Bing
Period                         Oct 2010            Mar-Apr 2012
Duration                       1 week              1 month
Framework                      Hadoop MapReduce    SCOPE
Jobs                           175,000             O(10,000)
Tasks                          30 million          O(10 million)
File system                    HDFS                Cosmos
Block size                     256 MB              256 MB
Number of machines             3,000               O(1,000)
Number of racks                150                 O(100)
Core : rack oversubscription   10 : 1              Better (less oversubscribed)

Writer Characteristics
37% of all tasks write to the DFS, and writers spend large fractions of their runtime in writing: 42% of the reducers and 91% of other writers spend at least 50% of their runtime writing.
[CDF, weighted by bytes written, of the fraction of task duration spent in writes, for pre-processing/ingestion tasks, reducers, and both combined.]

Big Blocks Write Most Bytes
30% of blocks are full sized, i.e., capped at 256 MB; 35% of blocks are at least 128 MB in size. Roughly one third of the blocks generate almost all replication bytes: 256 MB blocks write 81% of the bytes, and blocks of at least 128 MB write 93% of the bytes.
[CDF of the fraction of DFS bytes vs. block size (KB).]

Big Blocks Write Most Bytes
30% of blocks are large (256 MB); 81% of bytes are written by those large blocks.

Hotspots are Stable in the Short Term[1]
Probability that a link's utilization remains stable: 0.94 over 5 s, 0.89 over 10 s, 0.80 over 20 s, 0.66 over 40 s. This is long enough[2] to write a block even if the disk is the bottleneck.
1. Utilization is considered stable if its average over the next x seconds remains within ±5% of the initial value.
2. The time to write a 256 MB block at 50 MBps write throughput is 5 seconds.

If the block size is fixed and hotspots are temporarily stable, the solution is greedy placement.
Theorem: Greedy assignment of blocks to the least-loaded link, in least-remaining-blocks-first order, is optimal for minimizing the average write times.

Utilization Estimator
Utilizations are updated using an EWMA at Δ intervals: v_new = α * v_measured + (1 - α) * v_old, with α = 0.2.
Update interval (Δ): too small a Δ creates overhead, while too large a Δ gives stale data; Δ = 1 second is used right now. Missing updates are treated conservatively, as if the link is fully loaded.
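
A minimal sketch of this estimator, with illustrative class and method names (not Sinbad's actual code):

```java
// Minimal sketch of the EWMA-based utilization estimator described above.
import java.util.HashMap;
import java.util.Map;

class UtilizationEstimator {
    private static final double ALPHA = 0.2;       // EWMA weight, as in the slides
    private static final long DELTA_MS = 1_000;    // update interval, 1 second
    private final Map<String, Double> estimate = new HashMap<>();
    private final Map<String, Long> lastUpdate = new HashMap<>();

    // v_new = alpha * v_measured + (1 - alpha) * v_old
    void onMeasurement(String linkId, double measuredUtilization, long nowMs) {
        double old = estimate.getOrDefault(linkId, measuredUtilization);
        estimate.put(linkId, ALPHA * measuredUtilization + (1 - ALPHA) * old);
        lastUpdate.put(linkId, nowMs);
    }

    // Missing updates are treated conservatively: if a link has not reported
    // within the last interval, assume it is fully loaded.
    double currentEstimate(String linkId, long nowMs) {
        Long last = lastUpdate.get(linkId);
        if (last == null || nowMs - last > DELTA_MS) return 1.0;  // fully loaded
        return estimate.get(linkId);
    }
}
```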

Utilization Estimator: Hysteresis
After each placement decision, estimates are temporarily bumped up to avoid putting too many blocks in the same location. Once the next measurement update arrives, the hysteresis is removed and the actual estimate is used again. The hysteresis is proportional to the size of the block just placed and inversely proportional to the time remaining until the next update.
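
One possible shape of that bump, assuming it approximates the extra load the just-placed block adds until fresh measurements arrive (the exact function and normalization are assumptions, not taken from the paper):

```java
// Illustrative sketch of the hysteresis adjustment described above.
class Hysteresis {
    // Temporary utilization bump after placing a block of `blockBytes` on a
    // link, given the milliseconds left until the next measurement refresh
    // and the link's capacity in bytes/sec.
    static double bump(long blockBytes, long msUntilNextUpdate, double linkCapacityBytesPerSec) {
        double secondsUntilUpdate = Math.max(msUntilNextUpdate, 1) / 1000.0;
        // Proportional to block size, inversely proportional to time remaining.
        double extraUtilization = (blockBytes / secondsUntilUpdate) / linkCapacityBytesPerSec;
        return Math.min(extraUtilization, 1.0);   // cap at fully loaded
    }
}
```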

Implementation
Implemented and integrated with HDFS as a pluggable replica placement policy, on https://github.com/facebook/hadoop-20. Slaves are integrated into the DataNode and the master into the NameNode; updates come over the Heartbeat messages. A few hundred lines of Java.

Methodology
HDFS deployment in EC2: 100 m1.xlarge nodes with 4x400 GB disks, 55 MBps/disk maximum write throughput, 700+ Mbps/node during all-to-all communication; focus on large (in terms of network bytes) jobs only.
Trace-driven simulation: detailed replay of a day-long Facebook trace (circa October 2010) on a 3000-node, 150-rack cluster with 10:1 oversubscription.

Breakdown [Time Spent]
Jobs spend varying amounts of time in writing: jobs in Bin-1 the least and jobs in Bin-5 the most. There is no clear correlation, since block writes are improved without considering job characteristics.
[Bar chart: factor of improvement in end-to-end completion time and write time for jobs grouped into Bin 1 through Bin 5 and ALL; improvements range from roughly 1.08X to 1.45X.]

Breakdown [Bytes Written]
Jobs write varying amounts to HDFS as well; ingestion and pre-processing jobs write the most. There is no clear correlation, since file characteristics are not used while selecting destinations.
[Bar chart: factor of improvement in end-to-end completion time and write time for jobs grouped by write size in MB ([0, 10), [10, 100), [100, 1E3), [1E3, 1E4), [1E4, 1E5)); improvements range from roughly 1.07X to 1.44X.]

Balanced Storage [Simulation]
[Two plots: cumulative % of bytes and % of blocks vs. rank of racks (1-150), comparing Network-Aware and Default placement. Reacting to the imbalance isn't always perfect!]

Balanced Storage [EC2]
An hour-long 100-node EC2 experiment wrote ~10 TB of data. The standard deviation of disk usage across all machines was 6.7 GB with Default placement and 15.8 GB with Network-Aware placement, against a per-machine capacity of 1600 GB: the imbalance is less than 1% of the storage capacity of each machine.