Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center


Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica. University of California, Berkeley

Background: Rapid innovation in cluster computing frameworks, e.g. CIEL, Pregel, Pig, Dryad, Percolator.

Problem: Rapid innovation in cluster computing frameworks. No single framework is optimal for all applications. Want to run multiple frameworks in a single cluster:
» to maximize utilization
» to share data between frameworks

Where We Want to Go. Today: static partitioning (separate Hadoop, Pregel, and MPI clusters). Mesos: dynamic sharing of a single shared cluster.

Solution: Mesos is a common resource sharing layer over which diverse frameworks can run. [Diagram: Hadoop and Pregel each running on dedicated nodes vs. Hadoop and Pregel running side by side on Mesos across all nodes.]

Other Benefits of Mesos. Run multiple instances of the same framework:
» isolate production and experimental jobs
» run multiple versions of the framework concurrently
Build specialized frameworks targeting particular problem domains:
» better performance than general-purpose abstractions

Outline: Mesos Goals and Architecture, Implementation, Results, Related Work.

Mesos Goals: High utilization of resources. Support for diverse frameworks (current & future). Scalability to 10,000's of nodes. Reliability in the face of failures. Resulting design: a small microkernel-like core that pushes scheduling logic to frameworks.

Design Elements. Fine-grained sharing:
» allocation at the level of tasks within a job
» improves utilization, latency, and data locality
Resource offers:
» simple, scalable application-controlled scheduling mechanism

Element 1: Fine-Grained Sharing. [Diagram: under coarse-grained sharing (HPC), each of Frameworks 1-3 statically owns a block of nodes; under fine-grained sharing (Mesos), tasks from all three frameworks are interleaved across every node. Both run over a shared storage system (e.g. HDFS).] + Improved utilization, responsiveness, data locality.

Element 2: Resource Offers. Option: Global scheduler
» frameworks express needs in a specification language; a global scheduler matches them to resources
+ Can make optimal decisions
- Complex: language must support all framework needs
- Difficult to scale and to make robust
- Future frameworks may have unanticipated needs

Element 2: Resource Offers. Mesos: Resource offers
» offer available resources to frameworks; let them pick which resources to use and which tasks to launch
+ Keeps Mesos simple, lets it support future frameworks
- Decentralized decisions might not be optimal

Mesos Architecture. [Diagram: an MPI job and a Hadoop job each submit to their own framework scheduler; the Mesos master, via its allocation module, picks a framework to offer resources to and sends it a resource offer; Mesos slaves run the executors (e.g. MPI executors) that execute tasks.]

Mesos Architecture. Resource offer = list of (node, availableResources), e.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }. [Diagram as above: the allocation module picks which framework to offer the resources to.]
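
As an illustration, the offer above maps naturally onto a small data structure; a minimal C++ sketch with hypothetical names (not the actual Mesos types):

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch of a resource offer: a list of
// (node, availableResources) pairs, as in the example above.
struct Resources {
  double cpus;      // CPU cores
  uint64_t memMB;   // memory in megabytes
};

struct NodeOffer {
  std::string node;
  Resources available;
};

int main() {
  // E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
  std::vector<NodeOffer> offer = {
      {"node1", {2.0, 4096}},
      {"node2", {3.0, 2048}},
  };
  for (const auto& o : offer)
    std::cout << o.node << ": " << o.available.cpus << " CPUs, "
              << o.available.memMB << " MB\n";
  return 0;
}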

Mesos Architecture. [Diagram: the Hadoop scheduler performs framework-specific scheduling, replying to its resource offer with tasks to launch; Mesos slaves launch and isolate the MPI and Hadoop executors that run those tasks.]

Optimization: Filters. Let frameworks short-circuit rejection by providing a predicate on resources to be offered
» e.g. nodes from list L, or nodes with > 8 GB RAM
» could generalize to other hints as well
The ability to reject still ensures correctness when needs cannot be expressed using filters.
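
A filter is simply a predicate over candidate resources that the master evaluates before making an offer. A minimal sketch, with hypothetical names rather than Mesos's real filter interface, expressing the two example filters above:

#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical node description used by this sketch.
struct Node {
  std::string name;
  uint64_t memMB;
};

// A filter is a predicate: the master only offers this framework
// resources on nodes for which the predicate holds.
using Filter = std::function<bool(const Node&)>;

// "nodes from list L"
Filter fromList(std::set<std::string> allowed) {
  return [allowed = std::move(allowed)](const Node& n) {
    return allowed.count(n.name) > 0;
  };
}

// "nodes with > 8 GB RAM"
Filter minMemory(uint64_t minMB) {
  return [minMB](const Node& n) { return n.memMB > minMB; };
}

int main() {
  Filter f = minMemory(8 * 1024);
  Node small{"node1", 4096}, big{"node2", 16384};
  // Only 'big' passes; offers for 'small' are short-circuited.
  return (f(big) && !f(small)) ? 0 : 1;
}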

Implementation

Implementation Stats: 20,000 lines of C++. Master failover using ZooKeeper. Frameworks ported: Hadoop, MPI, Torque. New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop). Open source in the Apache Incubator.

Users: Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing). Berkeley machine learning researchers are running several algorithms at scale on Spark. Conviva is using Spark for data analytics. UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps.

Results:
» Utilization and performance vs. static partitioning
» Framework placement goals: data locality
» Scalability
» Fault recovery

Dynamic Resource Sharing

Mesos vs Static Partitioning: Compared performance with a statically partitioned cluster where each framework gets 25% of the nodes.

Framework            Speedup on Mesos
Facebook Hadoop Mix  1.14
Large Hadoop Mix     2.10
Spark                1.26
Torque / MPI         0.96

Data Locality with Resource Offers: Ran 16 instances of Hadoop on a shared HDFS cluster. Used delay scheduling [EuroSys 10] in Hadoop to get locality (wait a short time to acquire data-local nodes). [Chart: fraction of local map tasks and job running times; a 1.7× improvement with delay scheduling.]

Scalability: Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling. Result: scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30s length). [Chart: task start overhead in seconds vs. number of slaves, remaining under one second up to 50,000 slaves.]

Fault Tolerance: The Mesos master has only soft state: the list of currently running frameworks and tasks. It is rebuilt when frameworks and slaves re-register with the new master after a failure. Result: fault detection and recovery in ~10 sec.
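
One way to picture the soft-state design: a newly elected master starts empty and reconstructs its tables purely from re-registration messages. A minimal sketch with hypothetical message and type names:

#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: a newly elected master starts with empty tables
// and rebuilds them entirely from re-registration messages, because all
// of its state is soft (derivable from frameworks and slaves).
struct TaskInfo {
  std::string taskId;
  std::string slaveId;
};

struct Master {
  std::map<std::string, std::vector<TaskInfo>> tasksByFramework;
  std::map<std::string, bool> activeSlaves;

  // A framework reconnects after failover and reports the tasks
  // it believes are still running.
  void reregisterFramework(const std::string& frameworkId,
                           std::vector<TaskInfo> runningTasks) {
    tasksByFramework[frameworkId] = std::move(runningTasks);
  }

  // A slave reconnects to the new master.
  void reregisterSlave(const std::string& slaveId) {
    activeSlaves[slaveId] = true;
  }
};

int main() {
  Master m;  // freshly elected, no state carried over
  m.reregisterSlave("slave1");
  m.reregisterFramework("hadoop", {{"task-1", "slave1"}});
  return m.tasksByFramework.count("hadoop") == 1 ? 0 : 1;
}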

Related Work. HPC schedulers (e.g. Torque, LSF, Sun Grid Engine)
» coarse-grained sharing for inelastic jobs (e.g. MPI)
Virtual machine clouds
» coarse-grained sharing similar to HPC
Condor
» centralized scheduler based on matchmaking
Parallel work: Next-Generation Hadoop
» redesign of Hadoop to have per-application masters
» also aims to support non-MapReduce jobs
» based on a resource request language with locality preferences

Conclusion: Mesos shares clusters efficiently among diverse frameworks thanks to two design elements:
» fine-grained sharing at the level of tasks
» resource offers, a scalable mechanism for application-controlled scheduling
This enables co-existence of current frameworks and development of new specialized ones. In use at Twitter, UC Berkeley, Conviva, and UCSF.

Backup Slides

Framework Isolation: Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects. Containers currently support CPU, memory, I/O, and network bandwidth isolation. Not perfect, but much better than no isolation.
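
For illustration only, a sketch of the kind of OS mechanism involved: capping a process tree's memory through the Linux cgroup (v1) filesystem. The group path and limit are made up, and this is not Mesos's actual isolation code:

#include <cstdlib>
#include <fstream>
#include <string>
#include <unistd.h>

// Illustrative sketch only: cap a process tree's memory via the Linux
// cgroup v1 filesystem. Requires root and a mounted memory controller.
int main() {
  const std::string cg = "/sys/fs/cgroup/memory/executor_demo";

  // Creating a directory in the cgroup filesystem creates a new group.
  std::system(("mkdir -p " + cg).c_str());

  // Set a 4 GB memory limit for the group.
  std::ofstream(cg + "/memory.limit_in_bytes") << (4ULL << 30);

  // Move this process (and its future children) into the group.
  std::ofstream(cg + "/tasks") << getpid();

  // Anything exec'd from here on runs under the 4 GB limit.
  execlp("sh", "sh", "-c", "echo running under the limit", (char*)nullptr);
  return 1;  // reached only if exec fails
}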

Analysis: Resource offers work well when:
» frameworks can scale up and down elastically
» task durations are homogeneous
» frameworks have many preferred nodes
These conditions hold in current data analytics frameworks (MapReduce, Dryad, ...):
» work is divided into short tasks to facilitate load balancing and fault recovery
» data is replicated across multiple nodes

Revocation: Mesos allocation modules can revoke (kill) tasks to meet organizational SLOs. The framework is given a grace period to clean up. A guaranteed-share API lets frameworks avoid revocation by staying below a certain share.
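
A sketch of how an allocation module might pick revocation victims under this policy; the types and names are hypothetical, and grace-period handling is elided:

#include <string>
#include <vector>

// Hypothetical sketch of revocation policy in an allocation module:
// tasks of frameworks above their guaranteed share may be revoked
// (after a grace period, elided here); frameworks at or below their
// guaranteed share are left alone.
struct FrameworkUsage {
  std::string id;
  double share;            // current fraction of the cluster
  double guaranteedShare;  // staying below this avoids revocation
};

std::vector<std::string> revocationCandidates(
    const std::vector<FrameworkUsage>& frameworks) {
  std::vector<std::string> candidates;
  for (const auto& f : frameworks)
    if (f.share > f.guaranteedShare)
      candidates.push_back(f.id);
  return candidates;
}

int main() {
  std::vector<FrameworkUsage> fws = {
      {"hadoop", 0.50, 0.25},  // above its guarantee: revocable
      {"mpi",    0.20, 0.25},  // below its guarantee: safe
  };
  return revocationCandidates(fws).size() == 1 ? 0 : 1;
}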

Mesos API

Scheduler callbacks:
  resourceOffer(offerId, offers)
  offerRescinded(offerId)
  statusUpdate(taskId, status)
  slaveLost(slaveId)

Scheduler actions:
  replyToOffer(offerId, tasks)
  setNeedsOffers(bool)
  setFilters(filters)
  getGuaranteedShare()
  killTask(taskId)

Executor callbacks:
  launchTask(taskDescriptor)
  killTask(taskId)

Executor actions:
  sendStatus(taskId, status)
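
Putting the API together, a framework scheduler mainly answers resource offers with tasks. A skeletal sketch against the callback and action names above, with simplified, hypothetical signatures (not the real libmesos interface):

#include <cstdio>
#include <string>
#include <vector>

// Simplified, hypothetical stand-ins for the API's types.
struct Offer {
  std::string offerId;
  std::string node;
  double cpus;
};

struct TaskDescriptor {
  std::string taskId;
  std::string node;
};

// Scheduler action (stub: in reality this would message the master).
void replyToOffer(const std::string& offerId,
                  const std::vector<TaskDescriptor>& tasks) {
  std::printf("offer %s: launching %zu task(s)\n", offerId.c_str(),
              tasks.size());
}

// Scheduler callbacks, mirroring the listing above.
struct MyScheduler {
  int nextTask = 0;

  void resourceOffer(const std::vector<Offer>& offers) {
    for (const auto& o : offers) {
      std::vector<TaskDescriptor> tasks;
      if (o.cpus >= 1.0)  // pick only resources we can actually use
        tasks.push_back({"task-" + std::to_string(nextTask++), o.node});
      replyToOffer(o.offerId, tasks);  // an empty reply declines the offer
    }
  }

  void statusUpdate(const std::string& taskId, const std::string& status) {
    std::printf("%s -> %s\n", taskId.c_str(), status.c_str());
  }

  void slaveLost(const std::string& slaveId) {
    std::printf("lost slave %s; its tasks will be rescheduled\n",
                slaveId.c_str());
  }
};

int main() {
  MyScheduler sched;
  sched.resourceOffer({{"offer-1", "node1", 2.0}, {"offer-2", "node2", 0.5}});
  sched.statusUpdate("task-0", "RUNNING");
  return 0;
}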