Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Transcription:

Cluster Scheduling
COS 418: Distributed Systems, Lecture 3
Michael Freedman
[Heavily based on content from Ion Stoica]

Key aspects of cloud computing
1. The illusion of infinite computing resources available on demand, eliminating the need for up-front provisioning.
2. The elimination of an up-front commitment.
3. The ability to pay for use of computing resources on a short-term basis.
From "Above the Clouds: A Berkeley View of Cloud Computing"

Two main sources of resource demand
- Services: external demand; scale supply to match demand.
- Data analysis: trade off scale and completion time, e.g., use 1 server for 10 hours vs. 10 servers for 1 hour. A source of demand elasticity!

  Type of contract       Price (m4.xlarge)
  Spot, 1 hr duration    $0.139 / hour
  Spot, 6 hr duration    $0.176 / hour
  On-demand              $0.15 / hour

Towards fuller utilization
- Source of variable demand? Search, social networks, e-commerce: usage has diurnal patterns.
- Apocryphal story: AWS exists because Amazon needed to provision for the holiday shopping season and wanted to monetize the spare capacity.
- But if you provision for peak, what about the rest of the time? Fill in with non-time-sensitive usage, e.g., various data crunching.
- E.g., Netflix using AWS at night for video transcoding.

Today's lecture
1. Metrics / goals for scheduling resources
2. System architecture for big-data scheduling

Scheduling: An old problem
- CPU allocation: multiple processes want to execute; the OS selects one to run for some amount of time.
- Bandwidth allocation: packets from multiple incoming queues want to be transmitted out some link; the switch chooses one.

What do we want from a scheduler?
- Isolation: some guarantee that misbehaved processes cannot affect me too much.
- Efficient resource usage: the resource is not idle while some process's demand is not fully satisfied (work conservation; not achieved by hard allocations).
- Flexibility: can express some sort of priorities, e.g., strict or time-based.

Single Resource: Fair Sharing
- n users want to share a resource (e.g., CPU). Solution: give each user 1/n of the shared resource.
  [Figure: three users each receive a 33% share of the CPU]
- Generalized by max-min fairness: handles the case where a user wants less than its fair share, e.g., user 1 wants no more than 20% (shares become 20%, 40%, 40%).
- Generalized by weighted max-min fairness: give weights to users according to importance, e.g., user 1 gets weight 1, user 2 gets weight 2 (shares become 33% and 66%).
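To make the fair-sharing examples above concrete, here is a minimal Python sketch of weighted max-min fair allocation via progressive filling. This is not code from the lecture; the function name and the sample demands are my own, chosen to mirror the slide's examples.

```python
# Minimal sketch (not from the lecture) of weighted max-min fair allocation:
# repeatedly split the remaining capacity among unsatisfied users in
# proportion to their weights, capping anyone whose demand is met.

def max_min_fair(capacity, demands, weights=None):
    """Return per-user allocations that are weighted max-min fair."""
    n = len(demands)
    weights = weights or [1.0] * n
    alloc = [0.0] * n
    active = set(range(n))              # users whose demand is not yet met
    remaining = float(capacity)
    while active and remaining > 1e-12:
        share = remaining / sum(weights[i] for i in active)
        satisfied = set()
        for i in active:
            give = min(weights[i] * share, demands[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if alloc[i] >= demands[i] - 1e-12:
                satisfied.add(i)
        if not satisfied:               # nobody hit their cap: we are done
            break
        active -= satisfied
    return alloc

# The slide's single-resource examples, with capacity normalized to 100%:
print(max_min_fair(100, [100, 100, 100]))             # ~[33.3, 33.3, 33.3]
print(max_min_fair(100, [20, 100, 100]))              # ~[20, 40, 40]
print(max_min_fair(100, [100, 100], weights=[1, 2]))  # ~[33.3, 66.7]
```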

Max-Min Fairness is Powerful
- Weighted fair sharing / proportional shares: user u1 gets weight 2, u2 gets weight 1.
- Priorities: give u1 weight 1000, u2 weight 1.
- Reservations: to ensure u1 gets 10%, give u1 weight 10 out of a total weight of 100.
- Deadline-based scheduling: given a job's demand and deadline, compute the user's reservation / weight.
- Isolation: users cannot affect others beyond their share.

Max-Min Fairness via Fair Queuing
- Fair queuing explained in a fluid-flow system: it reduces to bit-by-bit round robin among flows.
- Each flow receives min(r_i, f), where r_i is the flow's arrival rate and f is the link's fair rate (see below).
- Weighted Fair Queuing (WFQ): associate a weight with each flow.

Fair Rate Computation
- If the link is congested, compute f such that sum_i min(r_i, f) = C.
- Example (C = 10; arrival rates 8, 6, 2): f = 4, since min(8, 4) + min(6, 4) + min(2, 4) = 4 + 4 + 2 = 10.
- With weights, associate a weight w_i with each flow i and compute f such that sum_i min(r_i, f * w_i) = C.
- Example (C = 10; rates 8, 6, 2; weights w_1 = 3, w_2 = 1, w_3 = 1): f = 2, since min(8, 2*3) + min(6, 2*1) + min(2, 2*1) = 6 + 2 + 2 = 10.
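Numerically, the fair rate f can be found by bisection, since the total served rate is non-decreasing in f. The sketch below is illustrative (not the lecture's code) and reproduces both examples above.

```python
# Minimal sketch: find f such that sum_i min(r_i, f * w_i) = C (assuming the
# link is congested), using bisection on the monotone "served" function.

def fair_rate(rates, capacity, weights=None, iters=100):
    weights = weights or [1.0] * len(rates)
    def served(f):
        return sum(min(r, f * w) for r, w in zip(rates, weights))
    if sum(rates) <= capacity:
        return None                      # link not congested: no limiting needed
    lo, hi = 0.0, max(r / w for r, w in zip(rates, weights))
    for _ in range(iters):
        mid = (lo + hi) / 2
        if served(mid) < capacity:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The slide's examples: C = 10, arrival rates 8, 6, 2.
print(fair_rate([8, 6, 2], 10))                     # ~ 4
print(fair_rate([8, 6, 2], 10, weights=[3, 1, 1]))  # ~ 2
```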

Theoretical Properties of Max-Min Fairness
- Share guarantee: each user gets at least 1/n of the resource (but will get less if her demand is less).
- Strategy-proof: users are not better off asking for more than they need, so users have no reason to lie.

Why is Max-Min Fairness Not Enough?
- Job scheduling is not only about a single resource: tasks consume CPU, memory, network, and disk I/O.
- What are task demands today?

Heterogeneous Resource Demands
- Some tasks are CPU-intensive, some tasks are memory-intensive.
- Most tasks need roughly <2 CPU, 2 GB RAM>.
  [Figure: per-task resource demands in a 2000-node Hadoop cluster at Facebook (Oct 2010)]

How to allocate?
- Two resources: CPUs and memory.
- User 1 wants <1 CPU, 4 GB> per task; User 2 wants <3 CPU, 1 GB> per task.
- What's a fair allocation?

A Natural Policy
- Asset fairness: equalize each user's sum of resource shares.
- Cluster with 28 CPUs, 56 GB RAM.
- User 1 needs <1 CPU, 2 GB RAM> per task, i.e., <3.6% CPUs, 3.6% RAM> per task.
- User 2 needs <1 CPU, 4 GB RAM> per task, i.e., <3.6% CPUs, 7.2% RAM> per task.
- Asset fairness yields:
  - User 1: 12 tasks: <43% CPUs, 43% RAM> (sum = 86%)
  - User 2: 8 tasks: <28% CPUs, 57% RAM> (sum = 86%)

Strawman for Asset Fairness
- Problem: asset fairness violates the share guarantee.
- User 1 gets less than half of both CPUs and RAM; she would be better off in a separate cluster with half the resources.

Cheating the Scheduler
- Users are willing to game the system to get more resources.
- Real-life examples:
  - A cloud provider had quotas on map and reduce slots. Some users found out that the map quota was low, so they implemented their maps in the reduce slots!
  - A search company provided dedicated machines to users that could ensure a certain level of utilization (e.g., 80%). Users used busy-loops to inflate utilization.
- How do we achieve the share guarantee plus strategy-proofness for sharing? Generalize max-min fairness to multiple resources.

Dominant Resource Fairness (DRF)
- A user's dominant resource is the resource the user has the biggest share of.
- Example: total resources <8 CPU, 5 GB>. User 1's allocation is <2 CPU, 1 GB>, i.e., 25% of the CPUs and 20% of the RAM. User 1's dominant resource is CPU (since 25% > 20%).
- A user's dominant share is the fraction of her dominant resource that is allocated: User 1's dominant share is 25%.
- "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types." Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica. NSDI '11.
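The failure of the asset-fairness strawman can be checked mechanically. The sketch below is mine (not the lecture's); it only re-derives the slide's numbers by searching for the largest feasible allocation with equal asset sums.

```python
# Check the asset-fairness example: cluster <28 CPU, 56 GB>,
# user 1 tasks <1 CPU, 2 GB>, user 2 tasks <1 CPU, 4 GB>.

CLUSTER = (28.0, 56.0)                              # (CPUs, GB of RAM)
DEMANDS = {"u1": (1.0, 2.0), "u2": (1.0, 4.0)}      # per-task demand

def shares(user, n_tasks):
    """Fraction of each resource used if `user` runs `n_tasks` tasks."""
    d = DEMANDS[user]
    return tuple(n_tasks * d[i] / CLUSTER[i] for i in range(2))

def asset(user, n_tasks):
    """Asset-fairness objective: the user's summed resource shares."""
    return sum(shares(user, n_tasks))

best = (0, 0)
for n1 in range(29):
    for n2 in range(29):
        used = [n1 * DEMANDS["u1"][i] + n2 * DEMANDS["u2"][i] for i in range(2)]
        feasible = all(used[i] <= CLUSTER[i] for i in range(2))
        equal = abs(asset("u1", n1) - asset("u2", n2)) < 1e-9
        if feasible and equal and n1 + n2 > sum(best):
            best = (n1, n2)

n1, n2 = best                                       # expect (12, 8), as on the slide
print("u1:", n1, "tasks ->", shares("u1", n1))      # ~ (43%, 43%) of CPU, RAM
print("u2:", n2, "tasks ->", shares("u2", n2))      # ~ (29%, 57%) of CPU, RAM
# User 1 ends up below half of *both* resources: the share guarantee is violated.
```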

Dominant Resource Fairness (DRF)
- Apply max-min fairness to dominant shares: equalize the users' dominant shares.
- Example: total resources <9 CPU, 18 GB>.
  - User 1 demand: <1 CPU, 4 GB> per task; dominant resource: memory (1/9 < 4/18).
  - User 2 demand: <3 CPU, 1 GB> per task; dominant resource: CPU (3/9 > 1/18).
  [Figure: DRF allocation. User 1 runs 3 tasks <3 CPU, 12 GB>, User 2 runs 2 tasks <6 CPU, 2 GB>; each user's dominant share is 66%]

Online DRF Scheduler
- Whenever there are available resources and tasks to run: schedule a task for the user with the smallest dominant share.

Today's lecture
1. Metrics / goals for scheduling resources
2. System architecture for big-data scheduling

Many Competing Frameworks
- Many different big-data frameworks: Hadoop, Spark, Storm, Spark Streaming, Flink, GraphLab, MPI, ...
- Heterogeneity will rule: no single framework is optimal for all applications.
- So should each framework run on a dedicated cluster?
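Before moving on to system architecture, here is a minimal sketch of the online DRF rule from the slide above (illustrative code, not the lecture's), replayed on the <9 CPU, 18 GB> example:

```python
# Online DRF: repeatedly launch a task for the user with the smallest
# dominant share, as long as that user's next task still fits.

TOTAL = {"cpu": 9.0, "mem": 18.0}
DEMAND = {"u1": {"cpu": 1.0, "mem": 4.0},   # dominant resource: memory
          "u2": {"cpu": 3.0, "mem": 1.0}}   # dominant resource: CPU

alloc = {u: {"cpu": 0.0, "mem": 0.0} for u in DEMAND}
free = dict(TOTAL)

def dominant_share(user):
    return max(alloc[user][r] / TOTAL[r] for r in TOTAL)

def fits(user):
    return all(DEMAND[user][r] <= free[r] for r in TOTAL)

while True:
    candidates = sorted((u for u in DEMAND if fits(u)), key=dominant_share)
    if not candidates:
        break
    user = candidates[0]                    # smallest dominant share goes first
    for r in TOTAL:                         # launch one task for that user
        alloc[user][r] += DEMAND[user][r]
        free[r] -= DEMAND[user][r]

for u in DEMAND:
    print(u, alloc[u], "dominant share: %.0f%%" % (100 * dominant_share(u)))
# As on the slide: u1 ends with <3 CPU, 12 GB>, u2 with <6 CPU, 2 GB>,
# and both dominant shares equal about 2/3.
```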

One Framework Per Cluster: Challenges
- Inefficient resource usage: e.g., Hadoop cannot use underutilized resources from Spark; not work-conserving.
- Hard to share data: copy it or access it remotely, which is expensive.
- Hard to cooperate: e.g., it is not easy for Spark to use graphs generated by Hadoop.

Common resource sharing layer?
- Abstracts ("virtualizes") resources to frameworks.
- Enables diverse frameworks to share a cluster.
- Makes it easier to develop and deploy new frameworks.
  [Figure: Hadoop and Spark on separate dedicated clusters ("uniprogramming") vs. Hadoop and Spark on top of a shared resource management system ("multiprogramming")]

Abstraction hierarchy 101
- In a cluster: a framework (e.g., Hadoop, Spark) manages one or more jobs; a job consists of one or more tasks; a task (e.g., map, reduce) involves one or more processes executing on a single machine.
  [Figure: a framework scheduler (e.g., Job Tracker) runs Job 1 (tasks 1-4) and Job 2 (tasks 5-7) via per-node executors (e.g., Task Trackers)]

Seek fine-grained resource sharing
- Tasks are typically short: median durations of roughly 10 seconds to minutes.
- Finer-grained tasks give better data locality and failure recovery.
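For concreteness, here is a small illustrative sketch of the framework / job / task hierarchy described above. The class and field names are hypothetical and do not correspond to any real framework's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    task_id: int
    cpus: float                      # resources one task needs on one machine
    mem_gb: float
    machine: Optional[str] = None    # filled in once the task is placed

@dataclass
class Job:
    job_id: int
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Framework:
    """A framework (e.g., Hadoop, Spark) schedules its own jobs' tasks."""
    name: str
    jobs: List[Job] = field(default_factory=list)

    def pending_tasks(self) -> List[Task]:
        return [t for j in self.jobs for t in j.tasks if t.machine is None]

# Mirrors the figure: one framework, two jobs, seven tasks in total.
hadoop = Framework("Hadoop", jobs=[
    Job(1, [Task(i, cpus=1, mem_gb=2) for i in (1, 2, 3, 4)]),
    Job(2, [Task(i, cpus=1, mem_gb=2) for i in (5, 6, 7)]),
])
print(len(hadoop.pending_tasks()))   # 7
```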

Approach #1: Global scheduler
- A global scheduler takes as input organizational policies, resource availability, estimates (task durations, input sizes, transfer sizes, ...), job requirements (latency, throughput, availability), and job execution plans (task DAGs, inputs/outputs), and outputs a task schedule.
- Advantages: can produce an optimal schedule.
- Disadvantages: more complex and harder to scale (yet at Google a single scheduler handles 10,000s of servers); must anticipate future requirements and refactor existing policies.

Google's Borg
- Centralized Borgmaster + per-machine Borglets (which manage and monitor tasks).
- Goal: find machines for a given job. Example job configuration:

```
job hello = {
  runtime = { cell = ic }
  binary = ../hello_webserver
  args = { port = %port% }
  requirements = {
    RAM = 100M
    disk = 100M
    CPU = 0.1
  }
  replicas = 10000
}
```

- Used across all Google services: services (Gmail, web search, GFS) and analytics (MapReduce, streaming).
- A framework controller sends a master allocation request to Borg for the full job.
- Allocation: minimize the number / priority of preempted tasks; pick machines that already have a copy of the task's packages; spread over power/failure domains; mix high- and low-priority tasks.
- "Large-scale cluster management at Google with Borg." A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, J. Wilkes. EuroSys '15.
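As a toy illustration only, a centralized scheduler can be viewed as a scoring function over machines that encodes preferences like the ones listed above. This is not Borg's actual algorithm, and every field name below is made up.

```python
# Toy centralized placement: prefer machines that need no preemption and
# that already cache the task's package (lower score is better).

def score(machine, task):
    if machine["free_cpu"] < task["cpu"] or machine["free_ram"] < task["ram"]:
        # Would have to preempt lower-priority work: penalize by its priority.
        penalty = 100 * machine["preemptable_priority_sum"]
    else:
        penalty = 0
    cache_bonus = -10 if task["package"] in machine["cached_packages"] else 0
    return penalty + cache_bonus

def place(task, machines):
    return min(machines, key=lambda m: score(m, task))

machines = [
    {"name": "m1", "free_cpu": 0.0, "free_ram": 0.0,
     "preemptable_priority_sum": 3, "cached_packages": {"hello"}},
    {"name": "m2", "free_cpu": 2.0, "free_ram": 4e9,
     "preemptable_priority_sum": 0, "cached_packages": set()},
    {"name": "m3", "free_cpu": 2.0, "free_ram": 4e9,
     "preemptable_priority_sum": 0, "cached_packages": {"hello"}},
]
task = {"cpu": 0.1, "ram": 100e6, "package": "hello"}
print(place(task, machines)["name"])   # m3: has room and caches the package
```

A real allocator would also spread replicas across power and failure domains and mix priorities on each machine; those preferences could be added as further terms in the same score.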

Approach #2: Offers, not schedules
- Unit of allocation: the resource offer, a vector of available resources on a node, e.g., node1: <1 CPU, 1 GB>, node2: <4 CPU, 16 GB>.
- 1. The master sends resource offers to frameworks. 2. Frameworks select which offers to accept and perform their own task scheduling.
- Unlike a global scheduler, this requires another level of support.
- "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center." Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica. NSDI '11.

Mesos Architecture: Example
- Slaves continuously send status updates about their resources to the Mesos master (e.g., S1: <8 CPU, 8 GB>, S2: <8 CPU, 16 GB>, S3: <16 CPU, 16 GB>).
- The master's allocation module is a pluggable scheduler that picks which framework to send an offer to.
- Framework schedulers (e.g., the Hadoop JobTracker, the Spark JobTracker) select resources from an offer and hand tasks back to the master to launch.
- Framework executors launch tasks and may persist across tasks.
  [Figure: three slaves running Hadoop and MPI executors and tasks, the Mesos master with its allocation module, and the Hadoop and Spark framework schedulers exchanging offers and task descriptions]

Why does it work?
- A framework can wait for an offer that matches its constraints or preferences and reject offers otherwise.
- Example: a Hadoop job's input is the blue file. Reject S1, which doesn't store the blue file; accept S2 and S3, which both store it.

Two Key Questions
1. How long does a framework need to wait? Depends on the distribution of task durations and on how picky the framework is, given its hard/soft constraints.
2. How do we allocate resources of different types? Use DRF!
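Below is a minimal sketch of the offer-based control loop with hypothetical names (this is not the Mesos API): the framework declines offers from nodes that do not hold its input data and accepts data-local ones, as in the blue-file example above.

```python
# Two-level scheduling, sketched: the master offers each node's free
# resources; the framework returns tasks to launch, or [] to decline.

OFFERS = [  # (node, free CPUs, free GB, files stored on that node)
    ("S1", 8, 8, {"red"}),
    ("S2", 8, 16, {"blue"}),
    ("S3", 16, 16, {"blue"}),
]

class HadoopLikeFramework:
    def __init__(self, input_file, cpu_per_task=2, gb_per_task=2):
        self.input_file = input_file
        self.cpu, self.gb = cpu_per_task, gb_per_task

    def on_offer(self, node, cpus, gb, files):
        """Return tasks to launch on this node, or [] to decline the offer."""
        if self.input_file not in files:
            return []                          # hold out for a data-local offer
        n = min(cpus // self.cpu, gb // self.gb)
        return ["task-%s-%d" % (node, i) for i in range(n)]

framework = HadoopLikeFramework(input_file="blue")
for node, cpus, gb, files in OFFERS:           # the master iterates over offers
    tasks = framework.on_offer(node, cpus, gb, files)
    print(node, "->", tasks if tasks else "declined")
# S1 is declined (no blue file); S2 and S3 are accepted.
```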

Ramp-Up Time
- Ramp-up time: the time a job waits to get its target allocation.
- Example: the job's target allocation is k = 3; the number of nodes the job can pick from is n = 5.
  [Figure: timeline of slots S1-S5; the job becomes ready, waits for enough running tasks to finish (the ramp-up time), then runs to completion]

Improving Ramp-Up Time?
- Preemption: preempt running tasks.
- Migration: move tasks around to increase choice, e.g., Job 1 constraint set = {m1, m2, m3, m4}, Job 2 constraint set = {m1, m2}.
- What existing frameworks implement:
  - No migration: it is expensive to migrate short tasks.
  - Preemption with task killing (e.g., Dryad's Quincy): it is expensive to checkpoint data-intensive tasks.

Macro-benchmark
- Simulate a 1,000-node cluster.
- Job and task durations: Facebook traces (Oct 2010). Constraints: modeled after Google*. Allocation policy: fair sharing.
- Scheduler comparison:
  - Resource Offers: no preemption and no migration (e.g., Hadoop's Fair Scheduler + constraints).
  - Global-M: global scheduler with migration.
  - Global-MP: global scheduler with migration and preemption.
* Sharma et al., "Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters," ACM SoCC, 2011.

Facebook: Job Completion Times
  [Figure: CDF of job duration (1 to 10,000 s, log scale) with curves labeled Choosy, R. Offers, Global-M, and Global-MP]
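As a back-of-the-envelope illustration, ramp-up time for the k = 3, n = 5 example can be estimated by simulation. The assumptions here are mine (exponentially distributed remaining task times with a 10-second mean as a stand-in for "short tasks"), not numbers from the lecture or the paper.

```python
# Monte Carlo estimate of ramp-up time: the job needs k slots and can use
# any of n nodes; it ramps up once the k-th of the n running tasks finishes.
import random

def ramp_up(k=3, n=5, mean_task_s=10.0, trials=100_000):
    total = 0.0
    for _ in range(trials):
        finish = sorted(random.expovariate(1.0 / mean_task_s) for _ in range(n))
        total += finish[k - 1]        # time until the k-th slot frees up
    return total / trials

print("estimated ramp-up time: %.1f s" % ramp_up())   # ~7.8 s for these inputs
# Shorter tasks or a larger candidate set (a less picky framework) shrink the
# ramp-up time; long tasks make preemption or migration more attractive.
```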

How to allocate resources? DRF!

                    CPU        Memory
  Cluster supply    10         20
  A's demand        4 (40%)    2 (10%)
  B's demand        1 (10%)    5 (25%)

  Remaining cluster    Offer                 A's allocation         B's allocation
  (10 CPU, 20 GB)      (2 CPU, 2 GB) to A    (2 CPU, 2 GB, 20%)     (0 CPU, 0 GB, 0%)
  (8 CPU, 18 GB)       (1 CPU, 2 GB) to B    (2 CPU, 2 GB, 20%)     (1 CPU, 2 GB, 10%)
  (7 CPU, 16 GB)       (1 CPU, 3 GB) to B    (2 CPU, 2 GB, 20%)     (2 CPU, 5 GB, 25%)
  (6 CPU, 13 GB)       (1 CPU, 6 GB) to A    (2 CPU, 2 GB, 20%)     (2 CPU, 5 GB, 25%)
  (6 CPU, 13 GB)       (1 CPU, 6 GB) to B    (2 CPU, 2 GB, 20%)     (3 CPU, 11 GB, 55%)
  (5 CPU, 7 GB)        (3 CPU, 2 GB) to A    (5 CPU, 4 GB, 50%)     (3 CPU, 11 GB, 55%)

(The percentages are dominant shares. Each offer goes to the framework with the smaller dominant share at that point; in the fourth step A declines the <1 CPU, 6 GB> offer because it does not match A's task shape, so it is re-offered to B.)

Today's lecture
1. Metrics / goals for scheduling resources: max-min fairness, weighted fair queuing, DRF.
2. System architecture for big-data scheduling: central allocator (Borg), two-level resource offers (Mesos).