Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

1 Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Ion Stoica, UC Berkeley
Joint work with: A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, S. Shenker, and M. Zaharia

2 Challenge
- Rapid innovation in cloud computing: Dryad, Hypertable, Cassandra, Pregel
- No single framework is optimal for all applications
- Running each framework on its own dedicated cluster is
  - Expensive
  - Hard to share data
- Need to run multiple frameworks on the same cluster

3 Computation Model
- Each framework (e.g., Hadoop, Torque) manages and runs one or more jobs
- A job consists of one or more tasks
- A task (e.g., map, reduce) consists of one or more processes running on the same machine
- Today, a framework assumes it controls the cluster
  - Frameworks implement sophisticated schedulers, e.g., Quincy, LATE, Fair sharing, Maui, ...

4 Computation Environment
                  Cloud computing             HPC                            Grid
Typical workload  Data intensive              Computation intensive          Computation intensive
Task granularity  Small (median < 1 min)      Large                          Large
Hardware          Commodity                   Specialized                    Commodity
Data              Distributed across nodes    Partitioned (usually, comp.    Partitioned
                  (e.g., 10 TB per node);     & storage different)
                  need data locality
- Small task granularity calls for fine-grain scheduling

5 Where We Want to Go
- Today: static partitioning (uniprogrammed)
- Mesos: dynamic sharing (multiprogrammed)
[Figure: Hadoop, Pregel, and MPI on a shared cluster]

6 Solution
- Mesos is a common resource sharing layer over which diverse frameworks can run
[Figure: Hadoop and MPI running on top of Mesos across a set of nodes]

7 Mesos Goals
- High utilization of resources
- Support diverse frameworks (current & future)
- Scalability to 10,000s of nodes
- Reliability in face of failures
- Resulting design: small microkernel-like core that pushes scheduling logic to frameworks

8 Mesos Provides
- Fine-grained sharing of resources across frameworks
  - Unit of scheduling: task
- Execution environment for tasks
- Isolation
- Focus of this talk: resource management and scheduling

9 Approach: Global Scheduler
- Inputs to the global scheduler: organization policies, resource availability, framework/job requirements, job/task estimated running times
- Output: task schedule
- Advantages: can achieve optimal schedule
- Disadvantages:
  - Complexity: hard to scale and ensure resilience
  - Hard to anticipate future frameworks' requirements
  - Need to refactor existing frameworks

10 Our Approach: Distributed Scheduler
- Inputs to the master (global scheduler): organization policies, resource availability
- The master produces a framework schedule; each framework scheduler produces its own task schedule
- Advantages:
  - Simple: easier to scale and make resilient
  - Easy to port existing frameworks, support new ones
- Disadvantages:
  - Distributed scheduling decision not optimal

11 Resource Offers
- Unit of allocation: resource offer
  - Vector of available resources on a node
  - E.g., m1: <1 cpu, 1GB>, m2: <4 cpu, 16GB>
- Master sends resource offers to frameworks
- Frameworks select which offers to accept and which tasks to run
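The offer mechanism on this slide can be illustrated with a small sketch. It is not the Mesos API: the `Offer` and `Task` classes and the `accept_offers` function are hypothetical names introduced here, and the greedy first-fit placement is only one policy a framework scheduler might choose.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Offer:
    """Hypothetical resource offer: free resources on one node."""
    node: str
    cpus: float
    mem_gb: float

@dataclass(frozen=True)
class Task:
    """Hypothetical task with per-task resource needs."""
    name: str
    cpus: float
    mem_gb: float

def accept_offers(offers, pending):
    """Framework-side decision (assumed policy: greedy first fit).

    Returns (launches, declined), where launches is a list of
    (node, task name) pairs and declined lists nodes whose offers
    were not used at all.
    """
    pending = list(pending)
    launches, declined = [], []
    for offer in offers:
        cpus, mem = offer.cpus, offer.mem_gb
        used = False
        for task in list(pending):
            if task.cpus <= cpus and task.mem_gb <= mem:
                launches.append((offer.node, task.name))
                cpus -= task.cpus
                mem -= task.mem_gb
                pending.remove(task)
                used = True
        if not used:
            declined.append(offer.node)
    return launches, declined

# The two offers from the slide: m1: <1 cpu, 1GB>, m2: <4 cpu, 16GB>
offers = [Offer("m1", 1, 1), Offer("m2", 4, 16)]
tasks = [Task("t1", 2, 4), Task("t2", 2, 4), Task("t3", 2, 4)]
launches, declined = accept_offers(offers, tasks)
print(launches, declined)
```

In this sketch the framework declines m1 because no task fits, places two tasks on m2, and keeps the third task pending for a later offer.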

12 Mesos Architecture
- Slaves continuously send status updates about resources
- Allocation module in the Mesos master: pluggable scheduler to pick which framework to send an offer to
- Framework scheduler (e.g., Hadoop JobTracker, MPI scheduler) selects resources and provides tasks
- Framework executors launch tasks and may persist across tasks
[Figure: Mesos master with allocation module sending resource offers to the framework schedulers; slaves 1-3 running Hadoop and MPI executors with their tasks]

13 Outline
- Global scheduler: Dominant Resource Fairness
- Analytic evaluation of distributed scheduler efficiency
- Experimental results

14 Single Resource: Fair Sharing
- n users want to share a resource (e.g., CPU)
  - Solution: allocate each 1/n of the shared resource
- Generalized by max-min fairness
  - Handles the case where a user wants less than its fair share
  - E.g., user 1 wants no more than 20%
- Generalized by weighted max-min fairness
  - Give weights to users according to importance
  - User 1 gets weight 1, user 2 weight 2
[Figure: CPU shares of 33%/33%/33% (equal split), 20%/40%/40% (max-min), and 33%/66% (weighted max-min)]
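The three cases on this slide can be reproduced with a short water-filling sketch. The function name `max_min_share` and its signature are illustrative choices made here, not something taken from the talk; demands of `float("inf")` stand for users who will take as much as they are given.

```python
def max_min_share(capacity, demands, weights=None):
    """Weighted max-min fair allocation of one resource (water-filling).

    capacity: total amount of the resource
    demands:  per-user maximum demand (float("inf") if unbounded)
    weights:  optional per-user weights (default: all equal)
    """
    n = len(demands)
    weights = weights or [1.0] * n
    alloc = [0.0] * n
    active = set(range(n))
    remaining = float(capacity)
    while active and remaining > 1e-12:
        # Amount each active user would get per unit of weight.
        unit = remaining / sum(weights[i] for i in active)
        satisfied = {i for i in active
                     if demands[i] - alloc[i] <= unit * weights[i] + 1e-12}
        if not satisfied:
            # Nobody is capped: split what is left by weight and stop.
            for i in active:
                alloc[i] += unit * weights[i]
            break
        # Cap satisfied users at their demand and redistribute the rest.
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active -= satisfied
    return alloc

INF = float("inf")
equal = max_min_share(100, [INF, INF, INF])        # three equal users
capped = max_min_share(100, [20, INF, INF])        # user 1 wants at most 20%
weighted = max_min_share(100, [INF, INF], [1, 2])  # weights 1 and 2
print(equal, capped, weighted)
```

With a capacity of 100 the results correspond to the 33/33/33, 20/40/40, and 33/66 splits shown in the slide's chart.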

15 Why Max-Min Fairness?
- Weighted fair sharing / proportional shares
  - User 1 gets weight 2, user 2 weight 1
- Priorities
  - Give user 1 weight 1000, user 2 weight 1
- Reservations
  - Ensure user 1 gets 10% of a resource
  - Give user 1 weight 10, sum of weights 100
- Isolation policy
  - Users cannot affect others beyond their fair share
- Widely used:
  - OS: RR, prop. sharing, lottery, Linux's CFS, ...
  - Networking: WFQ, WF2Q, SFQ, DRR, CSFQ, ...
  - Datacenters: Hadoop's fair scheduler, capacity scheduler, Quincy

16 Properties of Max-Min Fairness
- Share guarantee
  - Each user gets at least 1/n of the resource
  - But will get less if her demand is less
- Strategy-proof
  - Users are not better off by asking for more than they need
  - Users have no reason to lie
- Max-min fairness is the only reasonable mechanism with these two properties

17 Why is Max-Min Fairness not Enough?
- Job scheduling in datacenters is not only about a single resource
  - Jobs consume CPU, memory, disk, and I/O
- What are task demands today?

18 Heterogeneous Resource Demands
- Some tasks are CPU-intensive
- Most tasks need ~ <2 CPU, 2 GB RAM>
- Some tasks are memory-intensive
[Figure: task resource demands on a 2000-node Hadoop cluster at Facebook (Oct 2010)]

19 Problem
- Single resource example
  - 1 resource: CPU
  - User 1 wants <1 CPU> per task
  - User 2 wants <3 CPU> per task
  - [Figure: CPU split 50%/50% between the two users]
- Multi-resource example
  - 2 resources: CPUs & mem
  - User 1 wants <1 CPU, 4 GB> per task
  - User 2 wants <3 CPU, 1 GB> per task
  - What's a fair allocation?

20 A Natural Policy
- Asset fairness
  - Equalize each user's sum of resource shares
- Cluster with 70 CPUs, 70 GB RAM
  - U1 needs <2 CPU, 2 GB RAM> per task
  - U2 needs <1 CPU, 2 GB RAM> per task
- Asset fairness yields
  - U1: 15 tasks: 30 CPUs, 30 GB (sum = 60)
  - U2: 20 tasks: 20 CPUs, 40 GB (sum = 60)
- Problem: violates share guarantee
  - User 1 has < 50% of both CPUs and RAM
  - Better off in a separate cluster with 50% of the resources
[Figure: User 1 holds 43% of CPU and 43% of RAM; User 2 holds 28% of CPU and 57% of RAM]
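The numbers on this slide can be checked with a few lines of arithmetic. The sketch below takes the allocation stated on the slide (15 and 20 tasks) as given rather than deriving it; the helper names are made up for this illustration, and the "private half cluster" is assumed to hold 35 CPUs and 35 GB.

```python
from fractions import Fraction

TOTAL = {"cpu": 70, "ram": 70}
DEMAND = {"U1": {"cpu": 2, "ram": 2}, "U2": {"cpu": 1, "ram": 2}}
TASKS = {"U1": 15, "U2": 20}  # allocation stated on the slide

def resource_shares(user):
    """Fraction of each cluster resource held by the user."""
    return {r: Fraction(TASKS[user] * DEMAND[user][r], TOTAL[r]) for r in TOTAL}

def asset_share(user):
    """Sum of the user's per-resource shares (what asset fairness equalizes)."""
    return sum(resource_shares(user).values())

def tasks_in_private_half(user):
    """Tasks the user could run alone on half of every resource."""
    return min((TOTAL[r] // 2) // DEMAND[user][r] for r in TOTAL)

for user in TASKS:
    shares = {r: round(float(s), 3) for r, s in resource_shares(user).items()}
    print(user, shares, "asset share:", asset_share(user))
print("U1 tasks in a private half cluster:", tasks_in_private_half("U1"))
```

Both asset shares come out equal, yet U1 holds less than half of each resource and could run more tasks in a private half cluster than the 15 it gets here, which is the share-guarantee violation the slide points out.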

21 Challenge
- Can we find a fair sharing policy that provides
  - Share guarantee
  - Strategy-proofness
- Max-min fairness for a single resource had these properties
- Can we generalize max-min fairness to multiple resources?

22 Cheating the Scheduler
- Users are willing to game the system to get more resources
- Real-life examples
  - A cloud provider had quotas on map and reduce slots. Some users found out that the map quota was low. Users implemented maps in the reduce slots!
  - A search company provided dedicated machines to users that could ensure a certain level of utilization (e.g., 80%). Users used busy-loops to inflate utilization

23 Dominant Resource Fairness
- A user's dominant resource is the resource the user has the biggest share of
  - Example: total resources: <10 CPU, 4 GB>; User 1's allocation: <2 CPU, 1 GB>
  - Dominant resource is memory, as 1/4 > 2/10 (1/5)
- A user's dominant share is the fraction of the dominant resource she is allocated
  - User 1's dominant share is 25% (1/4)
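The definition on this slide translates directly into code. The function name `dominant_share` and the dictionary layout are assumptions made for this sketch; amounts are integers so that exact fractions can be used.

```python
from fractions import Fraction

def dominant_share(allocation, total):
    """Return (dominant resource, dominant share) for one user.

    allocation and total map resource names to integer amounts.
    """
    shares = {r: Fraction(allocation[r], total[r]) for r in total}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]

# Example from the slide: total <10 CPU, 4 GB>, allocation <2 CPU, 1 GB>
resource, share = dominant_share({"cpu": 2, "mem": 1}, {"cpu": 10, "mem": 4})
print(resource, share)
```

For the slide's example this yields memory as the dominant resource with a dominant share of 1/4.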

24 Dominant Resource Fairness (2)
- Apply max-min fairness to dominant shares
- Equalize the dominant share of the users
  - Example: total resources: <9 CPU, 18 GB>
  - User 1 demand: <1 CPU, 4 GB>, dominant resource: mem
  - User 2 demand: <3 CPU, 1 GB>, dominant resource: CPU
[Figure: User 1 gets 3 CPUs and 12 GB, User 2 gets 6 CPUs and 2 GB; each has a dominant share of 66% (CPU: 9 total, mem: 18 total)]
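The allocation in this example can be derived by hand. As a worked sketch, let x and y be the number of tasks given to User 1 and User 2; these symbols are introduced here and do not appear on the slide.

```latex
\begin{aligned}
&\text{CPU constraint:} && x + 3y \le 9 \\
&\text{Memory constraint:} && 4x + y \le 18 \\
&\text{Equal dominant shares:} && \frac{4x}{18} = \frac{3y}{9}
  \;\Longrightarrow\; y = \frac{2x}{3} \\
&\text{Substitute into CPU:} && x + 3 \cdot \frac{2x}{3} = 3x \le 9
  \;\Longrightarrow\; x \le 3 \\
&\text{Substitute into memory:} && 4x + \frac{2x}{3} = \frac{14x}{3} \le 18
  \;\Longrightarrow\; x \le \frac{27}{7} \\
&\text{CPU binds first:} && x = 3,\quad y = 2
\end{aligned}
```

User 1 therefore receives <3 CPU, 12 GB> and User 2 receives <6 CPU, 2 GB>, and both dominant shares equal 12/18 = 6/9 = 2/3, matching the 66% in the example.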

25 Online DRF Scheduler
- Whenever there are available resources and tasks to run:
  - Schedule a task to the user with the smallest dominant share
- O(log n) time per decision using binary heaps
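A minimal sketch of this scheduling loop, using Python's `heapq` as the binary heap. It makes simplifying assumptions that are not from the talk: every user always has more tasks to run, resources are never released, and a user whose next task no longer fits is simply dropped. The function name `drf_schedule` is hypothetical.

```python
import heapq
from fractions import Fraction

def drf_schedule(total, demands):
    """Repeatedly launch a task for the user with the smallest dominant share.

    total:   {resource: integer capacity}
    demands: {user: {resource: integer amount per task}}
    Returns (tasks launched per user, resources used).
    """
    used = {r: 0 for r in total}
    tasks = {u: 0 for u in demands}
    # Heap of (dominant share, user); ties are broken by user name.
    heap = [(Fraction(0), u) for u in sorted(demands)]
    heapq.heapify(heap)
    while heap:
        _, user = heapq.heappop(heap)  # O(log n) per decision
        need = demands[user]
        if any(used[r] + need[r] > total[r] for r in total):
            continue  # assumption: drop users whose next task does not fit
        for r in total:
            used[r] += need[r]
        tasks[user] += 1
        share = max(Fraction(tasks[user] * need[r], total[r]) for r in total)
        heapq.heappush(heap, (share, user))
    return tasks, used

# Demands and capacities from the Dominant Resource Fairness (2) example
tasks, used = drf_schedule(
    {"cpu": 9, "mem": 18},
    {"user1": {"cpu": 1, "mem": 4}, "user2": {"cpu": 3, "mem": 1}},
)
print(tasks, used)
```

On that example the loop ends with three tasks for user 1 and two for user 2, the same allocation as on the preceding slide.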

26 Why not Use Pricing?
- Approach
  - Set prices for each good
  - Let users buy what they want
- Problem
  - How do we determine the right prices for different goods?

27 How Would an Economist Solve It?
- Let the market determine the prices
- Competitive Equilibrium from Equal Incomes (CEEI)
  - Give each user 1/n of every resource
  - Let users trade in a perfectly competitive market
- Not strategy-proof!

28 DRF vs CEEI
- User 1: <1 CPU, 4 GB>, User 2: <3 CPU, 1 GB>
  - DRF more fair, CEEI better utilization
- User 1: <1 CPU, 4 GB>, User 2: <3 CPU, 2 GB>
  - User 2 increased her share of both CPU and memory by lying!
[Figure: CPU and memory shares of user 1 and user 2 under Dominant Resource Fairness and Competitive Equilibrium from Equal Incomes for both demand sets; bar labels 66%/66% (DRF) and 55%/91% (CEEI) in the first case, 66%/66% (DRF) and 60%/80% (CEEI) in the second]

29 Gaming Utilization-Optimal Schedulers
- Cluster with <100 CPU, 100 GB>
- 2 users, each demanding <1 CPU, 2 GB> per task
- User 1 lies and demands <2 CPU, 2 GB>
- Utilization-optimal scheduler prefers user 1
[Figure: CPU and memory shares of user 1 and user 2, before and after user 1 lies]

30 Properties of Policies
- Policies compared: Asset, CEEI, DRF
- Properties: share guarantee, strategy-proofness, Pareto efficiency, envy-freeness, single resource fairness, bottleneck resource fairness, population monotonicity, resource monotonicity
- Assuming non-zero demands, DRF is the only allocation that is strategy-proof and provides sharing incentive (Eric Friedman, Cornell)

31 Outline
- Global scheduler: Dominant Resource Fairness
- Analytic evaluation of distributed scheduler efficiency
- Experimental results

32 How Well Does the Distributed Scheduler Work?
- Simplified model for workload
- Evaluate main performance metrics
  - Job waiting time
  - Completion time
  - Utilization
[Figure: master (scheduler) takes organization policies and resource availability and produces the framework schedule; framework schedulers take framework/job requirements and job/task estimated running times and produce the task schedule]

33 Workload Characteristics
- Elastic vs. rigid jobs
  - Elastic: can dynamically adjust allocation during execution (e.g., Hadoop, Spark)
  - Rigid: cannot run before they acquire all resources (e.g., MPI)
- Picky vs. non-picky jobs
  - Picky: can only run on a subset of resources (e.g., need GPU, public IP address)
- Homogeneous vs. heterogeneous task durations

34 Metrics
- Job ramp-up time: time it takes a new framework to achieve its target allocation (e.g., fair share)
- Job completion time: time it takes a job to complete, assuming one job per framework
- Cluster utilization

35 Simplified System Model
- One job per framework
- Allocation granularity: slot
- Each framework has a target allocation, determined by allocation policy
  - E.g., fair sharing, reservation, priority

36 Ideal System
- Job gets its target allocation instantaneously
  - k = target allocation (slots)
  - T = average task duration
  - m = number of tasks in a job
- Job completion time: (m/k) T
[Figure: # of slots in system vs. time; the job holds its target allocation k for (m/k) T, from job ready/start to job end]

37 Elastic Framework
- Can scale allocation up and down during job execution
- Longer completion time: takes time to ramp up to target allocation
- High utilization: can use a slot as soon as it becomes available
[Figure: # of slots in system vs. time, ramping up to the target allocation between job ready/start and job end]

38 Rigid Framework
- Cannot start before achieving target allocation
- Higher completion time
- Lower utilization: cannot use slots as soon as it acquires them
[Figure: # of slots in system vs. time, with wasted resources while ramping up to the target allocation between job ready/start and job end]

39 Analysis Result Summary
- T: mean task duration
- k: target allocation of job (in slots)
- m: number of job's tasks
- β T: job completion time in ideal system, where β = m/k
- Good performance: within T of the ideal completion time only for larger jobs, and only if β >> ln k

                 Elastic framework            Rigid framework
                 Const. dist    Exp. dist     Const. dist      Exp. dist
Ramp-up time     T              (ln k) T      T                (ln k) T
Completion time  (1/2 + β) T    (1 + β) T     (1 + β) T        (ln k + β) T
Utilization      1              1             β / (1/2 + β)    β / (ln k + β - 1)
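The closed-form expressions on this slide are easy to evaluate for concrete values. The sketch below only plugs numbers into those expressions; the function name `analysis_summary`, the dictionary layout, and the sample values of m, k, and T are illustrative assumptions.

```python
import math

def analysis_summary(m, k, T):
    """Evaluate the slide's expressions for a job of m tasks,
    target allocation k slots, and mean task duration T."""
    beta = m / k          # ideal completion time is beta * T
    ln_k = math.log(k)
    return {
        "ideal_completion": beta * T,
        ("elastic", "const"): {
            "ramp_up": T,
            "completion": (0.5 + beta) * T,
            "utilization": 1.0,
        },
        ("elastic", "exp"): {
            "ramp_up": ln_k * T,
            "completion": (1 + beta) * T,
            "utilization": 1.0,
        },
        ("rigid", "const"): {
            "ramp_up": T,
            "completion": (1 + beta) * T,
            "utilization": beta / (0.5 + beta),
        },
        ("rigid", "exp"): {
            "ramp_up": ln_k * T,
            "completion": (ln_k + beta) * T,
            "utilization": beta / (ln_k + beta - 1),
        },
    }

# Hypothetical job: 1000 tasks, 100 slots, 1 time unit per task
summary = analysis_summary(m=1000, k=100, T=1.0)
for key, value in summary.items():
    print(key, value)
```

Comparing the rigid, exponential completion time against `ideal_completion` for different values of m shows why the slide asks for β much larger than ln k.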

40 Modeling Picky Jobs
- k: number of slots with property P requested by job
- n: number of slots in system with property P
- Examples of property P:
  - A slot running on a node storing a job's input block
  - A slot running on a node with GPU
- Job pickiness, p = k/n

41 Picky Jobs: Waiting Time
- Assume only one framework requests slots with property P
  - E.g., all other frameworks have achieved target allocation
- Property: the waiting time of a job with pickiness factor p is the p-percentile of the task duration distribution
- Example: if p = 0.9, the waiting time is equal to the 90th percentile of T
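The percentile rule on this slide can be made concrete in two ways: analytically for exponentially distributed task durations (one of the two distributions used in the analysis summary), and empirically from a list of observed durations. Both helper names are made up for this sketch, and the nearest-rank percentile is just one common convention.

```python
import math

def exponential_wait(p, mean_T):
    """p-quantile of an exponential task duration with mean mean_T."""
    return -mean_T * math.log(1.0 - p)

def empirical_wait(durations, p):
    """Nearest-rank p-percentile of observed task durations."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# Pickiness p = k/n = 0.9 with mean task duration T = 1
print(exponential_wait(0.9, 1.0))

# Hypothetical observed durations (seconds)
print(empirical_wait([12, 30, 45, 60], 0.5))
```

For exponential durations the waiting time grows as ln(1 / (1 - p)), so a job that needs nearly all slots with property P waits much longer than the mean task duration.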

42 Summary
Elastic  Picky  Homogeneous  Performance  Examples
Y        N      Y            Best         MapReduce, Spark
Y        N      N            Very Good    Web framework
Y        Y      Y            Very Good    GPU-based ML frameworks
Y        Y      N            Good
N        N      Y            Good         Storage services (e.g., memcached, dist. file systems)
N        N      N            Good         MPI
N        Y      Y            Good         Long-running services with special needs (e.g., DNS)
N        Y      N            Worst

43 Outline
- Global scheduler: Dominant Resource Fairness
- Analytic evaluation of distributed scheduler efficiency
- Experimental results

44 Implementation Stats
- 20,000 lines of C++
- Master failover using ZooKeeper
- Frameworks ported: Hadoop, MPI, Torque
- New specialized framework: Spark, for iterative jobs (up to 20x faster than Hadoop)
- Open source in Apache Incubator

45 Users
- Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing)
- Berkeley machine learning researchers are running several algorithms at scale on Spark
- Conviva is using Spark for data analytics
- UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

46 Dynamic Resource Sharing
- 100-node cluster

47 Mesos vs Static Partitioning
- Compared performance with a statically partitioned cluster where each framework gets 25% of nodes

Framework            Speedup on Mesos
Facebook Hadoop Mix  1.14
Large Hadoop Mix     2.10
Spark                1.26
Torque / MPI

48 Scalability
- 50,000 simulated nodes using 200 VMs
- 200 frameworks with 30-second tasks
- Measured: time to launch + get status update
[Figure: task launch overhead (seconds) vs. number of nodes]

49 Related Work
- HPC schedulers (e.g., Torque, LSF, Sun Grid Engine)
  - Coarse-grained sharing for rigid jobs (e.g., MPI)
- Virtual machine clouds
  - Coarse-grained sharing similar to HPC
- Condor
  - Centralized scheduler based on matchmaking
- Next-Generation Hadoop
  - Redesign of Hadoop to have per-application masters
  - Also aims to support non-MapReduce jobs
  - Based on resource request language with locality prefs

50 Conclusion
- Mesos shares clusters efficiently among diverse frameworks thanks to two design elements
  - Fine-grained sharing at the level of tasks
  - Resource offers, a scalable mechanism for application-controlled scheduling
- Enables co-existence of current frameworks and development of new specialized ones
- In use at Twitter, UC Berkeley, Conviva and UCSF
- http://
