Everything You Ever Wanted To Know About Resource Scheduling... Almost

1 Everything You Ever Wanted To Know About Resource Scheduling... Almost
Tim Hockin, Senior Staff Software Engineer

2 Who is thockin?
- Founding member of Kubernetes team
- Focused on storage, networking, node and other infrastructure things
- In my time at Google:
  - Worked on Borg & Omega
  - Machine management & monitoring
  - BIOS & Linux kernel
- I reserve the right to have an opinion, regardless of how wrong I probably am.

3 WARNING: Some of this presentation is aspirational

4 I posit: Kubernetes is fundamentally ABOUT resource management

5 CPU, Memory

6 CPU, Memory, Disk space, Disk time, Disk spindles

7 CPU, Memory, Disk space, Disk time, Disk spindles, Network bandwidth, Host ports

8 CPU, Memory, Disk space, Disk time, Disk spindles, Network bandwidth, Host ports, Cache lines, Memory bandwidth, IP addresses, Attached storage, PIDs, GPUs, Power

9 CPU, Memory, Disk space, Disk time, Disk spindles, Network bandwidth, Host ports, Cache lines, Memory bandwidth, IP addresses, Attached storage, PIDs, GPUs, Power, and arbitrary, opaque third-party resources we can't possibly predict

10 Mental model: Nodes produce capacity

    apiVersion: v1
    kind: Node
    status:
      capacity:
        cpu: "4"
        memory: Ki

11 Mental model: Pods consume capacity

    apiVersion: v1
    kind: Pod
    spec:
      containers:
      - resources:
          requests:
            cpu: 1500m
            memory: 3.75Gi

12 Mental model: Scheduler binds Pods to Nodes
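
The three mental-model slides can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single resource (RAM in GiB), a first-fit placement rule, and hypothetical names (`Node`, `schedule`) and node sizes (8 GiB) that do not come from the talk. The real scheduler filters and scores nodes on many more criteria.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node: produces capacity and tracks the requests bound to it."""
    name: str
    capacity_gib: int
    bound: List[int] = field(default_factory=list)

    def available(self) -> int:
        # Capacity that has not been promised to any pod yet.
        return self.capacity_gib - sum(self.bound)


def schedule(request_gib: int, nodes: List[Node]) -> Optional[str]:
    """Bind a pod to the first node whose free capacity covers its request."""
    for node in nodes:
        if request_gib <= node.available():
            node.bound.append(request_gib)
            return node.name
    return None  # no node fits: the pod stays Pending


nodes = [Node("node-a", 8), Node("node-b", 8)]
placements = [schedule(r, nodes) for r in (6, 4, 3, 4)]
```

With these assumed sizes the last 4 GiB pod finds no home even though 3 GiB are still free across the two nodes, so `placements` ends with `None`.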

13-18 Mental model: Representing resources (diagram build; labels: Node, RAM, Available)

19 A more correct representation (diagram; labels: Available, RAM)

20 Scheduling

21-30 Basic scheduling (diagram build; labels: Node A, Node B, RAM; blocks 6, 4, 3 and 4 are placed in turn, with "?" marking a block still to be placed)

31 Fragmentation (diagram; labels: Node A, Node B, RAM; blocks 6 and 4, one block Pending)

32-33 TODO: Optimizing rescheduler (diagram build; labels: Node A, Node B, RAM; blocks 6, 5 and 3)

34 Stranded resources (diagram; labels: Node A, Node B, RAM; blocks 6, 5 and 3) Can't be used: stranded!
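
Fragmentation and stranding are easy to confuse, so here is a small Python sketch of both. The node names, the free-capacity numbers and the helper names (`fits`, `stranded`) are invented for illustration; the only idea taken from the slides is that a pod must fit on one node in every dimension at once.

```python
def fits(free: dict, request: dict) -> bool:
    """A pod fits only if every resource it requests is available on the node."""
    return all(free.get(res, 0) >= amount for res, amount in request.items())


def stranded(free: dict, smallest_request: dict) -> dict:
    """Leftover resources on a node that cannot hold even the smallest pod."""
    if fits(free, smallest_request):
        return {}
    return {res: amount for res, amount in free.items() if amount > 0}


# Free capacity per node after earlier placements (made-up numbers).
free_by_node = {
    "node-a": {"cpu_m": 0, "ram_gib": 2},      # CPU exhausted, RAM left over
    "node-b": {"cpu_m": 2000, "ram_gib": 3},
}
pending = {"cpu_m": 500, "ram_gib": 4}

# Fragmentation: the cluster has enough RAM in total, but no single node fits.
total_free_ram = sum(f["ram_gib"] for f in free_by_node.values())
placeable = [name for name, f in free_by_node.items() if fits(f, pending)]

# Stranding: node-a's RAM is unusable because its CPU is gone.
lost = stranded(free_by_node["node-a"], {"cpu_m": 100, "ram_gib": 1})
```

Here `total_free_ram` is 5 GiB against a 4 GiB request, yet `placeable` is empty, and `lost` reports the 2 GiB of RAM that nothing can use.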

35 Many people are still asking the wrong questions.

36 How do I make sure my compute-intensive jobs don't get scheduled on my database machine? Images by Connie Zhou

37 Why would I want multiple replicas on a node? I want to use ALL of the memory. Images by Connie Zhou

38 How do I save some machines for important work, and use the rest for batch? Images by Connie Zhou

39 So... what should they be asking?

40 How do I make sure my compute jobs can't hurt my database job?

41 Isolation

42 How do I know how much memory and CPU my job needs?

43 Sizing

44 How do I safely pack more work onto fewer machines?

45 Utilization

46 Isolation

47 Isolation
- Prevent apps from hurting each other
- Make sure you actually get what you paid for
- Kubernetes (and Docker) isolate CPU and memory
- Don't handle things like memory bandwidth, disk time, cache, network bandwidth, ... (yet)
- Predictability at the extremes is paramount

48 When does isolation matter?
- Infinite loops
- Memory leaks
- Disk hogs
- Fork bombs
- Cache thrashing

49 Counter-measures
- Infinite loops: CPU shares and quota
- Memory leaks: OOM yourself
- Disk hogs: Quota
- Fork bombs: Process limits
- Cache thrashing: LLC jails, cache segments

50 Counter-measures: work to do
- Infinite loops: CPU shares and quota
- Memory leaks: OOM yourself
- Disk hogs: Quota
- Fork bombs: Process limits
- Cache thrashing: LLC jails, cache segments

51 Resource taxonomy
- Compressible resources
  - Hold no state
  - Can be taken away very quickly
  - Merely cause slowness when revoked
  - e.g. CPU, disk time
- Non-compressible resources
  - Hold state
  - Are slower to be taken away
  - Can fail to be revoked
  - e.g. Memory, disk space

52 Requests and limits
- Request: amount of a resource allowed to be used, with a strong guarantee of availability
  - CPU (seconds/second), RAM (bytes)
  - Scheduler will not over-commit requests
- Limit: max amount of a resource that can be used, regardless of guarantees
  - Scheduler ignores limits
- Repercussions:
  - request < usage <= limit: resources might be available
  - usage > limit: throttled or killed

53 Quality of service
- Guaranteed: highest protection
  - limit == request
- Burstable: medium protection
  - request > 0 && limit > request
- Best Effort: lowest protection
  - request == 0
- How is protection implemented?
  - CPU: cgroup shares & quota
  - Memory: OOM score + user-space evictions
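
The three classes on this slide translate directly into a small Python function. Treat it as a sketch of the slide's rule for one resource of one container: the function name `qos_class` and the use of `math.inf` for "no limit set" are assumptions of this example, and Kubernetes itself derives the class across all containers and resources of a Pod.

```python
import math


def qos_class(request: float, limit: float = math.inf) -> str:
    """Classify a single resource using the rules on the slide.

    limit=math.inf stands in for "no limit set" (an assumption of this sketch).
    """
    if limit < request:
        raise ValueError("limit must not be below request")
    if request == 0:
        return "BestEffort"   # lowest protection
    if limit == request:
        return "Guaranteed"   # highest protection
    return "Burstable"        # request > 0 and limit > request
```

For example, `qos_class(1.5, 1.5)` gives `"Guaranteed"`, `qos_class(1.5, 2.0)` gives `"Burstable"`, and `qos_class(0)` gives `"BestEffort"`.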

54 Requests and limits
- Behavior at (or near) the limit depends on the particular resource
- Compressible resources: throttle usage
  - e.g. No more CPU time for you!
- Non-compressible resources: reclaim
  - e.g. Write-back and reallocate dirty pages
  - Failure means process death (OOM)
- Being correct is more important than optimal
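
Putting the request/limit repercussions together with the compressible vs. non-compressible split gives a small decision table. The Python below is a hedged sketch: the function name `outcome`, the resource keys and the returned strings are made up for illustration, and real enforcement happens in the kernel and the kubelet rather than in one function.

```python
# Assumption: membership follows the resource taxonomy described in the talk.
COMPRESSIBLE = {"cpu", "disk_time"}
NON_COMPRESSIBLE = {"memory", "disk_space"}


def outcome(resource: str, demand: float, request: float, limit: float) -> str:
    """What happens when a container asks for `demand` of a resource."""
    if resource not in COMPRESSIBLE | NON_COMPRESSIBLE:
        raise ValueError(f"unknown resource: {resource}")
    if demand <= request:
        return "guaranteed"
    if demand <= limit:
        return "granted if spare capacity exists"
    if resource in COMPRESSIBLE:
        return "throttled"  # merely slower, no state is lost
    return "reclaim, then kill on failure"  # e.g. OOM for memory
```

A container with a 1.5 CPU request and a 2 CPU limit that tries to burn 3 CPUs is throttled; the same overshoot on memory ends in reclaim and, if that fails, process death.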

56 Coupled resources
Example: memory
1. Try to allocate, fail
2. Find some clean pages to release (consumes CPU)
3. Write-back some dirty pages (consumes disk time)
4. If necessary, repeat this on another container
- How long should this be allowed to take?
- Really: this should be happening all the time
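
The reclaim steps above can be modelled as a toy loop in Python. This is not how the kernel implements reclaim; it is a sketch with invented names (`Container`, `reclaim`) that only shows why freeing memory also spends CPU (dropping clean pages) and disk time (writing back dirty pages), possibly in more than one container.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Container:
    name: str
    clean_pages: int  # can be dropped right away (costs CPU)
    dirty_pages: int  # must be written back first (costs disk time)


def reclaim(needed: int, containers: List[Container]) -> Tuple[int, list]:
    """Try to free `needed` pages; return (pages freed, log of steps taken)."""
    freed, log = 0, []
    for c in containers:
        if freed >= needed:
            break
        take = min(c.clean_pages, needed - freed)
        if take:
            c.clean_pages -= take
            freed += take
            log.append((c.name, "drop-clean", take))
        take = min(c.dirty_pages, needed - freed)
        if take:
            c.dirty_pages -= take
            freed += take
            log.append((c.name, "write-back", take))
    return freed, log


freed, steps = reclaim(7, [Container("a", 2, 3), Container("b", 10, 0)])
```

If `freed` comes back smaller than `needed`, the allocation cannot be satisfied, which in the talk's terms means process death (OOM).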

57 What if I don't specify?
- You get best-effort isolation
- You might get defaulted values
- You might get OOM killed randomly
- You might get starved
- You might get no isolation at all

58 Sizing

59 Sizing
- How many replicas does my job need?
- How much CPU/RAM does my job need?
- Do I provision for worst-case?
  - Expensive, wasteful
- Do I provision for average case?
  - High failure rate (e.g. OOM)
- Benchmark it!

60 Benchmarks are hard.

61 Benchmarks are hard. Accurate benchmarks are VERY hard.

62 Horizontal scaling
- Add more replicas
- Easy to reason about
- Works well when combined with resource isolation
- Having >1 replica per node makes sense
- Not always applicable...
  - e.g. Memory use scales with cluster size
- HorizontalPodAutoscaler
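
For a feel of what the HorizontalPodAutoscaler computes, here is the core scaling ratio from the Kubernetes documentation as a Python sketch. The function name and the `min_replicas`/`max_replicas` defaults are assumptions of this example, and the real controller adds a tolerance band and other safeguards that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """desired = ceil(current * currentMetric / targetMetric), then clamped."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

Four replicas averaging 90% CPU against a 60% target scale out to six; three replicas at 40% against an 80% target scale in to two.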

63 What can we do?
- Horizontal scaling is not enough
- Resource needs change over time
- If only we had an autopilot mode...
  - Collect stats & build a model
  - Predict and react
  - Manage Pods, Deployments, Jobs
  - Try to stay ahead of the spikes
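
One very simple reading of "collect stats & build a model" is to size the request from a high percentile of observed usage plus some headroom. The Python below is only a sketch of that idea: the nearest-rank percentile, the 95th-percentile default and the 15% headroom are assumptions chosen for illustration, not what Borg's autopilot or any Kubernetes component actually does.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of the observed usage samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(samples: Sequence[float], pct: int = 95,
                      headroom: float = 1.15) -> int:
    """Suggest a request: the chosen percentile of usage plus headroom."""
    return math.ceil(percentile(samples, pct) * headroom)


usage_millicores = list(range(1, 101))  # stand-in for real usage samples
suggested = recommend_request(usage_millicores)
```

Re-running this as new samples arrive is the "predict and react" part; a higher percentile or larger headroom trades utilization for safety.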

64 Autopilot in Borg
- Most Borg users use autopilot
- See earlier statement regarding benchmarks - even at Google
- Kubernetes API is purpose-built for this sort of use-case
- We need a VerticalPodAutoscaler

65 Utilization

66 Utilization
- Resources cost money
- Wasted resources == wasted money
- You NEED to use as much of your capacity as possible
- Selling it is not the same as using it

67 What is YOUR average utilization?

68 How can we do better?
- Utilization demands isolation
  - If you want to push the limits, it has to be safe at the extremes
- People are inherently cautious
  - Provision for 90%-99% case
- VPA & strong isolation should give enough confidence to provision more tightly
- We need to do some kernel work, here

69 Some lessons from Borg
- Priority
  - Low-priority jobs get paused/killed in favor of high-priority jobs
- Quota
  - If everyone is important, nobody is important
- Overcommit
  - Hedge against rare events with lower QoS/SLA for some work

70 Overcommit
- Build a model of recent real usage per-container
- The delta between request and reality is idle -- resell it with a lower SLA
- First-tier apps can always get what they paid for - kill second-tier apps
- Use stats to decide how aggressive to be
- Let the priority system deal with the debris
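
The arithmetic behind overcommit is small enough to show in Python. This is a sketch under stated assumptions: each container is described by a made-up `(request, recent peak usage)` pair in millicores, and `aggressiveness` is a hypothetical knob standing in for "use stats to decide how aggressive to be".

```python
from typing import List, Tuple


def resellable(containers: List[Tuple[int, int]],
               aggressiveness: float = 0.5) -> int:
    """Capacity that could be offered to lower-SLA work.

    containers: (request, recent_peak_usage) pairs for first-tier apps.
    """
    # Only the gap between request and real usage is idle; a container
    # running above its request contributes nothing.
    idle = sum(max(0, request - peak) for request, peak in containers)
    return int(idle * aggressiveness)


first_tier = [(4000, 1000), (2000, 1800), (1000, 1200)]
second_tier_budget = resellable(first_tier)
```

Whatever is resold here comes with a lower SLA: when first-tier usage climbs back toward its requests, the second-tier work is what gets killed.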

71 Siren's song: over-packing
- Clusters need some room to operate
  - Nodes fail or get upgraded
- As you approach 100% bookings (requests), consider what happens when things go bad
  - Nowhere to squeeze the toothpaste!
- Plan for some idle capacity - it will save your bacon one day
- Priorities & rescheduling can make this less expensive
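
A quick way to put a number on "some idle capacity" is to ask whether the current bookings would still fit if the largest node disappeared. The Python sketch below uses hypothetical node names and sizes, and it is only a necessary condition: it ignores fragmentation, so passing the check does not prove every pod can be rescheduled.

```python
from typing import Dict


def survives_node_loss(node_capacity: Dict[str, int], total_requests: int) -> bool:
    """True if all requests still fit, in aggregate, without the largest node."""
    remaining = sum(node_capacity.values()) - max(node_capacity.values())
    return total_requests <= remaining


cluster = {"node-a": 16, "node-b": 16, "node-c": 16, "node-d": 16}  # cores
tight = survives_node_loss(cluster, 60)  # booked at ~94%
safe = survives_node_loss(cluster, 45)   # booked at ~70%
```

With four equal nodes the cluster can be booked to at most 75% and still absorb one failure; priorities and rescheduling let lower-tier work occupy the remainder at its own risk.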

72 Wrapping up

73 WARNING: Some of this presentation was aspirational

74 We still have a LONG WAY to go. Fortunately, this is a path we've been down before.

75 Kubernetes is Open
- open community
- open source
- open design
- open to ideas
Code: github.com/kubernetes/kubernetes
Chat: slack.k8s.io
