TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters

1 TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters
Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael Kozuch, Mor Harchol-Balter, Gregory R. Ganger
PARALLEL DATA LABORATORY, Carnegie Mellon University

2 Motivation
Spark Analytics (by 5pm): 100xCPU1 for 80min -OR- 100xCPU2 for 40min
Deep Learning (by 3pm): 10xGPU for 20min -OR- 100xCPU2 for 40min
Long Running
[Diagram: heterogeneous cluster with FPGA nodes and a 10Gbps backbone]

3 Heterogeneity Amplifies Options
Jobs: Anti-Affinity (BE), MPI (by t=3), GPU (by t=3)
[Diagram: three space-time grids of machines m1-m4 in rack1 and rack2 over time]

4 Exploiting Time Flexibility (Plan-ahead)
Previously: just look at current state
[Diagram: rack1 and rack2 over time]

5 Exploiting Time Flexibility (Plan-ahead)
Previously: just look at current state
Plan-ahead: estimate runtimes and choices
[Diagram: rack1 and rack2 over time]

6 Exploiting Time Flexibility (Plan-ahead)
Previously: just look at current state
Plan-ahead: estimate runtimes and choices
Should this job wait for better placement?
[Diagram: rack1 and rack2 over time]

7 Key Challenges
Leverage runtime estimates robustly
Express combinatorially many options succinctly
  Including quantifying their relative merit
Exploit this knowledge to improve allocation
All in a practical manner

8 Quantify: Internal Utility Functions
SLO jobs: zero value after deadline
BE jobs: lower value for longer completion time
Higher value for more important jobs (U > u)
[Figure: utility vs. completion time for an SLO job (value U, with deadline) and a best-effort job (value u)]
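The two utility shapes on this slide can be sketched in a few lines. This is only an illustration: the slide does not give the best-effort curve, so the linear decay and the `horizon` parameter below are assumptions, as are the function names.

```python
def slo_utility(completion_time, deadline, value):
    """SLO job: full value if it completes by the deadline, zero afterwards."""
    return value if completion_time <= deadline else 0.0


def best_effort_utility(completion_time, value, horizon):
    """Best-effort job: value shrinks as completion time grows.

    Assumption: linear decay reaching zero at `horizon`; the slide only
    says "lower value for longer completion time".
    """
    if completion_time >= horizon:
        return 0.0
    return value * (1.0 - completion_time / horizon)


# U > u: the SLO job is worth more than the best-effort job.
U, u = 10.0, 4.0
print(slo_utility(3, deadline=3, value=U))   # completes on time
print(slo_utility(4, deadline=3, value=U))   # misses the deadline
print(best_effort_utility(2, value=u, horizon=8))
```

A scheduler that maximizes the sum of such utilities prefers meeting SLO deadlines and, among best-effort jobs, finishing them sooner.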

9 Express: Space-Time Request Language
Utility u(p,t): placement → scalar value u
n Choose k (nCk):
  nCk({m_i ∈ rack1}, k=2, s=1, d=2, u)
  nCk({m_i}, k=2, s=0, d=4, u/2)
n → refers to a group of nodes to choose from
k → how many nodes to choose
[Diagram: machines m1-m4 in rack1 and rack2 over time]

10 Express: Space-Time Request Language
Utility u(p,t): placement → scalar value u
  nCk({m_i ∈ rack1}, k=2, s=1, d=2, u)
[Diagram: machines m1-m4 in rack1 and rack2 over time]
[Figure: value vs. time; value u at t=3]

11 Express: Space-Time Request Language
Utility u(p,t): placement → scalar value u
  nCk({m_i ∈ rack1}, k=2, s=1, d=2, u)
  nCk({m_i}, k=2, s=0, d=4, u/2)
[Diagram: machines m1-m4 in rack1 and rack2 over time]
[Figure: value vs. time; value u/2 at t=4]

12 Express: Space-Time Request Language
Utility u(p,t): placement → scalar value u
  nCk({m_i ∈ rack1}, k=2, s=1, d=2, u)
  nCk({m_i}, k=2, s=0, d=4, u/2)
[Diagram: machines m1-m4 in rack1 and rack2 over time]
[Figure: value vs. time; value u at t=3 and value u/2 at t=4]
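As a rough illustration of the nCk primitive, the sketch below models one request as a small Python record and lists the node subsets that could satisfy it. It reads s as the start time and d as the duration, which matches the completion times 3 and 4 on the slides. The class name `NCk`, the `free` availability map, and the assignment of m1, m2 to rack1 are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class NCk:
    nodes: frozenset   # group of nodes to choose from
    k: int             # how many nodes to choose
    start: int         # s: first time slot used
    duration: int      # d: number of time slots used
    value: float       # utility if this option is granted

    @property
    def end(self):
        return self.start + self.duration

    def satisfying_placements(self, free):
        """Return every k-subset of nodes that is free for all of [start, end).

        `free` maps a node name to the set of time slots in which it is idle.
        """
        needed = set(range(self.start, self.end))
        usable = sorted(n for n in self.nodes if needed <= free.get(n, set()))
        return list(combinations(usable, self.k))


u = 4.0
rack1 = frozenset({"m1", "m2"})                  # assumed rack membership
all_nodes = frozenset({"m1", "m2", "m3", "m4"})
fast = NCk(rack1, k=2, start=1, duration=2, value=u)          # done at t=3
slow = NCk(all_nodes, k=2, start=0, duration=4, value=u / 2)  # done at t=4

# m1 is busy during slot 0 and idle afterwards; the other nodes are idle.
free = {"m1": {1, 2, 3}, "m2": {0, 1, 2, 3}, "m3": {0, 1, 2, 3}, "m4": {0, 1, 2, 3}}
print(fast.end, fast.satisfying_placements(free))
print(slow.end, slow.satisfying_placements(free))
```

Here the rack-local option can still be granted by waiting one slot for m1, while the immediate option has to avoid m1 entirely.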

13 STRL Expression Composition
max (OR) over:
  nCk({m_i ∈ rack1}, k=2, s=1, d=2, u)
  nCk({m_i ∈ rack2}, k=2, s=1, d=2, u)
  nCk({m_i}, k=2, s=0, d=4, u/2)
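One way to picture composition is as an expression tree whose leaves are nCk options and whose inner nodes are operators. The sketch below evaluates such a tree for a given set of granted leaves. It assumes the usual reading of the operators (max picks the best satisfied alternative, min requires every child, sum adds children) and ignores resource contention. `Leaf`, `Op`, and `evaluate` are illustrative names, not TetriSched's API.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    name: str      # label for one nCk option
    value: float   # utility if the option is granted


@dataclass(frozen=True)
class Op:
    kind: str                      # "max", "min", or "sum"
    children: Tuple["Expr", ...]


Expr = Union[Leaf, Op]


def evaluate(expr, granted):
    """Value of `expr` when exactly the leaves named in `granted` are satisfied."""
    if isinstance(expr, Leaf):
        return expr.value if expr.name in granted else 0.0
    values = [evaluate(child, granted) for child in expr.children]
    if expr.kind == "max":
        return max(values)   # OR: best satisfied alternative
    if expr.kind == "min":
        return min(values)   # AND: zero unless every child is satisfied
    if expr.kind == "sum":
        return sum(values)   # aggregate independent requests
    raise ValueError(f"unknown operator: {expr.kind}")


u = 4.0
job = Op("max", (Leaf("rack1@t=1", u), Leaf("any@t=0", u / 2)))
print(evaluate(job, {"any@t=0"}))                # only the fallback is granted
print(evaluate(job, {"rack1@t=1", "any@t=0"}))   # preferred option wins
print(evaluate(job, set()))                      # nothing granted
```

Wrapping several such job expressions in a single `sum` node gives one global objective for all pending jobs.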

14 TetriSched System Architecture
[Diagram: components shown are frameworks (jobs, resources), Perforator, Reservation System, YARN (ResAlloc, ProxyScheduler, CapacityScheduler), STRL Generator with framework plugins (MPI, GPU, HA, ...), and TetriSched Scheduler Core (STRL Compiler, MILP solver). Flow labels: resource request; runtime, deadline; job schedule and placement; objective function; supply/demand constraints. STRL operators shown: nCk, min, max, sum. Axes of the allocation grid: resources, time.]

15 TetriSched System Architecture
[Diagram: same architecture diagram as slide 14, next animation step]

16 TetriSched Scheduler Core
Compile to MILP
Aggregate (sum, max, min)
Global scheduling
Adaptive plan-ahead
Solve MILP
Cache Results
Interpret Results
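To make the idea of global scheduling with plan-ahead concrete, the sketch below replaces the MILP solver with a brute-force search. It tries every combination of "one option or nothing" per job, rejects combinations that exceed capacity in any time slot, and keeps the highest total value. This is a stand-in that is only workable for tiny inputs, not the MILP formulation from the talk. The option tuple layout, the pool names, and the `schedule` function are assumptions made for illustration.

```python
from itertools import product


def schedule(jobs, capacity, horizon):
    """Exhaustively pick at most one option per job to maximize total value.

    jobs:     {job: [(pool, k, start, duration, value), ...]}
    capacity: {pool: number of nodes}
    horizon:  plan-ahead window; options ending after it are rejected
    """
    names = list(jobs)
    best_value, best_choice = 0.0, {name: None for name in names}
    for choice in product(*[[None] + list(jobs[n]) for n in names]):
        used = {}          # (pool, time slot) -> nodes in use
        feasible = True
        for opt in choice:
            if opt is None:
                continue
            pool, k, start, duration, _ = opt
            if start + duration > horizon:
                feasible = False
                break
            for t in range(start, start + duration):
                used[(pool, t)] = used.get((pool, t), 0) + k
                if used[(pool, t)] > capacity.get(pool, 0):
                    feasible = False
        if not feasible:
            continue
        total = sum(opt[4] for opt in choice if opt is not None)
        if total > best_value:
            best_value, best_choice = total, dict(zip(names, choice))
    return best_value, best_choice


capacity = {"gpu": 2, "cpu": 2}
jobs = {
    "A": [("gpu", 2, 0, 2, 10.0), ("cpu", 2, 0, 4, 5.0)],
    "B": [("gpu", 2, 0, 2, 8.0), ("gpu", 2, 2, 2, 6.0)],  # second option waits
}
value, plan = schedule(jobs, capacity, horizon=4)
print(value, plan)
```

In this toy input, deferring job B until the GPUs free up beats pushing job A onto CPUs, which is the kind of trade-off the plan-ahead window exposes. Shrinking `horizon` removes the deferred option from consideration.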

17 Experimental Results
Real cluster: 256 nodes (8 racks, 32 nodes/rack)
Workloads:
  Production derived: Facebook SLO + Yahoo BE job mix
Systems compared:
  Rayon / CapacityScheduler
  Rayon / TetriSched
Takeaway: With global rescheduling and plan-ahead, TetriSched outperforms the Rayon/CapacityScheduler stack

18 Leverage runtime estimates robustly
[Figure: SLO attainment (%) vs. estimate error (%); series: Rayon/CS, TetriSched]

19 Leverage runtime estimates robustly
[Figure: SLO attainment (%) vs. estimate error (%); series: Rayon/CS, TetriSched]
Achieves high SLO attainment with 2x error

20 Leverage runtime estimates robustly
[Figure: mean latency (s) vs. estimate error (%); series: Rayon/CS, TetriSched; annotated gaps: 3x, 5x]

21 Leverage runtime estimates robustly
[Figure: mean latency (s) vs. estimate error (%); series: Rayon/CS, TetriSched]
Achieves lower best-effort latency

22 Leverage runtime estimates robustly
[Figures: SLO attainment (%) and mean latency (s), each vs. estimate error (%); series: Rayon/CS, TetriSched]
Exploits runtime estimates better & more robustly

23 Benefit from plan-ahead and global
[Figure: SLO attainment (%) vs. plan-ahead (s); series: Rayon/CS, TetriSched, TetriSched-NG]

24 Benefit from plan-ahead and global
[Figure: SLO attainment (%) vs. plan-ahead (s); series: Rayon/CS, TetriSched, TetriSched-NG]
Plan-ahead adds 2.5x in SLO attainment
Global scheduling further improves performance

25 Takeaway Results
Each primary TetriSched feature is needed:
  Soft constraint support yields 2x over baseline
  Plan-ahead support yields 2.5x over baseline
  Global scheduling yields 40% over baseline
Scales to sizeable clusters:
  256-node real cluster
  1000-node simulated cluster

26 Conclusions
Modern clusters induce complex scheduling tradeoffs
  Scheduling is hard → must schedule harder!
TetriSched:
  General support for space-time preferences (STRL)
  Leveraging runtime estimates robustly (plan-ahead)
  Global scheduling (STRL+MILP)
Key takeaway result:
  Significantly higher SLO attainment, lower latency
