Packing Tasks with Dependencies. Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni
1 Packing Tasks with Dependencies Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni
2-7 The Cluster Scheduling Problem
Jobs, Tasks. Goal: match tasks to resources to achieve: high cluster utilization; fast job completion; guarantees (deadlines, fair shares)
Constraints: scale ("fast twitch"); large and high-value deployments, e.g., Spark, Yarn*, Mesos*, Cosmos
Today, schedulers are simple and (as we show) performance can improve a lot
8-11 Jobs have heterogeneous DAGs
User queries -> query optimizer -> job DAG (Dryad, Spark-SQL, Hive, ...)
[Figure: example DAG; legend shows #tasks per stage, task duration (0s to 200s), and edge types (need shuffle vs. can be local)]
DAGs have deep and complex structures; task durations range from <1s to >100s; tasks use different amounts of resources
12-18 Challenges in scheduling heterogeneous DAGs
[Figure: example DAG with six stages t0-t5; per-stage duration/demand labels garbled in transcription]
CPSched (critical-path scheduling) cannot overlap tasks with complementary demands
Packers [1] do not handle dependencies (n tasks, d resources)
[1] Tetris: Multi-resource Packing for Cluster Schedulers, SIGCOMM'14
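The complementary-demands point above can be made concrete with a toy check (a sketch, not the paper's code; the task dictionaries and `can_overlap` helper are illustrative assumptions): a CPU-heavy task and a network-heavy task fit on one machine together, so a packer overlaps them, while a critical-path order may serialize them.

```python
# Sketch: two tasks can overlap on a machine if their combined
# per-resource demand fits within capacity.
def can_overlap(t1, t2, capacity):
    return all(t1.get(r, 0) + t2.get(r, 0) <= cap
               for r, cap in capacity.items())

# Complementary demands: together they use ~100% of each resource.
cpu_task = {"cpu": 0.9, "net": 0.1}
net_task = {"cpu": 0.1, "net": 0.9}
```

Two CPU-heavy tasks would fail the same check, which is why overlapping the right pairs matters for utilization.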
19-24 Challenges in scheduling heterogeneous DAGs
1. Simple heuristics lead to poor schedules
2. Production DAGs are roughly 50% slower than lower bounds
3. Even simple variants of packing dependent tasks are NP-hard
4. Prior analytical solutions miss some practical concerns: multiple resources; complex dependencies; machine-level fragmentation; scale; online arrivals
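A common way to get a completion-time lower bound for a DAG (a sketch under assumptions; the paper's "new lower bound" is more refined than this) is the maximum of the critical-path duration and the per-resource total work divided by cluster capacity:

```python
# Sketch: lower bound on DAG completion time =
# max(critical path, max over resources of total work / capacity).
def dag_lower_bound(tasks, edges, capacity):
    """tasks: {name: (duration, {resource: demand})};
    edges: list of (parent, child); capacity: {resource: total}."""
    children = {t: [] for t in tasks}
    for p, c in edges:
        children[p].append(c)
    memo = {}
    def longest(t):
        # Longest duration-weighted path starting at t (memoized DFS).
        if t not in memo:
            dur, _ = tasks[t]
            memo[t] = dur + max((longest(c) for c in children[t]), default=0.0)
        return memo[t]
    critical_path = max(longest(t) for t in tasks)
    # Even a perfect packer needs this long on the bottleneck resource.
    resource_bound = max(
        sum(dur * dem.get(r, 0.0) for dur, dem in tasks.values()) / cap
        for r, cap in capacity.items())
    return max(critical_path, resource_bound)
```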
25 Given an annotated DAG and available resources, compute a good schedule + practical model
26-30 Main ideas for one DAG
[Figure: resources-vs-time schedule]
Existing schedulers: a task is schedulable after all its parents have finished
Graphene: identifies troublesome tasks and places them first, then places the other tasks around the trouble
31-34 Does placing troublesome tasks first help? Revisit the example
[Figure: the example DAG and the resulting resources-vs-time schedule; labels garbled in transcription]
If the troublesome tasks are the long-running tasks, Graphene ≈ OPT
35-40 How to choose troublesome tasks T?
Optimal choice is intractable (recall: NP-hard)
Graphene: pick T by task duration ≥ l and/or stage fragmentation score ≥ f, then BuildSchedule(T)
Vary l and f, and pick the most compact schedule
Extensions: 1) explore different choices of T in parallel, 2) recurse, 3) memoize
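The threshold sweep above can be sketched as follows (an illustrative sketch, not the paper's implementation; `build_schedule` stands in for the placement routine, and the tuple encoding of tasks is an assumption):

```python
# Sketch: sweep duration threshold l and fragmentation threshold f,
# build a schedule for each resulting troublesome set T, and keep
# the most compact one.
import itertools

def pick_troublesome(tasks, build_schedule, l_choices, f_choices):
    """tasks: {name: (duration, frag_score)};
    build_schedule(T) -> makespan of the schedule that places T first."""
    best = (float("inf"), frozenset())
    for l, f in itertools.product(l_choices, f_choices):
        T = frozenset(t for t, (dur, frag) in tasks.items()
                      if dur >= l or frag >= f)
        best = min(best, (build_schedule(T), T))
    return best  # (makespan of most compact schedule, its T)
```

The extensions on the slide map naturally onto this loop: run the iterations in parallel, recurse on sub-DAGs, and memoize `build_schedule` results.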
41-52 Schedule dead-ends
[Figure: troublesome tasks T placed first in the resources-vs-time plane, then parents P, children C, and the remaining tasks S placed around them]
1) Since some parents and children of the remaining tasks are already placed with T, those tasks may not be placeable; S = tasks outside TransitiveClosure(T)
2) When placing tasks in P, have to go backwards in time (place a task only after all of its children are placed)
Which of these orders are legit? T_fb P_b S_f C_f; T_fb P_b C_f S_f; T_fb S_fb P_b C_f (f = placed forward in time, b = backward)
Graphene explores all orders and avoids dead-ends
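The order exploration can be sketched as a simple search (a minimal sketch, assuming a `place` routine that returns placements or None on a dead-end; both the routine and the order encoding are illustrative, not the paper's code):

```python
# Sketch: try candidate subset orders (each subset tagged with a
# forward/backward placement direction); return the first order
# that completes without hitting a dead-end.
def explore_orders(orders, place):
    """orders: list of orders, each a list of (subset, direction);
    place(subset, direction, schedule_so_far) -> placements or None."""
    for order in orders:
        schedule = []
        ok = True
        for subset, direction in order:
            placed = place(subset, direction, schedule)
            if placed is None:      # dead-end: abandon this order
                ok = False
                break
            schedule.extend(placed)
        if ok:
            return order, schedule
    return None, None
```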
53 Main ideas for one DAG
[Figure: the resulting schedule with T, P, C, and S placed in the resources-vs-time plane]
1. Identify troublesome tasks and place them first
2. Systematically place tasks to avoid dead-ends
54 We have computed an offline schedule for one DAG; production clusters have multiple DAGs
55-66 Convert offline schedule to priority order on tasks
[Figure: example DAG with stages t0-t5; the offline schedule yields the priority order t1, t3, t4, then t0, t2, t5]
[Figure: runtime timelines with a competing Job2; enforcing the priority order finishes in about 2T, whereas CPSched takes about 4T]
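The conversion step can be sketched in a few lines (an illustrative sketch, not Graphene's code; `planned_starts` and the helper names are assumptions): sort tasks by their planned start times in the offline schedule, and at runtime dequeue ready tasks in that order.

```python
# Sketch: derive a total priority order from the offline schedule's
# planned start times, then use it to break ties among ready tasks.
def schedule_to_priority(planned_starts):
    """planned_starts: {task: start time in the offline schedule}."""
    order = sorted(planned_starts, key=lambda t: planned_starts[t])
    return {task: rank for rank, task in enumerate(order)}

def next_task(ready, priority):
    """Among currently runnable tasks, pick the highest-priority one."""
    return min(ready, key=lambda t: priority[t])
```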
67-71 Main ideas for multiple DAGs
1) Convert offline schedule to priority order on tasks
2) Online, enforce schedule priority along with heuristics for (a) multi-resource packing, (b) SRPT (shortest remaining processing time) to lower average job completion time, (c) a bounded amount of unfairness, (d) overbooking
Schedule priority is best for one DAG; is it best overall? Trade-offs: packing may lose efficiency, SRPT may delay some jobs' completion, and fairness may counteract all the others
We show that {best performance with bounded unfairness} ≈ best performance
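One way to combine these signals online (a hedged sketch; the weights, score shapes, and task encoding are assumptions, not the paper's actual combiner) is to score each runnable task with a weighted mix of a Tetris-style packing fit, the schedule priority, and remaining job work:

```python
# Sketch: score runnable tasks by packing fit (dot product of demand
# and free resources), schedule priority, and SRPT; pick the best.
def pick(ready, machine_free, weights):
    """ready: [{'prio': rank, 'demand': {r: amt},
                'remaining_job_work': w, ...}];
    machine_free: {resource: free amount}."""
    def score(t):
        fits = all(t['demand'].get(r, 0) <= free
                   for r, free in machine_free.items())
        if not fits:
            return float('-inf')   # cannot run here now
        pack = sum(t['demand'].get(r, 0) * free
                   for r, free in machine_free.items())
        # Lower priority rank and less remaining work are better.
        return (weights['pack'] * pack
                - weights['prio'] * t['prio']
                - weights['srpt'] * t['remaining_job_work'])
    return max(ready, key=score)
```

Bounded unfairness and overbooking would act as additional filters on `ready` and `machine_free` in the same loop.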
72-75 Graphene summary & implementation
1) Offline, schedule each DAG by placing troublesome tasks first
2) Online, enforce priority over tasks along with other heuristics
[Architecture: per-job AMs run a Schedule Constructor over their DAGs; the RM combines the priority order + packing + bounded unfairness + overbooking; NMs report via node heartbeats and receive task assignments]
76 Implementation details: DAG annotations; bundling (improves schedule quality without hurting scheduling latency); co-existence with (many) other scheduler features
77 Evaluation. Prototype: 200-server multi-core cluster; TPC-DS, TPC-H, and GridMix to replay traces; jobs arrive online. Simulations: traces from production Microsoft Cosmos and Yarn clusters; compare with many alternatives
78-79 Results 1 [20K DAGs from Cosmos]
[Figure: Packing vs. Packing + Deps. vs. the lower bound; annotations read 1.5X and 2X]
80-81 Results 2 [200 jobs from TPC-DS, 200-server cluster]
[Figure: Tez + Packing vs. Tez + Packing + Deps.]
82-85 Conclusions
Scheduling heterogeneous DAGs well requires an online solution that handles multiple resources and dependencies
Graphene: offline, construct a per-DAG schedule by placing troublesome tasks first; online, enforce the schedule priority along with other heuristics
A new lower bound shows Graphene is nearly optimal for half of the DAGs
Experiments show gains in job completion time, makespan, ...
Graphene generalizes to DAGs in other settings
86-89 When the scheduler works with erroneous task profiles
[Figure: fractional change in JCT vs. amount of error; data values garbled in transcription]
90 DAG annotations. Graphene uses per-stage average duration and demands of {cpu, mem, net, disk}. 1) Almost all frameworks have users annotate cpu and mem. 2) Recurring jobs [1] have predictable profiles (correcting for input size). 3) Ongoing work on building profiles for ad-hoc jobs: sample and project [2]; program analysis [3]. [1] RoPE, NSDI'12; [2] Perforator, SOCC'16; [3] SPEED, POPL'09
91 Using Graphene to schedule other DAGs
92 Characterizing DAGs in Cosmos clusters
94 Runtime of production DAGs
95 Job completion times on different workloads
96 Makespan Fairness
97 Comparison with other alternatives
98 Online Pseudocode
Graphene: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni. Microsoft and University of Wisconsin-Madison
Tatsuhiro Chiba, Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii IBM Research IBM Research 2 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs à à Application
More informationDynamic Query Re-Planning using QOOP
ynamic Query Re-Planning using QOOP Kshiteej Mahajan, UW-Madison; Mosharaf Chowdhury, U. Michigan; ditya kella and Shuchi Chawla, UW-Madison https://www.usenix.org/conference/osdi18/presentation/mahajan
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationTitan: Fair Packet Scheduling for Commodity Multiqueue NICs. Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017
Titan: Fair Packet Scheduling for Commodity Multiqueue NICs Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift July 13 th, 2017 Ethernet line-rates are increasing! 2 Servers need: To drive increasing
More informationCSL373: Lecture 6 CPU Scheduling
CSL373: Lecture 6 CPU Scheduling First come first served (FCFS or FIFO) Simplest scheduling algorithm cpu cpu 0 0 Run jobs in order that they arrive Disadvantage: wait time depends on arrival order. Unfair
More informationMagpie: Distributed request tracking for realistic performance modelling
Magpie: Distributed request tracking for realistic performance modelling Rebecca Isaacs Paul Barham Richard Mortier Dushyanth Narayanan Microsoft Research Cambridge James Bulpin University of Cambridge
More informationPaaS and Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University
PaaS and Hadoop Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University laiping@tju.edu.cn 1 Outline PaaS Hadoop: HDFS and Mapreduce YARN Single-Processor Scheduling Hadoop Scheduling
More informationCSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)
CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that
More informationShark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko
Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines
More informationOracle Enterprise Manager 12c IBM DB2 Database Plug-in
Oracle Enterprise Manager 12c IBM DB2 Database Plug-in May 2015 Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More information1 Introduction. 1 Only tasks on a failed machine needed to be re-executed. 2 E.g., it is impossible to achieve data-local reduce task placement.
F 2 : Reenvisioning Data Analytics Systems Robert Grandl, Arjun Singhvi, Raajay Viswanathan, Aditya Akella University of Wisconsin - Madison, Google, Uber under review Abstract Today s data analytics frameworks
More informationThe Legion Mapping Interface
The Legion Mapping Interface Mike Bauer 1 Philosophy Decouple specification from mapping Performance portability Expose all mapping (perf) decisions to Legion user Guessing is bad! Don t want to fight
More informationBenchmarks Prove the Value of an Analytical Database for Big Data
White Paper Vertica Benchmarks Prove the Value of an Analytical Database for Big Data Table of Contents page The Test... 1 Stage One: Performing Complex Analytics... 3 Stage Two: Achieving Top Speed...
More informationEA 2 S 2 : An Efficient Application-Aware Storage System for Big Data Processing in Heterogeneous Clusters
EA 2 S 2 : An Efficient Application-Aware Storage System for Big Data Processing in Heterogeneous Clusters Teng Wang, Jiayin Wang, Son Nam Nguyen, Zhengyu Yang, Ningfang Mi, and Bo Sheng Department of
More informationJob sample: SCOPE (VLDBJ, 2012)
Apollo High level SQL-Like language The job query plan is represented as a DAG Tasks are the basic unit of computation Tasks are grouped in Stages Execution is driven by a scheduler Job sample: SCOPE (VLDBJ,
More informationMain Memory and the CPU Cache
Main Memory and the CPU Cache CPU cache Unrolled linked lists B Trees Our model of main memory and the cost of CPU operations has been intentionally simplistic The major focus has been on determining
More informationBatch Processing Basic architecture
Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3
More informationScheduling. Today. Next Time Process interaction & communication
Scheduling Today Introduction to scheduling Classical algorithms Thread scheduling Evaluating scheduling OS example Next Time Process interaction & communication Scheduling Problem Several ready processes
More informationAccelerating Spark Workloads using GPUs
Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark
More informationEECS 571 Principles of Real-Time Embedded Systems. Lecture Note #8: Task Assignment and Scheduling on Multiprocessor Systems
EECS 571 Principles of Real-Time Embedded Systems Lecture Note #8: Task Assignment and Scheduling on Multiprocessor Systems Kang G. Shin EECS Department University of Michigan What Have We Done So Far?
More informationCS 406/534 Compiler Construction Putting It All Together
CS 406/534 Compiler Construction Putting It All Together Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationSql Server 2008 Query Table Schema Management Studio Create
Sql Server 2008 Query Table Schema Management Studio Create using SQL Server Management Studio or Transact-SQL by creating a new table and in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
More informationRobinHood: Tail Latency-Aware Caching Dynamically Reallocating from Cache-Rich to Cache-Poor
RobinHood: Tail Latency-Aware Caching Dynamically Reallocating from -Rich to -Poor Daniel S. Berger (CMU) Joint work with: Benjamin Berg (CMU), Timothy Zhu (PennState), Siddhartha Sen (Microsoft Research),
More informationOperating Systems. Scheduling
Operating Systems Scheduling Process States Blocking operation Running Exit Terminated (initiate I/O, down on semaphore, etc.) Waiting Preempted Picked by scheduler Event arrived (I/O complete, semaphore
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More informationWarehouse- Scale Computing and the BDAS Stack
Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,
More informationCIS : Scalable Data Analysis
CIS 602-01: Scalable Data Analysis Cloud Workloads Dr. David Koop Scaling Up PC [Haeberlen and Ives, 2015] 2 Scaling Up PC Server [Haeberlen and Ives, 2015] 2 Scaling Up PC Server Cluster [Haeberlen and
More informationPouya Kousha Fall 2018 CSE 5194 Prof. DK Panda
Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationCS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel
CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999
More informationBIG DATA COURSE CONTENT
BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data
More informationScheduling Tasks Sharing Files from Distributed Repositories
from Distributed Repositories Arnaud Giersch 1, Yves Robert 2 and Frédéric Vivien 2 1 ICPS/LSIIT, University Louis Pasteur, Strasbourg, France 2 École normale supérieure de Lyon, France September 1, 2004
More informationShark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )
Shark: SQL and Rich Analytics at Scale Yash Thakkar (2642764) Deeksha Singh (2641679) RDDs as foundation for relational processing in Shark: Resilient Distributed Datasets (RDDs): RDDs can be written at
More informationCPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections )
CPU Scheduling CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections 6.7.2 6.8) 1 Contents Why Scheduling? Basic Concepts of Scheduling Scheduling Criteria A Basic Scheduling
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationOracle Big Data Fundamentals Ed 2
Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies
More informationV Conclusions. V.1 Related work
V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical
More informationCentralized versus distributed schedulers for multiple bag-of-task applications
Centralized versus distributed schedulers for multiple bag-of-task applications O. Beaumont, L. Carter, J. Ferrante, A. Legrand, L. Marchal and Y. Robert Laboratoire LaBRI, CNRS Bordeaux, France Dept.
More informationMaking the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor
Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationToday s Papers. Architectural Differences. EECS 262a Advanced Topics in Computer Systems Lecture 17
EECS 262a Advanced Topics in Computer Systems Lecture 17 Comparison of Parallel DB, CS, MR and Jockey October 30 th, 2013 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences
More informationStream Processing on IoT Devices using Calvin Framework
Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised
More informationThe Right Read Optimization is Actually Write Optimization. Leif Walsh
The Right Read Optimization is Actually Write Optimization Leif Walsh leif@tokutek.com The Right Read Optimization is Write Optimization Situation: I have some data. I want to learn things about the world,
More information