Starling: A Scheduler Architecture for High Performance Cloud Computing


1 Starling: A Scheduler Architecture for High Performance Cloud Computing. Hang Qu, Omid Mashayekhi, David Terei, Philip Levis. Stanford Platform Lab Seminar, May 17

2 High Performance Computing (HPC). Arithmetically dense codes (CPU-bound) that operate on in-memory data. [images: SpaceX, NOAA, Pixar] A lot of machine learning is HPC.

3 HPC on the Cloud. HPC has struggled to use cloud resources. HPC software has certain expectations: cores are statically allocated and never change; cores do not fail; every core has identical performance. The cloud doesn't meet these expectations: resources are elastic and dynamically reprovisioned, some failures are expected, and performance varies.

4 HPC in a Cloud Framework? Strawman: run HPC codes inside a cloud framework (e.g., Spark, MapReduce, Naiad). The framework handles all of the cloud's challenges for you: scheduling, failure, resource adaptation. Problem: it is too slow, by orders of magnitude. Spark can schedule 2,500 tasks/second, but a typical HPC compute task is 10ms, so each core can execute 100 tasks/second and a single 18-core machine can execute 1,800 tasks/second. Queueing theory and batch processing mean you want to operate well below the maximum scheduling throughput.

5 Starling. A scheduling architecture for high performance cloud computing (HPCC). The controller decides data distribution; workers decide what tasks to execute. In steady state there is no worker/controller communication except periodic performance updates. Starling can schedule up to 120,000,000 tasks/s, scaling linearly with the number of cores. HPC benchmarks run x faster.

6 Outline. HPC background; Starling scheduling architecture; Evaluation; Thoughts and questions.

7 High Performance Computing. Arithmetically intense floating-point problems: simulations, image processing. If you might want to do it on a GPU but it's not graphics, it's probably HPC. Embedded in a geometry, with data dependencies.

8 Example: Semi-Lagrangian Advection. A fluid is represented as a grid; each cell has a velocity. Update each cell's value/velocity for the next time step.

9 [figure only: the fluid grid from the previous slide]

10 Example: Semi-Lagrangian Advection. A fluid is represented as a grid; each cell has a velocity. Update each cell's value/velocity for the next time step.

11 Example: Semi-Lagrangian Advection. A fluid is represented as a grid; each cell has a velocity. Update each cell's value/velocity for the next time step. Walk the velocity vector backwards and interpolate.

12 Example: Semi-Lagrangian Advection. [figure: the grid split into partitions A and B] A fluid is represented as a grid; each cell has a velocity. Update each cell's value/velocity for the next time step. Walk the velocity vector backwards and interpolate. What if partitions A and B are on different nodes?

13 Example: Semi-Lagrangian Advection. [figure: partitions A and B with an overlap region ab] A fluid is represented as a grid; each cell has a velocity. Update each cell's value/velocity for the next time step. Walk the velocity vector backwards and interpolate. What if partitions A and B are on different nodes? Each partition holds partial data from other, adjacent partitions ("ghost cells"), as sketched below.
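
To make the ghost-cell idea concrete, here is a minimal sketch of one advection step on a 1D partition with one ghost cell on each side, assuming a CFL-limited timestep (illustrative code, not the talk's):

#include <cmath>
#include <vector>

// Sketch of one semi-Lagrangian advection step on a 1D grid partition
// (illustrative). The partition owns cells [1, n]; cells 0 and n+1 are
// ghost cells copied from adjacent partitions. Assumes |vel|*dt/dx < 1,
// so the backward trace never leaves the ghost layer; after each step
// the ghost cells are re-exchanged with neighbors.
void advect(std::vector<double>& value, const std::vector<double>& vel,
            double dt, double dx) {
    std::vector<double> next = value;
    for (std::size_t i = 1; i + 1 < value.size(); ++i) {          // owned cells only
        double x = static_cast<double>(i) - vel[i] * dt / dx;     // walk backwards
        int j = static_cast<int>(std::floor(x));                  // may land in a ghost cell
        double frac = x - j;
        next[i] = (1.0 - frac) * value[j] + frac * value[j + 1];  // interpolate
    }
    value = next;
}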

14 High Performance Computing. Arithmetically intense floating-point problems: simulations, image processing. If you might want to do it on a GPU but it's not graphics, it's probably HPC. Embedded in a geometry, with data dependencies. Many small data exchanges after computational steps: ghost cells, reductions, etc.

15 Most HPC Today (MPI)

while (time < duration) {
    dt = calculate_dt();      // locally calculate, then global min
    update_velocity(dt);      // locally calculate, then exchange
    update_fluid(dt);         // locally calculate, then exchange
    update_particles(dt);     // locally calculate, then exchange
    time += dt;
}

Control flow is explicit in each process; processes operate in lockstep. Partitioning and communication are static. If a single process fails, the program crashes. Problems in the cloud: the program runs as fast as its slowest process, and it assumes every node runs at the same speed.

16 Most HPC Today (MPI). (Same loop and bullets as the previous slide: the program runs as fast as its slowest process and assumes every node runs at the same speed.) A hedged MPI sketch of this pattern follows.
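
Here is that lockstep pattern with a real MPI collective (a hedged sketch: local_dt, update_velocity, and exchange_ghost_cells are hypothetical application helpers, while MPI_Allreduce is the actual MPI call):

#include <mpi.h>

double local_dt(void);            // hypothetical: stable timestep for local cells
void update_velocity(double dt);  // hypothetical: local compute
void exchange_ghost_cells(void);  // hypothetical: exchange with neighbors

void simulate(double duration) {
    double time = 0.0;
    while (time < duration) {
        // Locally calculate a timestep, then take the global minimum.
        // Every rank blocks here: the job runs at the slowest process's pace.
        double my_dt = local_dt(), dt;
        MPI_Allreduce(&my_dt, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

        update_velocity(dt);      // locally calculate...
        exchange_ghost_cells();   // ...then exchange
        // update_fluid and update_particles follow the same pattern.
        time += dt;
    }
}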

17 HPC inside a cloud framework? [diagram: HPC code running directly on the cloud (EC2, GCP), versus HPC code running on a cloud framework on the cloud, where the framework provides load balancing, failure recovery, elastic resources, and straggler handling]

18 Frameworks too slow. Frameworks depend on a central controller to schedule tasks onto worker nodes. The controller can schedule ~2,500 tasks/second, while fast, optimized HPC tasks can be ~10ms long, so an 18-core worker executes 1,800 tasks/second: two workers overwhelm a scheduler. Prior work is fast or distributed, not both. [diagram: driver sends tasks to the controller, which dispatches tasks to workers and receives stats back]

19 Outline. HPC background; Starling scheduling architecture; Evaluation; Thoughts and questions.

20 Best of both worlds? Depend on a central controller to balance load, recover from failures, adapt to changing resources, and handle stragglers. Depend on worker nodes to schedule computations, avoid executing in lockstep, use an AMT/software-processor model (track dependencies), and exchange data.

21 Control Signal. We need a different control signal between the controller and workers to schedule tasks. Intuition: tasks execute where their data is resident, and data moves only for load balancing or recovery. The controller determines how data is distributed; workers generate and schedule tasks based on what data they have.

22 Controller Architectures. [diagram: Spark controller and workers — the controller holds the driver program, RDD lineage, a partition manager master mapping partitions to workers (P1→W2, P2→W1, P3→W2), and a scheduler with task graph T1–T4; slaves report their local partitions, and each Spark worker runs a partition-manager slave and a task queue]

23 Controller Architectures. [diagram: Spark controller vs. Starling controller — in Spark, the controller runs the driver program, RDD lineage, partition-manager master, and scheduler; slaves report local partitions and receive tasks into per-worker task queues. In Starling, the controller only decides data placement and broadcasts it, keeping partition-manager replicas and a scheduler that can specify a stage to snapshot at; each Canary worker runs its own copy of the driver program, scheduler, task graph, and task queue]

24 Partitioning Specification. The driver program computes valid placements. Task accesses determine a set of constraints, e.g., A1 and B1 must be on the same node, A2 and B2, and so on. Micropartitions (overdecomposition): if the expected maximum number of cores is n, create 2n-8n partitions. A sketch of one way to encode this follows.
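
One loose way to encode those constraints (an assumed representation, not the paper's): merge co-location constraints union-find style, then place each merged group of micropartitions as a unit:

#include <numeric>
#include <vector>

// Sketch of a partitioning specification (assumed representation, not
// Starling's actual data structures). Co-location constraints such as
// "A1 and B1 must be on the same node" are merged with union-find, and
// each merged group is then assigned to a worker as a unit.
struct Placement {
    std::vector<int> parent;  // union-find over partition ids
    explicit Placement(int num_partitions) : parent(num_partitions) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int p) { return parent[p] == p ? p : parent[p] = find(parent[p]); }
    void same_node(int a, int b) { parent[find(a)] = find(b); }  // add constraint

    // One valid placement: groups spread across workers by their root id.
    std::vector<int> assign(int num_workers) {
        std::vector<int> worker_of(parent.size());
        for (std::size_t p = 0; p < parent.size(); ++p)
            worker_of[p] = find(static_cast<int>(p)) % num_workers;
        return worker_of;
    }
};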

25 Managing Migrations. The controller does not know where workers are in the program, and workers do not know where each other are in the program. When a data partition moves, we need to ensure that the destination picks up where the source left off.

26 Example Task Graph. [figure: task graph for distributed gradient descent — stages scatterW0, gatherW0/gatherW1, gradient0/gradient1, scatterG0/scatterG1, gatherG0, sum0, and again?0, reading and writing partitions LW0/LW1, TD0/TD1, LG0/LG1, GW0, GG0. Legend: LW = local weight, TD = training data, LG = local gradient, GW = global weight, GG = global gradient; edges are reads and writes]

27 Execution Tracking. [table: program position — a stage id per stage (scatterW, gatherW, gradient, scatterG, gatherG, sum, again?) and per-partition metadata for LW0, TD0, LG0, GW0, GG0] Each worker executes the identical program in parallel, so control flow is identical (like GPUs). Tag each partition with the last stage that accessed it, then spawn all subsequent stages. Data exchanges implicitly synchronize.

28 Implicitly Synchronize. [figure: the same task graph as slide 26] Stage gatherG0 cannot execute until both gradient0 and gradient1 complete. Stage again?0 cannot execute until sum0 completes.

29 Spawning and Executing Local Tasks. The metadata for each data partition is the last stage that reads from or writes to it. After finishing a task, a worker: updates the metadata; examines task candidates that operate on data partitions generated by the completed task; and puts a candidate task into a ready queue if all data partitions it operates on are (1) local, and (2) modified by the right tasks. A sketch of this rule follows.
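
A minimal sketch under assumed data structures (a stage tag per partition and a local ready queue; names are illustrative, and it simplifies "modified by the right tasks" to a single expected predecessor stage):

#include <deque>
#include <unordered_map>
#include <vector>

// Simplified sketch of the worker-side rule (illustrative types and names,
// not Starling's API). Each partition carries the id of the last stage that
// accessed it; a candidate task becomes ready when all of its input
// partitions are local and carry the expected predecessor tag.
struct Task {
    int stage;                 // stage id in the (identical) program
    std::vector<int> inputs;   // partition ids this task reads/writes
};

std::unordered_map<int, int>  last_stage;  // partition id -> last stage tag
std::unordered_map<int, bool> is_local;    // partition id -> resident here?
std::deque<Task> ready_queue;              // local queue, no locks needed

// Simplification: assumes all inputs expect the same predecessor stage.
bool is_ready(const Task& t, int expected_stage) {
    for (int p : t.inputs)
        if (!is_local[p] || last_stage[p] != expected_stage) return false;
    return true;
}

void on_task_finished(const Task& done, const std::vector<Task>& candidates) {
    for (int p : done.inputs) last_stage[p] = done.stage;  // update metadata
    for (const Task& c : candidates)                       // examine successors
        if (is_ready(c, done.stage)) ready_queue.push_back(c);
}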

30 Two Communication Primitives. Scatter-gather: similar to MapReduce, Spark's groupBy, etc.; takes one or more datasets as input and produces one or more datasets as output; used for data exchange. Signal: delivers a boolean result to every node; used for control flow. An illustrative shape for both is sketched below.
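
A rough sketch of how the two primitives might look to a driver program (hypothetical signatures with stub bodies; the talk does not give a concrete API):

#include <map>
#include <string>
#include <vector>

using Dataset = std::vector<double>;

// Scatter-gather: take one or more named datasets, repartition them across
// workers (like a MapReduce shuffle or Spark groupBy), and produce one or
// more output datasets. Used for ghost-cell exchange and reductions.
std::map<std::string, Dataset> scatter_gather(
    const std::map<std::string, Dataset>& inputs) {
    return inputs;  // stub: a real implementation moves data between workers
}

// Signal: deliver the same boolean to every node, e.g. the result of the
// "again?" convergence stage that decides whether to run another iteration.
void signal(bool keep_going) {
    (void)keep_going;  // stub: a real implementation broadcasts the bit
}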

31 EVALUATION

32 Evaluation Questions. How many tasks/second can Starling schedule? Where are the scheduling bottlenecks? Does this improve computational performance (logistic regression; k-means and PageRank; Lassen; PARSEC fluidanimate)? Can Starling adapt HPC applications to changes in available nodes?

33 Scheduling Throughput. [plot: CPU utilization vs. average task length (µs), with a marked point at (6.6µs, 90%)] A single core can schedule 136,000 tasks/second while using 10% of its CPU cycles. Scheduling puts a task on a local queue (no locks); no I/O is needed.

34 Scheduling Throughput. Scheduling throughput increases linearly with the number of cores: 120,000,000 tasks/second on 1,152 cores (10% CPU overhead). The bottleneck is data transmission on partition migration.

35 Overprovisioned Scheduling. The workload is 3,600 tasks/second. Queueing delay due to batched scheduling means much higher scheduling throughput is needed than the offered load alone suggests.

36 Queueing Delay. [timeline: the controller dispatches 4 tasks/second to 4 cores; each dispatch takes 0.25s, so the four 1s tasks start at 0.00s, 0.25s, 0.50s, and 0.75s] Load is 4 tasks/second. If the scheduler can handle exactly 4 tasks/second, queueing delay increases execution time from 1s to 1.75s (75%), because the last core only starts its task at 0.75s. A quick check of that arithmetic follows.
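
The same arithmetic as a tiny runnable check (a sketch of the slide's scenario, nothing more):

#include <cstdio>

// With 4 cores, 1-second tasks, and a dispatch rate of exactly 4 tasks/s,
// each dispatch takes 0.25s, so core i starts at i * 0.25s and the last
// task finishes at 0.75s + 1s = 1.75s.
int main() {
    const int    cores         = 4;
    const double task_seconds  = 1.0;
    const double dispatch_rate = 4.0;  // tasks/second
    double makespan = (cores - 1) / dispatch_rate + task_seconds;
    std::printf("batch finishes at %.2fs vs. %.2fs ideal (+%.0f%%)\n",
                makespan, task_seconds,
                100.0 * (makespan - task_seconds) / task_seconds);
    return 0;
}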

37 Logistic Regression. Optimal micropartitioning on 32 workers requires scheduling >18,000 tasks/second. Controller: one c4.large instance (1 core); workers: c4.8xlarge instances (18 cores each).

38 Lassen (Eulerian). Micropartitioning + load balancing runs 3.3x faster than MPI. Controller: one c4.large instance (1 core); workers: c4.8xlarge instances (18 cores each).

39 PARSEC fluidanimate (Lagrangian). Micropartitioning + load balancing runs 2.4x faster than MPI. Controller: one c4.large instance (1 core); workers: c4.8xlarge instances (18 cores each).

40 Migration Costs. 32 nodes, reprovisioned so computation moves off of 16 nodes onto 16 new nodes. Data transfer takes several seconds at 9Gbps, so migration times are huge compared to task execution times. Migration is not a bottleneck at the controller.

41 Starling Evaluation Summary. Starling can schedule 136,000 tasks/s per core, and 120,000,000 tasks/s on 1,152 cores; scheduling must be overprovisioned for high throughput. It schedules HPC applications on >1,000 cores. Micropartitions improve performance but increase the required throughput: 5-18,000 tasks/s on 32 nodes. A central load balancer improves performance x for HPC benchmarks. Starling can adapt to changing resources.

42 Outline. HPC background; Starling scheduling architecture; Evaluation; Thoughts and questions.

43 Thoughts. Many newer cloud computing workloads resemble high performance computing: I/O-bound workloads are slow, CPU-bound workloads are fast. Next-generation systems will draw from both: from cloud computing, handling variation, completion times, failures, programming models, and system decomposition; from HPC, scalability and performance.

44 Thank you! Philip Levis, Hang Qu, Omid Mashayekhi, David Terei, Chinmayee Shah
