Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization

Size: px

Start display at page:

Download "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization"

Kimberly Davis
6 years ago
Views:

1 Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization Wei Chen, Jia Rao*, and Xiaobo Zhou University of Colorado, Colorado Springs * University of Texas at Arlington

2 Data Center Computing Challenges - Increase hardware utilization and efficiency - Meet SLOs Heterogeneous workloads - Diverse resource demands Short jobs v.s. long jobs - Different QoS requirements Latency v.s. throughput

3 Data Center Computing Challenges - Increase hardware utilization and efficiency - Meet SLOs Heterogeneous workloads - Diverse resource demands Short jobs v.s. long jobs - Different QoS requirements Latency v.s. throughput Long jobs help improve hardware utilization while short jobs are important to QoS

4 Data Center Trace Analysis 10% long jobs account for 80% resource usage Tasks are evicted if encountering resource shortage Google traces (

5 Data Center Trace Analysis 10% long jobs account for 80% resource usage Tasks are evicted if encountering resource shortage Short jobs have higher priority and most preempted (evicted) s belong to long jobs Google traces (

6 Overhead of Kill-based Preemption Normalized slowdown Spark Pagerank Kmeans Bayes Wordcount Terasort Normalized slowdown MapReduce Wordcount Grep RandWrite Terasort SelfJoin Map-heavy Reduce-heavy 1. MapReduce jobs experience various degrees of slowdowns 2. Spark jobs suffer from more slowdowns due to frequent inter- synchronization and the re-computation of failed RDDs

7 Our Approach Container-based preemption - Containerize s using docker and control resource via cgroup - Task preemption without losing the execution progress Suspension: reclaim resources from a preempted Resumption: re-activate a by restoring its resource Preemptive fair share scheduler - Augment the capacity scheduler in YARN with preemptive scheduling and fine-grained resource reclamation

8 Related Work Optimizations for heterogeneous workloads - YARN [SoCC 13]: kill long jobs if needed Long job slowdown and resource waste - Sparrow [SOSP 13]: decentralized scheduler for short jobs No mechanism for preemption - Hawk [ATC 15]: hybrid scheduler based on reservation Hard to determine optimal reservation Task preemption - Natjam [SoCC 13], Amoeba [SoCC 12]: proactive checkpointing Hard to decide frequency - CRIU [Middleware 15]: on-demand checkpointing Application changes required Task containerization - Google Borg [EuroSys 15]: mainly for isolation Still kill-based preemption

9 Related Work Optimizations for heterogeneous workloads - YARN [SoCC 13]: kill long jobs if needed Long job slowdown and resource waste - Sparrow [SOSP 13]: decentralized scheduler for short jobs If short jobs can timely preempt long jobs - Hawk [ATC 15]: hybrid No scheduler need for based cluster on reservation Task preemption Preserving long job s progress Application agnostic - Natjam [SoCC 13], Amoeba [SoCC 12]: proactive checkpointing Fine-grained resource management - CRIU [Middleware 15]: on-demand checkpointing Task containerization No mechanism for preemption Hard to determine optimal reservation Hard to decide frequency Application changes required - Google Borg [EuroSys 15]: mainly for isolation Still kill-based preemption

10 Container-based Task Preemption Task containerization - Launch s in Docker containers - Use cgroup to control resource allocation, i.e., CPU and memory Task suspension - Stop execution: deprive of CPU - Save context: reclaim container memory and write dirty memory pages onto disk Task resumption - Restore resources

11 Task Suspension and Resumption Keep a minimum footprint for a preempted : 64MB memory and 1% CPU Reclaim memory Restore memory Deprive CPU Restore CPU & memory

12 Task Suspension and Resumption Keep a minimum footprint for a preempted : 64MB memory and 1% CPU Reclaim memory Suspended is alive, but does not make progress or affect other s Restore memory Deprive CPU Restore CPU & memory

13 Two Types of Preemption

14 BIG-C: Preemptive Cluster Scheduling Container allocator - Replaces YARN s nominal container with docker Container monitor - Performs container suspend and resume (S/R) operations Resource monitor & Scheduler - Determine how much resource and which container to preempt Resource Monitor Resource Manager Preemption Decision Request Scheduler Application Master Heartbeat Launch Release Task Launch Container Allocator Node Node Node S/R Node Manager Container Monitor Source code available at

15 YARN s Capacity Scheduler Work conserving, use more than fair share if rsc is available DRF Capacity scheduler Cluster resource

16 YARN s Capacity Scheduler Work conserving, use more than fair share if rsc is available DRF Capacity scheduler Cluster resource

17 YARN s Capacity Scheduler Work conserving, use more than fair share if rsc is available DRF Capacity scheduler Cluster resource

18 YARN s Capacity Scheduler Work conserving, use more than fair share if rsc is available DRF Capacity scheduler Cluster resource At least kill one long Rsc reclamation does not enforce DRF

19 VOID Capacity scheduler Preemptive Fair Share Scheduler Work conserving, use more than fair share if rsc is available DRF Preemptive fair sharing Cluster resource

20 VOID Capacity scheduler Preemptive Fair Share Scheduler Work conserving, use more than fair share if rsc is available DRF Preemptive fair sharing Cluster resource

21 VOID Capacity scheduler Preemptive Fair Share Scheduler Work conserving, use more than fair share if rsc is available DRF Preemptive fair sharing Cluster resource

22 VOID Capacity scheduler Preemptive Fair Share Scheduler Work conserving, use more than fair share if rsc is available Cluster resource Preempt part of rsc Enforce DRF, avoid unnecessary reclamation DRF Preemptive fair sharing

23 Compute DR at Task Preemption MEM MEM MEM CPU CPU CPU MEM MEM MEM CPU CPU CPU

24 Compute DR at Task Preemption MEM MEM MEM CPU CPU CPU MEM MEM MEM CPU CPU CPU

25 Container Preemption Algorithm Choose a job with the longest remaining time Yes No Yes Choose a container c from the preempted job No END

26 Container Preemption Algorithm Choose a job with the longest remaining time Yes No Yes Choose a container c from the preempted job No END

27 Optimizations Disable speculative execution of preempted s - Suspended s appear to be slow to the cluster manager and will likely trigger futile speculative execution Delayed resubmission - Tasks may be resubmitted immediately after preemption, causing them to be suspended again. A suspended is required to perform D attempts before it is re-admitted

28 Experimental Settings Hardware - 26-node cluster; 32 cores, 128GB on each node; 10Gbps Ethernet, RAID-5 HDDs Software - Hadoop-2.7.1, Docker Cluster configuration - Two queues: 95% and 5% shares for short and long jobs queues, respectively - Schedulers: FIFO (no preemption), Reserve (60% capacity for short jobs), Kill, IP and GP - Workloads: Spark-SQL as short jobs and HiBench benchmarks as long jobs

29 Synthetic Workloads Cluster utilization (%) Time (S) High, low, and multiple bursts of short jobs. Long jobs persistently utilize 80% of cluster capacity

30 Short Job Latency with Spark Low-load High-load Multi-load CDF JCT (S) JCT (S) JCT (S) FIFO is the worst due to the inability to preempt long jobs Reserve underperforms due to lack of reserved capacity under high-load GP is better than IP due to less resource reclamation time or swapping

JCT (s) Performance of Long Spark Jobs Low-load High-load Multi-load FIFO Reserve Kill IP GP FIFO Reserve Kill IP GP FIFO Reserve Kill IP GP FIFO is the reference performance for long jobs GP

31 JCT (s) Performance of Long Spark Jobs Low-load High-load Multi-load FIFO Reserve Kill IP GP FIFO Reserve Kill IP GP FIFO Reserve Kill IP GP FIFO is the reference performance for long jobs GP achieves on average 60% improvement over Kill. IP incurs significant overhead to Spark jobs: - aggressive resource reclamation causes system-wide swapping - completely suspended s impede overall job progress

32 Short Job Latency with MapReduce Low-load High-load Multi-load CDF JCT (S) JCT (S) JCT (S) FIFO (not shown) incurs mins slowdown to short jobs Re-submissions of killed MapReduce jobs block short jobs IP and GP achieve similar performance

Performance of Long MapReduce Jobs Map-heavy Wordcount Reduce-heavy Terasosrt Normalized JCT (s) Low-load High-load Multi-load Low-load High-load Multi-load Kill

33 Performance of Long MapReduce Jobs Map-heavy Wordcount Reduce-heavy Terasosrt Normalized JCT (s) Low-load High-load Multi-load Low-load High-load Multi-load Kill performs well for map-heavy workloads IP and GP show similar performance for MapReduce workloads - MapReduce s are loosely coupled - A suspended does not stop the entire job

34 Google Trace Contains 2202 jobs, of which 2020 are classified as short jobs and 182 as long jobs. Cluster utilization (%) Time (S)

35 Google Trace Contains 2202 jobs, of which 2020 are classified as short jobs and 182 as long jobs. Cluster utilization (%) IP and GP guarantee short job latency GP improved the 90th percentile long job runtime by 67%, 37% and 32% over kill, IP, and Reserve, respectively 23% long jobs failed with kill-based preemption while BIG-C cause NO job failures. Time (S)

36 Summary Data-intensive cluster computing lacks an efficient mechanism for preemption - Task killing incurs significant slowdowns or failures to preempted jobs BIG-C is a simple yet effective approach to enable preemptive cluster scheduling - lightweight virtualization helps to containerize s - Task preemption is achieved through precise resource management Results: - BIG-C maintains short job latency close to reservation-based scheduling while achieving similar long job performance compared to FIFO scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment