Towards Fair and Efficient SMP Virtual Machine Scheduling

Size: px

Start display at page:

Download "Towards Fair and Efficient SMP Virtual Machine Scheduling"

Alice Baker
5 years ago
Views:

1 Towards Fair and Efficient SMP Virtual Machine Scheduling Jia Rao and Xiaobo Zhou University of Colorado, Colorado Springs

2 Executive Summary Problem: unfairness and inefficiency in consolidating SMP VMs Existing VM schedulers favor VMs w/ more virtual CPUs Fairness mechanisms hurt parallel performance Flex is a scheduling framework that: Adaptively adjusts vcpu weights for VM-level fairness Flexibly schedules vcpus to minimize unnecessary spinning Results: 5% error to the ideally fair allocation, 30%+ performance improvement for parallel workloads, ~1% overhead in Xen

3 SMP VM Consolidation Abundant hardware parallelism in DC SMP VMs are prevalent in the cloud 26 out of 29 instance types in Amazon EC2 have more than two vcpus Support parallel applications Heterogeneous consolidation is common, e.g., Amazon EC2

4 Unfair VM CPU Allocation CPU allocation vcpu0 vcpu1 vcpu2 NOT proportional to VM weights VMs w/ more vcpus gain advantage Common issue in all hypervisors vcpu0 Normalized CPU consumption 3-vCPU VM 4-vCPU VM 1.59 vcpu1 vcpu2 pcpu0 pcpu1 pcpu2 pcpu3 Fair share = two CPU cores 1.42 vcpu Xen KVM VMware

5 Causes of Unfairness Per-CPU scheduling Independent scheduler on each CPU VM3 vcpu1 Each allocates CPU based on relative vcpu weights vcpu0 vcpu0 vcpu1 vcpu1 vcpu2 vcpu2 VM3 vcpu0 vcpu3 Per-vCPU weight dependent on VM weight and the number of vcpus Scalable but hard for VM-level fairness pcpu0 pcpu1 pcpu2 pcpu3 Equal weight VMs Box size reflects per-vcpu weight Allocations on vcpu => VM-level fairness IFF Same total weight on each CPU

6 Existing Solutions Capping VM CPU consumption (Cap) Requires pre-calculation of the fair share Introduce significant inefficiencies Non-work-conserving Load balancing (LB) to parallel applications Tries to achieve equal weights on CPUs Balance weight vs. balance run queue length

7 Cap on Busy-waiting-based Workloads Busy-waiting synchronization Dynamic task assignment Tasks stay in a busy loop waiting for lock release Avoids contexts switches CPU time (%) Homo. VM Heter. VM Useful work Busy waiting 20 Capping exacerbates the LHP issue in 0 lu sp ft ua bt cg ep mg Lock holder preemption (LHP) Preemption of vcpu holding locks virtualized environments Since spinning wastes CPU cycles, capping may mistakenly preempt vcpus that are doing useful work. Long synchronization latency

8 LB on Blocking-based Workloads Blocking synchronization Tasks go to sleep if failing to acquire the lock Avoids wasted CPU cycles vcpu stacking CPU time (%) LB exacerbates the vcpu stacking issue 20 0 bodytrack PIN LB Running Blocked canneal dedup facesim raytrace vips x264 vcpus belonging to one VM pile on the same CPU No parallelism + Long sync latency blackscholes streamcluster swaptions Since blocking vcpus frequently switch between READY and RUNNING states, they are more likely victims of work-stealing based load balancing. Gradually, stolen vcpus pile on a few CPUs

9 Related Work Fairness in multicore systems [Li-PPoPP09] - no VM-level fairness Minimizing sync latency in SMP VMs [Sukwong-Eurosys11] - avoids vcpu stacking Flex: non-intrusive, lightweight and applicable [Kim-ASPLOS13], [Weng-HPDC11] - not effective in user-level synchronization Pause loop exit (PLE) - needs hardware support Spin detection to different implementations [Wells-PACT06] - store-based spin detection, not accurate to apps with different store rates, e.g., LU in NAS parallel benchmark

10 Flex for Fairness and Efficiency Flexible vcpu weight (FlexW) Monitors VM CPU consumption Calculates fair shares based on VM weights Adjusts vcpu weights to compensate the difference Flexible vcpu scheduling (FlexS) Stops spinning vcpus to avoid wasted CPU cycles Switches the preempted vcpu with one on another CPU that is doing useful work Ensures that no vcpus from the same VM stack on one CPU

11 FlexW Design Determine the fair share P - number of shared CPUs, wi - VM weight Ideally fair share according to generalized processor sharing (GPS) each do S i,gp S (t 1,t 2 )= P w i (t 2 t 1 ) P w j Adjust VM weights wi r - VM weight P calculate the lag lag i (t 1,t 2 )= S i,gp S (t 1,t 2 ) S i (t 1,t 2 ) S i,gp S (t 1,t 2 ) r r P compensate the lag with real-time weights i,gp S wi r = wi r + w i lag i (t 1,t 2 ) if FAIR WIND

12 FlexS Design Identifying busy-waiting vcpu Non-intrusive identification without application knowledge Common pattern in different spin implementations - Spin loops contain a few instructions BPI Solo Spinning vcpu non-spinning vcpu lu sp cg povray - Spin loops are executed many times Spin loops show high branch per instruction (BPI) and low branch miss prediction rate (BMPR) BPMR Solo Spinning vcpu non-spinning vcpu lu sp cg povray

13 FlexS Design (cont ) Eliminating busy-waiting time Periodically update a vcpu s BPI and BMPR Busy-waiting vcpu voluntarily yields CPU Find a sibling vcpu to complete the unfinished time slice Switch the two vcpu to avoid vcpu stacking and run queue weight changes

14 Practical Considerations Starvation VMs demanding less than its share will have ever increasing real-time weight Solution: reset real-time weight every 10s Infeasible weight -> peak CPU demand less than the fair share Solution: peak demand as the fair share False positive in identifying spinning vcpu Solution: reset BPI and BMPR every 10s Inter-CPU locking overhead due to vcpu migrations Solution: only try twice when looking for siblings to switch- the power of two choices

15 Implementation Implement Flex in Xen s credit scheduler weight -> credit FlexW in the system-wide csched_acct() routine, adjusts VM credit refill based on real-time weights, invoked every 30ms FlexS in the per-cpu schedule() function, adds load_balance_switch() to exchange work with sibling vcpus Identify spinning vcpu in vcpu_acct() when Xen charges credit to the current running vcpu

16 Evaluation Methodology Questions: VM-level fairness? and parallel performance? Workload NAS Parallel benchmark (OpenMP, busy-waiting sync) PARSEC (Pthreads, blocking sync) Background interfering loops isolate from cache contention Scheduling strategies for comparison Xen default credit scheduler Balance+cap+CO - [Sukwong-Eurosys11] Demand+cap - [Kim-ASPLOS13]

17 VM-level Fairness Heterogeneous VMs: 1vCPU, 2vCPU, 3vCPU, 4vCPU Each running while(1) loop Relative lag = ness and differentiation. S i,gp S(t 1,t 2 ) S i (t 1,t 2 ) S i,gp S (t 1,t 2 ) Lower is better Relative lag (%) Xen Flex Number of VMs Flex: significant improvement over Xen with no more than 5% unfairness

18 VM Differentiation Proportional CPU share vCPU VM 3-vCPU VM 2-vCPU VM 3:2:1 3:1:2 2:1:3 2:3:1 1:2:3 1:3:2 VM weights Flex realizes proportional share among VMs

19 Parallel Performance Challenge: Flex allocates less CPU time to the 4vCPU VM than Xen Loop NAS Loop NAS Loop NAS NAS pcpu0 pcpu1 pcpu2 pcpu3 1.5 Xen Balance+cap+CO FlexW FlexW+FlexS Normalized runtime Observation: FlexW alone does NOT guarantee good perf. Reason: Imbalance wastes CPU time Results: FlexW+FlexS performs closely to Xen and balance+cap+co lu sp ft ua bt cg ep mg Lower is better

20 Parallel Performance NAS NAS NAS Xen FlexW Balance+cap+CO FlexW+FlexS Loop Loop Loop Loop 1.5 Normalized runtime pcpu0 pcpu1 pcpu2 pcpu3 Expected: Flex performs better than Xen Reason: Flex allocates more CPU time to the 3vCPU VM than Xen lu sp ft ua bt cg ep mg Lower is better

21 Mix of Parallel Workloads LU NAS LU NAS LU NAS NAS pcpu0 pcpu1 pcpu2 pcpu3 Results: Flex outperforms balance+cap by 30.4% Xen Balance+cap Flex Lower is better 1.5 Normalized runtime Dynamic task assignment lu sp ft ua bt cg ep mg Foreground NAS 0 Background lu

22 Overhead Execution time (micro-second) System-wide csched_acct() Xen Flex Per-CPU func. schedule() Xen Flex FlexW overhead: VM weight adjustment Overhead increases with # of VMs performed by the idle VM not affecting parallel performance FlexS overhead: vcpu stealing constant overhead, not increases with # of VMs frequency of schedule() - 30ms less than1% overhead

23 Conclusions & Future Work Fairness-efficiency tradeoff Straightforward solutions to unfairness lead to poor efficiency Flex: a holistic solution Adaptively adjusts weight for fairness Flexibly schedule vcpus to minimize wasted work Problem: NOT quite effective for apps with dynamic task assignment Future work Cross-layer application-cloud coordination

24 Thank you! Questions?

25 Backup Slides Begin Here

26 Parallel Performance (blocking) Loop Loop Loop PARSEC PARSEC PARSEC PARSEC Xen FlexW Demand+cap FlexW+FlexS pcpu0 pcpu1 pcpu2 pcpu3 1.5 Observation: Both Flex and demand+cap improve perf. 1 Reason: Avoiding vcpu stacking helps a lot 0.5 Conclusion: Flex does not incur much penalty to blocking sync.-based apps 0 blackscholes bodytrack caneal dedup facesim raytrace streamcluster swaptions vips x264

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore