Analysis of Virtual Machine Scalability based on Queue Spinlock


Seunghyub Jeon, Seung-Jun Cha, Yeonjeong Jung, Jinmee Kim and Sungin Jung

Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon, 34129, Korea
{shjeon00, seungjunn, yjjeong, jinmee,

Abstract. To serve applications that require large amounts of memory and processing resources, cloud providers offer instances with many cores, but the performance of these instances does not scale with the number of CPUs. Various locking mechanisms have been proposed to address this problem, and Linux kernel 4.2 provides a queue spinlock. In this paper, we analyze the performance problems of the queue spinlock in a manycore virtual machine through benchmarks and suggest a simple way to improve them.

Keywords: queue spinlock, virtual machine scalability.

1 Introduction

Applications such as in-memory databases, big-data analytics, and deep learning are becoming increasingly popular. These applications need large amounts of memory and robust processing power to process large amounts of data simultaneously. To meet these requirements, cloud providers such as Amazon and Google have started offering the enterprise-class X1 instance with 128 cores [1] and the n1-standard-96 instance with 96 cores [2], respectively. However, the performance of a virtual machine does not increase in proportion to the number of cores [3][4]. One cause is the ticket spinlock in Linux, which generates cache-coherence traffic. Another is the set of performance anomalies in virtualized locks, such as the lock holder preemption (LHP) problem, the lock waiter preemption (LWP) problem, and the sleepy spinlock anomaly (SSA) [4]. Linux kernel 4.2 introduces the queue spinlock, which compensates for the problems of the existing ticket spinlock.
A hardware VM (HVM) using PLE (Pause-Loop-Exit) scaled with the number of cores, but it still suffered extreme performance degradation in an overcommitted virtualized environment. A paravirtualized VM (PVM), on the other hand, degrades from 90 cores onward, but it performs better than HVM in an overcommitted environment [5]. In this paper, we improve the scalability of PVM by adding a hypercall that uses the PLE handler of HVM, and we show the problems of HVM and PVM in the overcommitted state through performance analysis.

ISSN: ASTL Copyright 2017 SERSC

2 Background and Related Work

2.1 Paravirtualized Queue Spinlock

The queue spinlock [7] is a customized version of the MCS lock [6], modified to fit the existing Linux spinlock data structure. It eliminates cache-line bouncing by having each waiter spin on its own per-CPU structure, and it shows good results on the AIM7 benchmark under high contention [7]. The paravirtualized queue spinlock uses two hypercalls, pv_wait and pv_kick, to halt a vcpu instead of letting it busy-wait: pv_wait suspends the vcpu, and pv_kick wakes a suspended vcpu. pv_wait is normally called after the waiter has spun for SPIN_THRESHOLD, but it is called immediately if the previous lock waiter is already halted. This somewhat alleviates the sleepy spinlock anomaly. In Linux, pv_wait is implemented with the halt instruction.

2.2 Hardware VM using PLE

PLE (Pause-Loop-Exit) is a hardware feature that prevents a vcpu from wasting CPU time by busy-waiting on a spinlock inside a virtual machine. PLE detects that a virtual CPU is spinning on a lock and traps to the host. The PLE handler then chooses the best candidate vcpu to run and performs a directed yield to it [8]. This mitigates the LHP problem by identifying a potential lock holder and boosting that vcpu.

3 Design and Implementation

To address the PVM LHP problem described above in a simple way, we reuse the existing PLE handler of HVM. For this purpose, we added a hypercall that directly calls the PLE handler (kvm_vcpu_on_spin) in KVM. We call this slightly modified version PVM-SWPLE.

Fig. 1. Difference between PVM, HVM and PVM-SWPLE
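The waiting policy of the paravirtualized queue spinlock in Section 2.1 can be illustrated with a small model. This is only a sketch under stated assumptions: the function name wait_for_lock, its parameters, and the numeric value given to SPIN_THRESHOLD are hypothetical and chosen for illustration; the paper names SPIN_THRESHOLD but does not state its value, and the real logic lives in the kernel, not in Python.

```python
from typing import Tuple

# Assumed spin budget for illustration only; the paper does not give a value.
SPIN_THRESHOLD = 1 << 15


def wait_for_lock(prev_halted: bool, spins_needed: int,
                  threshold: int = SPIN_THRESHOLD) -> Tuple[int, bool]:
    """Model one queued waiter; return (spins_done, called_pv_wait).

    prev_halted  -- True if the previous waiter in the queue is already halted
    spins_needed -- spin iterations after which the lock would be handed over
    """
    if prev_halted:
        # The predecessor is asleep, so the lock cannot arrive soon:
        # call pv_wait at once instead of burning the spin budget.
        return 0, True
    if spins_needed <= threshold:
        # The lock arrives within the spin budget; no hypercall is needed.
        return spins_needed, False
    # Spin budget exhausted: suspend the vcpu with pv_wait.
    return threshold, True


if __name__ == "__main__":
    print(wait_for_lock(prev_halted=False, spins_needed=100))
    print(wait_for_lock(prev_halted=False, spins_needed=SPIN_THRESHOLD + 1))
    print(wait_for_lock(prev_halted=True, spins_needed=100))
```

The third call shows the shortcut that targets the sleepy spinlock anomaly: a waiter behind a halted predecessor stops spinning immediately.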

Fig. 1 shows the difference between PVM, HVM, and PVM-SWPLE. Like PVM, PVM-SWPLE spins for SPIN_THRESHOLD while waiting to acquire the lock, but it then issues the hypercall instead of executing the halt instruction. PVM-SWPLE differs slightly from HVM in that it sleeps after yielding, whereas HVM boosts another candidate and returns to the VM.

4 Evaluation

4.1 Experimental setup

To measure the scalability of a virtual machine, we used an IBM x3950 x6 server with eight Xeon E (2.3 GHz, 15 cores) processors. Each virtual machine has 120 vcpus, the same number as the host, and 64 GBytes of memory. The benchmark is MOSBENCH gmake, which measures the Linux kernel build time; tmpfs is used to reduce the impact of I/O. Experiments were conducted on HVM, PVM, and PVM-SWPLE, and the kernel build time was measured while increasing the number of cores and the number of virtual machines.

4.2 Experiment results

The experiment results are shown in Fig. 2. For HVM, performance increases in proportion to the number of vcpus at VM=1, but it collapses from 30 vcpus at VM=2 and is worst at VM=4. For PVM, performance degrades from 90 cores at VM=1 and saturates from 60 cores at VM=2 and VM=4. For the proposed PVM-SWPLE, performance is similar to that of HVM at VM=1. At VM=2 it degrades from 90 vcpus, but it is still better than both HVM and PVM.

Fig. 2. Performance comparison according to the number of cores and virtual machines.
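The three waiting behaviours compared in Section 3 and Fig. 1 can be summarised as a small lookup. This is a descriptive sketch only: the Mode enum, the function on_spin_limit, and the action labels are hypothetical names introduced here, not identifiers from KVM or the Linux kernel. It records what the text says each variant does once a waiter stops spinning.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    PVM = "PVM"
    HVM = "HVM"
    PVM_SWPLE = "PVM-SWPLE"


def on_spin_limit(mode: Mode) -> List[str]:
    """Return the ordered actions taken once a waiter stops spinning."""
    if mode is Mode.PVM:
        # pv_wait is implemented with the halt instruction.
        return ["halt"]
    if mode is Mode.HVM:
        # Hardware traps to the host; the PLE handler boosts a candidate
        # and the vcpu goes back to spinning on the same lock.
        return ["ple_exit", "directed_yield", "return_to_vm_and_spin"]
    if mode is Mode.PVM_SWPLE:
        # The added hypercall invokes the same PLE handler,
        # then the vcpu sleeps instead of spinning again.
        return ["hypercall", "directed_yield", "sleep"]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, "->", on_spin_limit(mode))
```

Written this way, the only difference between HVM and PVM-SWPLE is the entry path into the handler and the final step.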

Fig. 3 shows the number of VM_EXIT occurrences, which we measured to find the cause of the performance collapse of HVM at VM=2. As can be seen in Fig. 3, the number of VM_EXITs increases dramatically beyond 30 cores in HVM and beyond 90 cores in PVM-SWPLE. This is consistent with the points of degradation in Fig. 2. PLE exits surge because, after the PLE handler boosts a candidate lock holder, the vcpu returns to the VM and spins on the same lock again.

Fig. 3. VM_EXIT COUNT at VM=2

Fig. 4 is a snapshot of a perf profiling call graph [9] in HVM (at 45 vcpus) and is further evidence that too many VM_EXITs hurt performance: a large share of the execution time is spent in the PLE handler.

Fig. 4. Snapshot of perf profiling.

5 Conclusion

In this paper, we proposed PVM-SWPLE, which applies the PLE handler of HVM to PVM, and examined its performance. PVM-SWPLE improves performance by inheriting the advantages of both. However, it is still not scalable in overcommitted environments because the LHP problem is not sufficiently solved. Scalability needs to be improved further by reducing the number of VM_EXITs, which could be done by using the characteristics of the queue spinlock to identify the lock holder and yield to it.

Acknowledgments. This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. B , Research on High Performance and Scalable Manycore Operating System)

References

1. Amazon EC2 Instance Types,
2. Google Machine Types,
3. Seung-Jun Cha, et al. Virtual-Machine Scalability Evaluations in the Clouds of Manycore, KIISE winter conference,
4. Kashyap, et al. "Scalability in the Clouds!: A Myth or Reality?." Proceedings of the 6th Asia-Pacific Workshop on Systems. ACM,
5. SeungHyub Jeon, et al. Performance Experiments of Manycore Virtual machine in the Overcommitted Clouds, KIISE winter conference,
6. J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21-65,
7. qspinlock: Introducing a 4-byte queue spinlock implementation,
8. K.T. Raghavendra, Virtual cpu scheduling techniques for Kernel Based Virtual Machine (KVM), CCEM,
9. perf: Linux profiling with performance counters,

Remote Direct Storage Management for Exa-Scale Storage

, pp.15-20 http://dx.doi.org/10.14257/astl.2016.139.04 Remote Direct Storage Management for Exa-Scale Storage Dong-Oh Kim, Myung-Hoon Cha, Hong-Yeon Kim Storage System Research Team, High Performance Computing