Who stole my CPU? Leonid Podolny, Vineeth Remanan Pillai. Systems Engineering, DigitalOcean


Who stole my CPU? Leonid Podolny (leonid@), Vineeth Remanan Pillai (vineeth@). Systems Engineering @ DigitalOcean


Introduction
DigitalOcean: providing developers and businesses a reliable, easy-to-use cloud computing platform of virtual servers (Droplets), object storage (Spaces), and more.
Systems Engineering: responsible for the hypervisor and its software stack: the host operating system and kernel, KVM, QEMU, libvirt, and miscellaneous services that facilitate VM hosting.

Agenda
Steal: definitions, causes, analysis, mitigation strategies. Mitigation approach at DigitalOcean. Octopus: implementation. Octopus: issues and their resolutions. The NUMA migrations problem. Swapoff enhancements.

What is steal? steal, n.: the fraction of time a vCPU spent waiting in a runqueue for a physical CPU.

What is steal? (contd.) steal, n.: to the hypervisor, a vCPU is just another host task; to the guest, time simply appears to stop. Steal exists only within the VM and is reported to it by the hypervisor.
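Inside the guest, the hypervisor-reported steal shows up as the eighth value on the `cpu` line of /proc/stat, in USER_HZ clock ticks. A minimal sketch of extracting it, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice):

```python
def parse_steal(stat_cpu_line):
    """Return (steal_ticks, total_ticks) from a /proc/stat 'cpu' line.

    steal is the 8th value (index 7 after the 'cpu' label). guest and
    guest_nice are excluded from the total, as they are already
    accounted for in user/nice."""
    fields = [int(f) for f in stat_cpu_line.split()[1:]]
    steal = fields[7]
    return steal, sum(fields[:8])
```

The steal percentage over an interval is the delta of steal ticks divided by the delta of total ticks between two samples of this line.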

Why steal? The obvious reason: more runnable vCPUs than physical CPUs. But how do we explain steal when the hypervisor has enough CPU resources to satisfy the runnable-but-waiting vCPUs?

Steal analysis. Steal can be analysed from two different perspectives:
VM: steal as observed from within the VM; useful for determining whether the steal is impacting the VM.
Hypervisor: the sum of steal across individual VMs; useful for determining mitigation approaches.

Steal as seen from the VM.
Busy steal: CPU utilization + steal is close to 100%; the VM could have made use of the stolen time had it not been stolen.
Idle steal: an idle VM experiencing steal; utilization + steal is significantly below 100%; the VM could not have used the stolen time even if it were available.
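The busy/idle distinction can be sketched as a simple classifier. The 90% cutoff below is an illustrative assumption of ours, not a number from the talk:

```python
def classify_steal(util_pct, steal_pct, busy_threshold=90.0):
    """Classify a VM's steal as 'busy' or 'idle'.

    If utilization + steal approaches 100%, the VM would have used the
    stolen time (busy steal); well below 100%, it could not have
    (idle steal). busy_threshold is an assumed cutoff."""
    if steal_pct == 0:
        return "none"
    return "busy" if util_pct + steal_pct >= busy_threshold else "idle"
```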

Steal as seen from the hypervisor: the total steal experienced by all VMs on a hypervisor.
Busy steal: more runnable vCPUs than physical CPUs; not mitigatable in software; migrating VMs off the busy hypervisor is the probable solution.
Idle steal: caused by scheduler limitations, configuration issues, etc.; mitigatable in software.

Hypervisor idle steal

Hypervisor idle steal: NUMA balancing.
VMs span all the NUMA nodes by default. Linux automatic NUMA balancing tries to keep tasks closer to their memory, using a "memory follows CPU" model and a "CPU follows memory" model. [Diagram: a VM spanning Node 0 / Socket 0 and Node 1 / Socket 1, each with local memory.] The migration threads take up CPU, resulting in steal.

Hypervisor idle steal: NUMA balancing (mitigation).
Pin VMs to NUMA nodes and disable automatic NUMA balancing. [Diagram: the VM now confined to a single node/socket together with its memory.]
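The mitigation can be sketched in two parts: a toy placement of VMs onto nodes, and the host-wide sysctl that turns automatic NUMA balancing off. All names here are hypothetical; the real placement daemon (Octopus, described later) tracks utilization rather than just CPU counts:

```python
def numa_pinning_plan(vm_vcpus, node_cpus):
    """Assign each VM to the NUMA node with the most unassigned CPUs.

    vm_vcpus:  {vm_name: vcpu_count}
    node_cpus: {node_id: [cpu, ...]}
    Returns {vm_name: (node_id, cpu_list)}. Illustrative only."""
    free = {n: list(cpus) for n, cpus in node_cpus.items()}
    plan = {}
    for vm, n_vcpus in sorted(vm_vcpus.items()):
        node = max(free, key=lambda n: len(free[n]))
        plan[vm] = (node, free[node][:n_vcpus])
        free[node] = free[node][n_vcpus:]
    return plan

def disable_numa_balancing():
    """Turn off automatic NUMA balancing host-wide (requires root)."""
    with open("/proc/sys/kernel/numa_balancing", "w") as f:
        f.write("0")
```

In practice the pinning itself would be applied via libvirt's `<numatune>`/`<vcpupin>` configuration or cpusets rather than computed ad hoc like this.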

Hypervisor idle steal (after disabling automatic NUMA balancing)

Hypervisor idle steal: process grouping.
Cgroups: the default cgroups created by libvirt are per-VM, so bigger VMs carry the same weight as smaller VMs.
Example: an 8-CPU hypervisor (800%) running 3 VMs (1x 4-vCPU, 2x 2-vCPU). Each per-VM cgroup gets an equal 266% share: VM1's four vCPUs get 66% each, while VM2's and VM3's two vCPUs get 100% each.

Hypervisor idle steal: process grouping (contd.).
Mitigation: disable CPU cgroups for VMs.
Side effects: loses the capability to control VM CPU utilization, and the autogroup feature kicks in. Disabling autogroup works fine for newly launched VMs, but already-running VMs remain managed by their autogroup.

Hypervisor idle steal: process grouping (contd.).
Working solution: a per-host VM cgroup. Consolidate all vCPUs into one cgroup; CFS then allocates CPU proportionally to the number of vCPUs.
Modified libvirt: a tunable consolidates all vCPUs into one cpu cgroup, as long as the cgroup parameters (cpu.cfs_period_us, cpu.cfs_quota_us) remain at their defaults for all VMs. If any cgroup parameter is modified for a VM, that VM is moved to its own cgroup; it is moved back to the per-host cgroup when the parameters are reverted to defaults.

Hypervisor idle steal: process grouping (contd.).
Per-host cgroup example: the same 8-CPU hypervisor (800%) with 3 VMs (1x 4-vCPU, 2x 2-vCPU). With all vCPUs in one per-host cgroup, every vCPU gets 100%: 400% for VM1 and 200% each for VM2 and VM3.
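The two cgroup layouts in the examples above can be captured in a few lines. This is an illustrative model of the idealized shares shown on the slides, not of actual CFS behaviour (CFS would, for instance, redistribute share a capped VM cannot use):

```python
def cpu_shares(total_cpus, vm_vcpus, per_host=True):
    """Ideal CPU share (in % of one CPU) per VM on a saturated host.

    Per-VM cgroups weight every VM equally, regardless of size; a
    per-host cgroup lets CFS weight each vCPU task equally, i.e.
    proportionally to vCPU count. Usable share is capped at 100%
    per vCPU."""
    total_pct = total_cpus * 100
    if per_host:
        all_vcpus = sum(vm_vcpus.values())
        return {vm: total_pct * n / all_vcpus for vm, n in vm_vcpus.items()}
    share = total_pct / len(vm_vcpus)
    return {vm: min(share, n * 100) for vm, n in vm_vcpus.items()}
```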

Hypervisor idle steal: per-host cgroup

Sidenote: CFS and idle VMs. On a fully utilized hypervisor, idle VMs may experience steal because CFS penalizes them: sharing a hypervisor with busy VMs, their low utilization earns them a lower effective weight, which results in scheduling latency when they compete with the busy VMs' vCPUs.

Octopus: a userspace VM-placement daemon.
Pins VMs to resources (CPUs, NUMA nodes); as a result, occasionally migrates VMs across sockets. Brings VM awareness to vCPU placement; jiffy-level resolution is not needed.
Functionality: NUMA partitioning and utilization tracking on the overcommitted fleet; CPU-to-vCPU mapping on optimized hypervisors.

Octopus issues: swap usage during NUMA migration. OOMs when a VM aggressively allocates RAM: during migration, when the VM allocates RAM faster than RAM can be swapped to disk, and at the swapoff phase, when swap is disabled.

Swapoff optimizations. The swapoff implementation in the kernel is not efficient: it walks every process address space for each swap entry. With a considerably large swap on a heavily loaded hypervisor with thousands of processes, it might take hours or days. Usually swapoff happens during system shutdown, but we use it when migrating VMs between NUMA nodes and need it to be quick. Revived a dormant patch and initiated discussions upstream: https://lkml.org/lkml/2018/10/3/638. The basic approach is a single pass over each process address space, paging in the swapped-out pages.
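The difference between the two strategies is easiest to see as rough operation counts. This is an illustrative cost model of the description above, not kernel code:

```python
def swapoff_cost(n_swap_entries, n_procs, pages_per_proc):
    """Rough work estimates for the two swapoff strategies.

    Current kernel: for every swap entry, scan every process address
    space -> O(entries * processes * pages).
    Proposed patch: a single pass over each address space, faulting in
    swapped pages as they are found -> O(processes * pages)."""
    naive = n_swap_entries * n_procs * pages_per_proc
    single_pass = n_procs * pages_per_proc
    return naive, single_pass
```

With thousands of processes and millions of swap entries, the first term is what turns swapoff into an hours-to-days operation.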

Failing NUMA migrations. Migrating a VM's RAM between NUMA nodes might fail, even with swap to back it, and a dangerous side effect is an OOM kill of VMs. The placement service makes a best effort to avoid OOM by tuning swappiness and pre-calculating available memory and swap before the migration, but the dynamic nature of the hypervisor breaks this approach: VM launches and destroys cause all the calculations to go wrong. Efforts have started in-house on a kernel-level mechanism, built around migrate_pages(2), to perform a safe memory migration that fails gracefully instead of OOMing.
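The pre-calculation step can be sketched as a simple feasibility check. The names and the 10% headroom are our assumptions, and, as noted above, any such userspace check is inherently racy: other VMs can launch or allocate memory after it passes, which is exactly why a kernel-level mechanism is being pursued:

```python
def safe_to_migrate(vm_rss_bytes, target_node_free_bytes, swap_free_bytes,
                    headroom=0.1):
    """Best-effort pre-check before migrating a VM's RAM to another
    NUMA node: does the target node, backed by free swap, have room
    for the VM's resident memory plus some headroom?"""
    need = vm_rss_bytes * (1 + headroom)
    return target_node_free_bytes + swap_free_bytes >= need
```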

Questions?

Thank you!