Performance Optimization on Huawei Public and Private Cloud
Jinsong Liu <liu.jinsong@huawei.com>
Lei Gong <arei.gonglei@huawei.com>
Agenda
- Optimization for LHP
- Balance scheduling
- RTC optimization
LHP (Lock Holder Preemption)
- More pronounced under virtualization: the hypervisor's vcpu scheduler can preempt a vcpu while it holds a lock
- This potentially blocks the progress of other vcpus waiting to acquire the same lock
- Synchronization latency increases and performance degrades
How to solve or alleviate LHP?
- PLE (Pause Loop Exiting)
- DLHS (Delayed Lock Holder Scheduling)
- Co-scheduling
- Balance scheduling
PLE (Pause Loop Exiting)
- Hardware support, configured via the VMCS
- Optimization for lock waiters: a spinning vcpu VM-exits actively, avoiding wasted vcpu cycles on invalid spinning
- Pros: supported upstream
- Cons:
  - Setting appropriate values of ple_gap and ple_window is difficult and needs adjustment per workload
  - Finding an appropriate vcpu to yield to is hard
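The interaction between ple_gap and ple_window can be sketched as a small model. This is a hypothetical helper illustrating the heuristic, not the real KVM code: two PAUSE instructions closer together than ple_gap cycles are treated as the same spin loop, and once a spin lasts longer than ple_window cycles the vcpu VM-exits.

```c
#include <stdint.h>

/*
 * Simplified model of the pause-loop-exiting heuristic (names and
 * structure are ours, not the kernel's).  Times are in TSC cycles.
 */
struct ple_state {
    uint64_t ple_gap;      /* max cycles between PAUSEs in one spin  */
    uint64_t ple_window;   /* spin length that forces a VM exit      */
    uint64_t spin_start;   /* time of first PAUSE in current spin    */
    uint64_t last_pause;   /* time of the most recent PAUSE          */
};

/* Called on each PAUSE at time 'now'; returns 1 if a VM exit fires. */
int ple_on_pause(struct ple_state *s, uint64_t now)
{
    if (s->last_pause == 0 || now - s->last_pause > s->ple_gap)
        s->spin_start = now;          /* gap too large: new spin begins */
    s->last_pause = now;
    return now - s->spin_start > s->ple_window;
}
```

This makes the tuning difficulty visible: a small ple_window exits too eagerly for legitimate short spins, while a large one wastes cycles before yielding.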
DLHS (Delayed Lock Holder Scheduling)
- Background & preconditions:
  - Lock holders usually run in interrupt-disabled contexts
  - The lock-holding period is normally short
- Hardware support (e.g. Intel VT-x): interrupt-window exiting
- Software support: hrtimer
DLHS (Delayed Lock Holder Scheduling)
- Solution: grant a lock holder (LH) a grace period before scheduling it out
- If a vcpu is a likely LH:
  - Start an hrtimer, and
  - Set interrupt-window exiting in the VMCS
- If the hrtimer expires:
  - Clear interrupt-window exiting
  - Continue scheduling the vcpu out
- To judge whether the vcpu has released the lock: if PLE or an interrupt-window exit happens:
  - Cancel the hrtimer
  - End the grace period
  - Schedule the vcpu out immediately
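The grace-period state machine above can be sketched as follows. This is a hypothetical model of the decision logic only (the hrtimer is reduced to a flag); the real patch would hook the scheduler and VMCS.

```c
#include <stdbool.h>

/* Events that resolve a DLHS grace period (names are ours). */
enum dlhs_event {
    DLHS_TIMER_EXPIRED,     /* grace period ran out                   */
    DLHS_PLE_EXIT,          /* vcpu started spinning: it is a waiter  */
    DLHS_INTR_WINDOW_EXIT,  /* guest re-enabled interrupts: lock freed */
};

struct dlhs_vcpu {
    bool grace_active;  /* hrtimer armed, preemption delayed          */
    bool intr_window;   /* interrupt-window exiting set in the VMCS   */
};

/* Preemption requested while the vcpu may hold a lock. */
void dlhs_begin_grace(struct dlhs_vcpu *v)
{
    v->grace_active = true;   /* start the hrtimer (modeled as a flag) */
    v->intr_window  = true;   /* request interrupt-window exiting      */
}

/* The scheduler defers preemption while the grace period is active. */
bool dlhs_should_preempt(const struct dlhs_vcpu *v)
{
    return !v->grace_active;
}

/* All three events end the grace period and allow scheduling. */
void dlhs_handle_event(struct dlhs_vcpu *v, enum dlhs_event e)
{
    (void)e;                  /* same resolution in every case         */
    v->grace_active = false;  /* hrtimer expired or cancelled          */
    v->intr_window  = false;  /* clear interrupt-window exiting        */
}
```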
DLHS performance: Hackbench results over 30 runs with 1:3 CPU overcommit, comparing patched VM1-VM3 against unpatched VM1-VM3 (unit: sec, lower is better). [chart]
Co-scheduling & Balance scheduling
- Co-scheduling: run all of a guest's vcpus on pcpus simultaneously at time X
- Balance scheduling: disperse the vcpus across pcpus, without requiring them to run at the same time
[diagram: guest vcpus mapped onto pcpus under each scheme]
Co-scheduling drawbacks
- CPU fragmentation: reduces CPU utilization and delays vcpu execution
- Priority inversion: degrades I/O performance
[timeline diagram: pcpu0/pcpu1 alternating vcpu0/vcpu1 with other tasks and I/O across T0-T4]
Balance scheduling
- Balances vcpu siblings across pcpus without precisely scheduling the vcpus simultaneously
- How?
  - Use a bitmap per VM to record all pcpus it already uses
  - Adjust the scheduler:
    - Enqueue & dequeue
    - Migration / find_idle_cpu / select_task_rq etc.
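The per-VM bitmap rule can be sketched as a placement helper. This is a hypothetical illustration of the idea, not the actual scheduler patch: on enqueue, pick a pcpu whose bit is clear so sibling vcpus spread out; on dequeue, clear the bit.

```c
#include <stdint.h>

#define NR_PCPUS 64  /* assume at most 64 pcpus for a single bitmap word */

struct vm_placement {
    uint64_t used_pcpus;  /* bit i set => pcpu i runs a vcpu of this VM */
};

/* Returns a pcpu with no sibling vcpu, or -1 if all allowed pcpus are
 * taken (caller falls back to normal placement). */
int balance_pick_pcpu(struct vm_placement *vm, uint64_t allowed_mask)
{
    uint64_t free = allowed_mask & ~vm->used_pcpus;
    if (!free)
        return -1;
    int cpu = __builtin_ctzll(free);   /* lowest free pcpu index */
    vm->used_pcpus |= 1ULL << cpu;     /* record it in the VM bitmap */
    return cpu;
}

/* Dequeue path: the vcpu left this pcpu. */
void balance_release_pcpu(struct vm_placement *vm, int cpu)
{
    vm->used_pcpus &= ~(1ULL << cpu);
}
```

Unlike co-scheduling, nothing here forces the chosen vcpus to run at the same instant; the bitmap only keeps siblings from stacking on one pcpu.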
Performance evaluation
- Workload: Pushserver in Huawei Private Cloud, continuous testing for 24 hours
- Results: proportion of chains built within 10 ms (higher is better)

  without balance sched: 70%
  with balance sched:    93.50%
  1:1 vcpupin:           95.30%
RTC on KVM
- Windows uses the RTC as its clock event device
- RTC emulation lives in Qemu, with three timers:
  - rtc_periodic_timer: generates periodic interrupts; programmable to fire at the configured interrupt rate
  - rtc_update_timer: generates alarm interrupts, from once per second to once per day
  - rtc_coalesced_timer: generates compensation interrupts, slewing ticks lost for various reasons; gets worse and worse as VM density increases
- Pain points:
  - Some operations need to hold the BQL (big QEMU lock)
  - Context switching between user space and kernel space
  - Interrupt injection from user space
- Consequences: performance degradation, increased latency, reduced Windows guest density
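For context on the periodic timer: on the MC146818-style RTC, the guest programs the interrupt rate through the rate-select bits (RS3..RS0) of register A. With the standard 32.768 kHz time base, rate values 3-15 yield 65536 >> rate Hz, values 1 and 2 map to 256 and 128 Hz, and 0 disables the periodic interrupt (datasheet behavior; the helper name below is ours). Frequent reprogramming of this register by Windows guests is part of why emulation cost matters.

```c
#include <stdint.h>

/* Periodic interrupt frequency for a given register-A rate select. */
int rtc_periodic_hz(uint8_t rate_select)
{
    rate_select &= 0x0f;          /* RS3..RS0 only */
    if (rate_select == 0)
        return 0;                 /* periodic interrupt disabled */
    if (rate_select <= 2)
        rate_select += 7;         /* 1 -> 256 Hz, 2 -> 128 Hz */
    return 65536 >> rate_select;  /* e.g. rate 6 -> 1024 Hz */
}
```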
RTC optimizations on KVM
- Minimize the influence of the BQL:
  - Place the RTC memory region outside the BQL
  - Use irqfd to inject interrupts
- Hyper-V features (e.g. hyperv clock):
  - Decrease read/write ioport traffic, i.e. fewer accesses on ports 0x70/0x71
- Move RTC emulation into the hypervisor:
  - Inject interrupts in KVM, reducing context switches
  - But: larger attack surface; incompatible with new features such as split irqchip
- Optimize the RTC compensation solution
RTC compensation solution
- Slew RTC ticks in the hypervisor directly
- Count the coalesced interrupts whenever an RTC interrupt injection fails
- Adjust the count when the RTC interrupt rate changes
- Inject coalesced interrupts after the EOI handler, if any exist:
  - No separate timer needed
  - More timely
  - Throttle the speed if too many interrupts have been coalesced
- Live migration support:
  - Save the coalesced-interrupt count on the source side
  - Restore it on the destination side
- Both KVM and Qemu need to be patched
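The bookkeeping above can be sketched as follows. This is a hypothetical model of the described scheme, not the actual KVM/Qemu patches: failed injections bump a counter, each EOI drains one stashed tick, a cap throttles the backlog, and a rate change rescales the remaining debt; the counter itself is what live migration would save and restore.

```c
#include <stdint.h>

struct rtc_compensation {
    uint32_t coalesced;   /* ticks lost so far (migrated with the VM) */
    uint32_t max_backlog; /* throttle: cap on stored lost ticks       */
};

/* A periodic tick could not be injected (previous IRQ still pending). */
void rtc_tick_coalesced(struct rtc_compensation *c)
{
    if (c->coalesced < c->max_backlog)
        c->coalesced++;           /* throttled: excess ticks dropped */
}

/* Guest EOIed the RTC interrupt: returns 1 if one compensation tick
 * should be injected now, 0 otherwise. */
int rtc_eoi_inject(struct rtc_compensation *c)
{
    if (!c->coalesced)
        return 0;
    c->coalesced--;
    return 1;
}

/* Guest reprogrammed the periodic rate: rescale the backlog so the
 * remaining debt keeps roughly the same wall-clock value. */
void rtc_rate_changed(struct rtc_compensation *c,
                      uint32_t old_hz, uint32_t new_hz)
{
    if (old_hz)
        c->coalesced = (uint32_t)((uint64_t)c->coalesced * new_hz / old_hz);
}
```

Draining on EOI is what makes the scheme timely: the guest has just acknowledged an RTC interrupt, so it is ready to take the next one immediately, with no extra timer involved.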
Optimization evaluation: before optimization vs. after optimization. [charts]
Thank You!