Virtualization and High-Availability

LAAS, 30 November 2009
François Armand, OpenWide, Université Paris 7
francois.armand@openwide.fr

Agenda
- Reminder about virtualization, HA, SA Forum
- HA challenge introduced by virtualization
- Some failover scenarios
- Reminder on virtualization and devices
- Local failover scenario and issues
- Some initial modeling work

Virtualization Enables Consolidation
Taking advantage of more powerful hardware.
[Figure: applications A, B, C, D running on SMP systems, shown without and with a virtualization layer (VMM) over the multicore hardware.]

Classic VMs
Run multiple independent OSes (guest OSes) simultaneously on the same processor, each in its own Virtual Machine.
Two main approaches:
- Native VMs: introduce a software layer between the hardware and the OS: the Virtual Machine Monitor (VMM), or bare-metal hypervisor
- Hosted VMs: require a host OS to start first
[Figure: native stack (hardware, VMM, guest OSes, apps) next to hosted stack (hardware, host OS, VMM, apps).]

Taxonomy (derived from E. Smith & Nair)
[Figure: taxonomy tree. System level (ISA) VMs and process level (ABI) VMs, each split into same ISA (=) and different ISA (#).]
Entries recoverable from the figure:
- Hardware emulation, whole system: Bochs, QEMU
- Classic VM, native, Type I, paravirtualized: Xen, VLX
- Classic VM, native, Type I, HW assisted: Xen, VLX
- Transparent full/native virtualization, dynamic binary translation: VMware ESX
- Hosted, Type II: VMware WS, KVM, VirtualBox
- Multiprogrammed systems; dynamic translators
- Multitask virtualization, virtual servers: Virtuozzo, Solaris Zones
- Translator: WINE
- ISA & ABI translator: FX!32
- ISA & translator: Transitive
- High level language: Java

Virtualization and Availability
- VM live migration for planned downtime
- Hypervisor-based fault tolerance
- VMM rejuvenation: in-RAM suspension (ACPI S3) + kexec-style VMM restart
- Marathon HA and FT (lock step), VMware FT
- Kemari (Xen, KVM): synchronisation of a passive VM

HA is about surviving failures
- To survive a single failure you need a redundant component (spare, standby, ...)
- The system can detect faults and reconfigure itself to use a redundant component
[Figure: fault, error, failure chain. Without redundancy, a component failure becomes a system failure seen by the user. With a standby component, the error is detected and the user sees no failure. Copyright 2006 Service Availability Forum, Inc.]

4 Means to achieve HA
Fault Prevention (MTBF)
- Quality assurance
- Avoid operator errors
- Avoid overload situations
- Supported by virtualization: independent management of OSes, fair-share scheduling policy in the VMM
Fault Removal (MTBF)
- Remove faults (after/before)
- HW maintenance: corrective, preventive (FRU)
- Collection of evidence: logs, dumps
- Supported by virtualization: software upgrades
Fault Tolerance (MTTR + MTBF)
- Survive in spite of failures
- Detection, isolation, recovery, repair => HA middleware
- Supported by virtualization: fully isolated guest OSes, independent reboot of OSes, fast restart
Fault Forecasting (MTBF)
- Evaluation of system behaviour
- Qualitative: identify, classify, rank failures
- Quantitative: evaluate the probability with which the attributes of dependability are satisfied

What's (usually) needed
- Redundant hardware: costly but easy
- Redundant runtime software instances. This creates additional needs:
  - Need to determine which instance is active / passive
  - Need to determine failure and instruct passive to go active
  - Need to help active to send state to passive (checkpoints)
  - And much more
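The three needs listed above (role assignment, failure detection with promotion, and checkpoints) can be sketched as follows. This is a minimal illustration, not an SA Forum API: the `Instance` class, the heartbeat timestamps, and the `timeout` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One redundant software instance (hypothetical model)."""
    name: str
    role: str                      # "active", "standby" or "failed"
    last_heartbeat: float = 0.0    # time of the last heartbeat received
    state: dict = field(default_factory=dict)


def checkpoint(active: Instance, standby: Instance) -> None:
    """Send the active instance's state to the standby (a checkpoint)."""
    standby.state = dict(active.state)


def monitor(active: Instance, standby: Instance, now: float, timeout: float) -> Instance:
    """Return the instance that should be serving at time `now`.

    The active is declared failed when its last heartbeat is older than
    `timeout`; the standby is then instructed to go active.
    """
    if now - active.last_heartbeat > timeout:
        active.role = "failed"
        standby.role = "active"
        return standby
    return active


a = Instance("A", "active", last_heartbeat=10.0, state={"sessions": 3})
b = Instance("B", "standby")
checkpoint(a, b)
serving = monitor(a, b, now=10.5, timeout=1.0)      # heartbeat still fresh
after_fault = monitor(a, b, now=12.0, timeout=1.0)  # heartbeat missed
```

After the missed heartbeat, `b` is serving with the state it received at the last checkpoint; anything the active changed after that checkpoint is lost, which is why checkpoint frequency matters.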

Service Availability Forum
SAF provides specifications to standardize APIs for software providing availability services:
- Checkpoint, log, notification, alarms, events, messaging
- Availability Management Framework (AMF)
- Hardware Platform Interface (HPI)
- Software Management Framework (SMF)
- Platform Management (PLM)
- Information Model Management (IMM)
- And more
See http://www.saforum.org/

Virtualization modifies underlying HA assumption / paradigm
[Figure: stack of application, HA middleware and VMM over two cores.]

Challenge arising from HW evolution: virtualization must meet HA requirements
Traditional single core based platform:
- 1-to-1 dependency of Appli / OS / HW
- 1 application per processing blade
- [Monocore processors on blades]
New multicore virtualized platform:
- Virtualization enables consolidation
- Many-core processors are part of next designs
- Virtualization will be a key element of platforms
[Figure: applications a, b, c and HA management each on their own blade, versus the same applications on one multicore blade (core 1, cores 2 & 3, core 4) with virtualization management.]

Challenge arising from HW evolution: virtualization adds a new dimension to HA
Traditional HA:
- 1-to-1 dependency of Appli / OS / HW
- 1 application per processing blade
- Redundancy & HA managed at blade level
New HA:
- Virtualization introduces a new entity: VMs
- HA dependency chain is modified
- Management of VMs by HA enhances platform availability
[Figure: HA management and applications directly on HW, versus HA management and applications on a virtualization layer over HW.]

Remote Failover
- Resilience to software and hardware failures
- HA management is redundant as well
[Figure: two physical machines, each with HA management and application VMs over virtualization and HW; application x is active on one machine and application y is standby on the other.]

Failover on Hardware Failure
- Multiple failovers can be handled simultaneously
- Failovers could be directed to different physical machines
- Redundancy models: 2N is OK; N+1 and N+M are open questions
[Figure: two physical machines; applications a and x are active on one and standby on the other, each machine running HA management over virtualization and HW.]

Local Failover
- Low cost hardware solution
- Resilience to software failure only
- Restart failed VM (policy defined)
- Can reboot hardware upon escalation
[Figure: a single machine with HA management, an active and a standby application VM; after failover the surviving application VM runs ALONE.]
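The "restart failed VM, reboot hardware upon escalation" policy can be sketched as below. The threshold `max_vm_restarts` and the action names are assumptions made for illustration; the slides only say that the policy is defined by configuration.

```python
def recovery_action(vm_restarts_so_far: int, max_vm_restarts: int = 3) -> str:
    """Pick the recovery action for a failed VM.

    Hypothetical policy: restart the VM until it has already been
    restarted `max_vm_restarts` times, then escalate to a hardware reboot.
    """
    if vm_restarts_so_far < max_vm_restarts:
        return "restart_vm"
    return "reboot_hardware"


# Successive failures of the same VM: three restarts, then escalation.
actions = [recovery_action(n) for n in range(5)]
```

A real policy would typically also reset the restart counter after a period without failure, so that unrelated failures far apart in time do not trigger an escalation.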

Single Hardware Issues
- The VM performing HA management and the VM providing device access each add a SPOF, in addition to the HW and the VMM
- HW (board, devices): it's OK, it's a design choice
- VMM: it's OK, limited amount of code
- HA management can run replicated by design: partly solves the problem
- Issue: device access / management. More than 80% of system failures stem from device drivers (cf. Nooks)
[Figure: HA management VM, active and standby application VMs over virtualization and a single HW.]

Virtualization and devices
- Shared devices: accessed by more than one VM
  - Ex: disk is shared, partitions are not
  - Ex: Ethernet, actually bridging/routing between virtual and physical
- Non-shared devices: devices used exclusively by a single VM
  - Ex: network interface
- Virtualized by the VMM
- Virtualized within a dedicated VM: Dom0, Dom I/O in Xen, any VM in VLX
- Direct physical device access from a VM: VT-d, PCI support / extensions, VMDQ, ... / VLX, ...

Virtualization and Devices
Different ways to provide access to devices:
- Transparent I/Os or para-virtualized I/Os
- Pros and cons in both cases
[Figure: transparent I/O, where the guest driver goes through I/O conversion and the real driver in the VMM to the device controller; para-virtualized I/O, where a front-end driver in the application VM talks to a back-end driver in a VM holding the native driver.]

Virtualization and Devices (Cont'd)
- Better hardware support: PCI SR-IOV, MR-IOV, Intel VT-d, specific controllers (e.g. VMDQ)
- Or specific VMM implementations (VLX)
- Unmodified drivers, better performance
[Figure: applications with a native driver accessing the device controller directly, above the VMM.]

Sharing Devices
- Shared devices are a concern for failure resilience
- Shared devices provided by the VMM: failure of the driver implies failure of the VMM, and failure of all VMs
[Figure: two application VMs with their drivers over a VMM performing I/O conversion and hosting the real driver for the device controller.]

Sharing Devices
- Sharing provided by a VM, through a back-end driver
- Failure of the driver => failure of that VM
- Only client VMs are impacted
- Restart under condition
[Figure: a VM with native driver and back-end driver serving the front-end driver of an application VM, over the VMM and the device controller.]

Not Sharing Devices
- Multiple I/O-capable VMs could solve the dependability issue
- At the cost of more devices
[Figure: two application VMs, each with its own native driver and its own device controller, over the VMM.]

Fully Independent VMs (for I/Os)
- VMs are physically independent of each other
- Not a typical SBC device (disk) configuration
- But provides redundancy and HA with limited SPOFs: hardware and virtualization layer
[Figure: an active and a standby VM, each running HA management with its own native Ethernet, native disk and a virtual Ethernet, over the virtualization layer; Ethernet ports P2, P3, P9, P10.]

Realistic Hardware Configuration
- Requires moving ownership of a device from one VM to the other
- Support from the virtualization layer (I/O permission, DMA, IRQ routing)
- Support from the OS: device hot plug, or device activation in sync with application failover!
- HA management replicated, co-located with the application
[Figure: a VM I/O owning the native disk, native Ethernet and physical devices; an active and a standby VM, each running HA management and a user application over a virtual disk and virtual Ethernet; ports P2, P3, P9, P10.]

Relaxing VM I/O SPOF issue
- Upon failure of the VM I/O, restart it; it is seen as a non-replicated resource by HA management
- Virtualized devices wait for VM I/O recovery
- Issue: reset of devices without reset of the board!
[Figure: a VM I/O owning the native disk, native Ethernet and physical devices; an active and a standby VM, each running HA management and a user application over a virtual disk and virtual Ethernet; ports P2, P3, P9, P10.]

Complex scenario
- Upon active VM failure, the standby takes over (=> alone)
- The alone VM grabs the physical devices (disk, Ethernet) owned by the failed VM
- Need multipath support in the OS, without page fault during the switch!
- The alone VM exports virtual devices to the other VMs, which rebind!
- Front-end device drivers must be able to rebind
[Figure: the failed active VM (native Ethernet, native disk, virtual Ethernet) and the alone VM (virtual Ethernet, virtual disk, native Ethernet), both running HA management over the virtualization layer; Ethernet ports P2, P3, P9, P10.]

Simple configuration (SC): 1 SBC, 1 dual core processor
Simplified failure model of a single board with 1 dual core processor, without software.
States: OK and HW failed, with failure rate λ_hw (OK to HW failed) and repair rate µ_hw (HW failed to OK).
Assumptions:
- Currently, failure of a single core implies failure of the processor (i.e. of both cores)
- Failure of the processor is identical to failure of the board
- Failure of I/O peripherals is considered equivalent to failure of the board
- Repair requires changing the board
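For a two-state model like this one, the steady-state availability has the closed form A = µ / (λ + µ). The sketch below evaluates it with the hardware figures guessed later in the deck (MTBF of 17 520 hours, MTTR of 6 hours). The 8 760 hours per year and the function names are assumptions of this sketch, and it is not meant to reproduce the MEADEP results.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of a two-state (OK / failed) Markov model."""
    failure_rate = 1.0 / mtbf_hours   # lambda, per hour
    repair_rate = 1.0 / mttr_hours    # mu, per hour
    return repair_rate / (failure_rate + repair_rate)


def yearly_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours per year spent in the failed state."""
    unavailability = 1.0 - steady_state_availability(mtbf_hours, mttr_hours)
    return unavailability * HOURS_PER_YEAR


hw_availability = steady_state_availability(17520.0, 6.0)
hw_downtime = yearly_downtime_hours(17520.0, 6.0)  # just under 3 hours
```

With these figures the hardware alone accounts for just under 3 hours of downtime per year, which is why the software-level changes studied next can only move the total by a small amount.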

SC + 1 SMP OS + 1 application
Failure of any component leads to unavailability of service.
States: OK, App failed, OS failed, HW failed. Failure rates: λ_App, λ_OS, λ_hw. Repair rates: µ_App, µ_OS, µ_hw.
- Application repair: restart the application
- OS repair: reboot the OS and restart the application; might be fast restart or hard reset
- Board repair: change the board
[Figure: Application A on an SMP OS on a dual core board, with its state transition diagram.]

SC + VLX + 2 OS + 2 applications
[Figure: two instances of Appl. A, each on its own OS, over VLX on a dual core board, with the state transition diagram.]
States: OK, 1 App failed, 2 App failed, 1 OS failed, 1 OS + 1 App failed, 2 OS failed, VLX failed, HW failed.
Transition labels in the diagram include 2 * λ_App, λ_App, µ_App, 2 * λ_OS, λ_OS, µ_OS, MIN(µ_App, µ_OS), λ_vlx, µ_vlx, λ_hw, µ_hw.
Green states: available. Red states: unavailable.

States of: SC + VLX + 2 OS + 2 applications
1 App failed (system said available)
- The other application is still up and running; whether the system is said available or not is up to the end user
- Repair: restart the failed application
2 App failed
- After failure of the 1st app instance, the second one fails
- System is unavailable
- Repair: restart the failed applications (done in parallel)
1 OS failed (system said available)
- The application running on that OS has failed too
- The other application is still up and running; whether the system is said available or not is up to the end user
- Repair: reboot that OS and its application

States of: SC + VLX + 2 OS + 2 applications (continued)
1 OS + 1 App failed
- Only an OS remains up and running
- System unavailable
- Repair: restart the failed app, and reboot the failed OS and its application
2 OS failed
- System unavailable
- Repair: reboot the 2 failed OSes and their applications
VLX failed
- System unavailable
- Repair: reboot VLX (board reset or not)
HW failed
- System unavailable
- Repair: change the board
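The available / unavailable classification of these states follows one rule: the system is available when the hardware and VLX are up and at least one application is running on a running OS. A minimal sketch of that rule, with hypothetical names and boolean flags introduced here for illustration:

```python
def system_available(hw_up: bool, vlx_up: bool, os_up: list, app_up: list) -> bool:
    """True if at least one application instance is still providing service.

    os_up[i] and app_up[i] describe the i-th guest OS and the application
    instance hosted on it (hypothetical representation of the model states).
    """
    if not (hw_up and vlx_up):
        return False
    return any(o and a for o, a in zip(os_up, app_up))


# The states of the model, expressed with this representation.
states = {
    "OK":                  system_available(True, True, [True, True], [True, True]),
    "1 App failed":        system_available(True, True, [True, True], [False, True]),
    "2 App failed":        system_available(True, True, [True, True], [False, False]),
    "1 OS failed":         system_available(True, True, [False, True], [False, True]),
    "1 OS + 1 App failed": system_available(True, True, [False, True], [False, False]),
    "2 OS failed":         system_available(True, True, [False, False], [False, False]),
    "VLX failed":          system_available(True, False, [False, False], [False, False]),
    "HW failed":           system_available(False, False, [False, False], [False, False]),
}
```

This encodes the "system said available" convention used on the slides; an end user who needs both instances running would use `all` instead of `any`.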

Guessed Failure and Repair rates (HW and OS)
Application A on an SMP OS (dual core board):
- HW: MTBF once every 2 years = 17 520 hours, λ_hw = 57 077 FIT; MTTR 6 hours, µ_hw = 166 666 666 FIT
- OS: MTBF twice / year = 4 380 hours, λ_OS = 228 310 FIT; MTTR 2 min = 0.0333 hours, µ_OS = 30 × 10^9 FIT
2 × Appl. A, each on its own OS, over VLX (dual core board):
- HW: MTBF once every 2 years = 17 520 hours, λ_hw = 57 077 FIT; MTTR 6 hours, µ_hw = 166 666 666 FIT
- OS: MTBF twice / year = 4 380 hours, λ_OS = 228 310 FIT; MTTR 45 sec (repair is faster), µ_OS = 80 × 10^9 FIT

Guessed Failure and Repair rates (application and VLX)
Application A on an SMP OS (dual core board):
- App: MTBF twice / year = 4 380 hours, λ_App = 228 310 FIT; MTTR 30 sec, µ_App = 120 × 10^9 FIT
2 × Appl. A, each on its own OS, over VLX (dual core board):
- App: MTBF twice / year = 4 380 hours, λ_App = 228 310 FIT; MTTR 30 sec, µ_App = 120 × 10^9 FIT
- VLX: MTBF once every 2 years = 17 520 hours, λ_VLX = 57 077 FIT; MTTR 2 min = 0.0333 hours, µ_VLX = 30 × 10^9 FIT
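The FIT values above follow from the usual definition of 1 FIT as one event per 10^9 hours, so a rate in FIT is 10^9 divided by the mean time in hours. The helper below is a small sketch with a hypothetical name; it truncates to whole FIT, which matches the figures on the slides.

```python
FIT_HOURS = 1e9  # 1 FIT = 1 event per 10^9 hours


def rate_in_fit(mean_time_hours: float) -> float:
    """Convert a mean time (MTBF or MTTR, in hours) into a rate in FIT."""
    return FIT_HOURS / mean_time_hours


lambda_hw = int(rate_in_fit(17520))      # MTBF: once every 2 years
lambda_os = int(rate_in_fit(4380))       # MTBF: twice a year
mu_hw = int(rate_in_fit(6))              # MTTR: 6 hours
mu_os_smp = rate_in_fit(2 / 60)          # MTTR: 2 minutes
mu_os_vlx = rate_in_fit(45 / 3600)       # MTTR: 45 seconds
mu_app = rate_in_fit(30 / 3600)          # MTTR: 30 seconds
```

The same conversion is applied to repair times, which is why the repair rates are so large: a 30 second repair corresponds to 120 × 10^9 repairs per 10^9 hours.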

Resulting Computed Availability (results obtained with MEADEP)
- Application A on an SMP OS: downtime 3.0405862 hours per year
- 2 × Appl. A, each on its own OS, over VLX: downtime 3.0155951 hours per year
- Uptime increased by ~90 seconds / year, independent of HW
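The "~90 seconds" figure is simply the difference between the two downtime values, converted from hours to seconds. A short check, using only the two numbers reported above (variable names are illustrative):

```python
downtime_smp_hours = 3.0405862   # Application A on an SMP OS
downtime_vlx_hours = 3.0155951   # 2 x Appl. A over VLX

# Yearly uptime gained with the virtualized configuration, in seconds.
gain_seconds = (downtime_smp_hours - downtime_vlx_hours) * 3600
```

The gain is small next to the roughly 3 hours of yearly downtime, which is dominated by the 6 hour board replacement.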

Bibliography
- A Fast Rejuvenation Technique for Server Consolidation with Virtual Machines, Kenichi Kourai, Shigeru Chiba (DSN 2007)
- Hypervisor-Based Fault-Tolerance, Thomas Bressoud, Fred Schneider (ACM TOCS, 1996)