Faculty of Computer Science, Institute for System Architecture, Operating Systems Group
Virtualization
Henning Schild
Dresden, 2009-12-01
So Far...
- Basics
- Introduction
- Threads & synchronization
- Memory
- Real-time
- Resource Management
- Device Drivers
TU Dresden, 2009-12-01, MOS - Virtualization
Today: Virtualization
- Introduction
  - Motivation & classification, flavors
- L4Linux: para-virtualization on top of L4
  - Architecture
  - Address space layout
  - Scenarios
- NOVA, a μ-hypervisor
- KVM on FiascoOC
One possible definition...
Introduction of layers of abstraction between physical resources and users/applications:
- partitioning of resources
- aggregation of resources
- combinations
Virtualization flavours
- Multitasking OS as a layer of abstraction
  - machine partitioning, virtual memory and time slices
- Application level
  - Unix chroot
  - FreeBSD Jails, Solaris Zones, Linux-VServer
  - Wine
- Multiple OSs on one machine
  - VMware, QEMU, VirtualBox
  - UML, Xen, L4Linux
Virtualization: a hype
- A lot of interest in the research community in recent years, e.g.:
  - SOSP 03: Xen and the Art of Virtualization
  - EuroSys 07: a whole session on virtualization
- Many virtualization products: VMware, QEMU, VirtualBox, KVM, Hyper-V
- x86 hardware support
- Further increasing demand:
  - VMware: from 240 to 6300 employees within the last few years
Virtualization - a new idea?
- Originates in IBM's CP/CMS series used on System/3xx mainframes (starting ~1964)
  - Control Program (CP): the VMM
  - Cambridge Monitor System (CMS): the guest OS
- Memory protection
- SIE instruction (VM mode)
  - CP encodes much of the guest privileged state in a hardware-defined format
- IBM's first virtual memory system
Motivation
Virtualization - Motivation
- Optimize utilization: server consolidation
- Isolation: security reasons, incompatibility
- Reusing legacy software, e.g. Windows on Linux
- Development: virtual test machines
Virtualization - Buzzwords
TCO, Virtualization, Availability, Efficiency, Security, Migration, Utilization, Flexibility, Maintainability, Manageability, Virtual Appliance, Consolidation
Formal Requirements
- Equivalence: guest behaviour should match that of the real machine
- Isolation: the host controls resource access; guests are isolated from the host and from each other
- Efficiency: guest code should be executed natively
See paper reading 2010-01-12: Formal requirements for virtualizable third generation architectures
Classification
Virtualization is an overloaded term. Some classification criteria:
- Target: hardware, OS API or ABI?
- Emulation vs. virtualization: do we have to interpret some or all instructions?
- Binary vs. byte code interpretation (e.g. JVM)
- Can we modify the target software (e.g. using para-virtualization techniques)?
Reimplementation of the OS interface
- Used to port a body of existing software to other or newly created OSes
- When copying the API of an OS, target software needs to be re-linked
- In contrast, ABI emulation can run unmodified binaries, e.g. Wine
- Disadvantages of both approaches:
  - huge effort
  - shooting at a moving target
Virtualize the hardware
Instead of emulating the OS API or ABI, take the underlying platform common to many OSs.
- Emulation: interpret/translate guest code
- Virtualization: native execution of guest code, with or without hardware support
- Paravirtualization: modification of the guest
Emulation
- Binary translation/interpretation of guest code
- No native execution: contradicts the efficiency requirement
- Applicable to a lot of architectures
- Often used for peripheral devices
- Examples: QEMU, Bochs
  - QEMU emulates x86, ARM, SPARC, PowerPC...
Platform virtualization in software
- Guest OS runs natively in a less privileged mode
- Privileged instructions fault and are handled by the VMM (trap-and-emulate)
- VMM derives and manages shadow structures from the guest's primary structures, e.g. shadow page tables
- JIT binary translation
- Examples: VMware, KQEMU, VirtualBox
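The trap-and-emulate idea can be illustrated with a small toy model. This is only a sketch: the instruction names, the `VirtualCPU` fields and the helper functions are hypothetical and stand in for real hardware behaviour, where the trap is a CPU fault rather than a Python exception.

```python
from dataclasses import dataclass, field


class PrivilegedInstructionTrap(Exception):
    """Raised when deprivileged guest code executes a privileged instruction."""

    def __init__(self, instr):
        super().__init__(instr)
        self.instr = instr


@dataclass
class VirtualCPU:
    """Guest-visible CPU state kept by the VMM (hypothetical, heavily reduced)."""
    regs: dict = field(default_factory=lambda: {"eax": 0})
    interrupts_enabled: bool = True
    pc: int = 0


# Toy instruction set: only these two count as privileged here.
PRIVILEGED = {"cli", "sti"}


def execute_natively(vcpu, instr):
    """Stand-in for direct execution on the real CPU in a less privileged mode."""
    if instr in PRIVILEGED:
        raise PrivilegedInstructionTrap(instr)
    if instr == "inc eax":
        vcpu.regs["eax"] += 1


def emulate(vcpu, instr):
    """VMM handler: apply the instruction's effect to the virtual CPU state only."""
    if instr == "cli":
        vcpu.interrupts_enabled = False
    elif instr == "sti":
        vcpu.interrupts_enabled = True


def run_guest(vcpu, program):
    """Run guest code and return the number of traps the VMM had to handle."""
    traps = 0
    while vcpu.pc < len(program):
        instr = program[vcpu.pc]
        try:
            execute_natively(vcpu, instr)
        except PrivilegedInstructionTrap as trap:
            emulate(vcpu, trap.instr)
            traps += 1
        vcpu.pc += 1
    return traps
```

Unprivileged instructions never reach the VMM in this model, which is where the efficiency of the approach comes from; every privileged instruction costs one trap.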
X86 Virtualization
Problems with x86 virtualization
- Ring-alias problem: guest OS runs at a privilege level > 0
- Address space compression: part of the guest OS's address space is used by the VMM (e.g. IDT, GDT)
- Some instructions do not trap, e.g. popf: pops the stack into the EFLAGS register; causes interrupt handling problems because IF is not updated in user mode
- Faulting implies performance loss: kernel entry/exit -> doubled context switch
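The popf problem can be made concrete with a reduced model of the instruction. This is a sketch, not a full description of POPF: it only models the rule that the interrupt flag IF is left unchanged, without any fault, when the current privilege level (CPL) is numerically greater than IOPL. The function name and parameters are chosen for illustration.

```python
IF_FLAG = 1 << 9  # interrupt enable flag, bit 9 of EFLAGS


def popf(eflags, popped, cpl, iopl=0):
    """Return the new EFLAGS value after POPF (reduced model).

    With sufficient privilege (CPL <= IOPL) the popped value is taken as is.
    Otherwise IF silently keeps its old value and no trap is raised, so a
    VMM relying on trap-and-emulate never notices the attempt.
    """
    if cpl <= iopl:
        return popped
    return (popped & ~IF_FLAG) | (eflags & IF_FLAG)
```

A guest kernel that was moved from ring 0 to ring 3 and tries to disable interrupts with popf therefore continues with IF still set in its view of the flags, while believing interrupts are off.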
Hardware enabled virtualization
Example: Intel VT
- Root and non-root mode, VM entry and exit
- Virtual Machine Control Structure (VMCS) in physical memory holds guest and host state plus additional control information
- VMCS is used to configure and inspect VM exit conditions, e.g. whether a guest traps when masking or unmasking interrupts
- AMD SVM is similar
Hardware enabled virtualization
- Problematic instructions trap
- Reduced software complexity
- Examples: KVM, VirtualBox, Xen, Hyper-V, Windows 7 XP Mode, Parallels...
MMU Virtualization
Shadow page tables
- Memory tracing of the page tables
- Decode and emulate the guest's page faults
[Figure: guest virtual memory, guest page table, guest physical memory, host virtual memory, host page table, host physical memory; the shadow page table leads from guest virtual memory to host physical memory]
Shadow page tables
1) Page fault in guest (GVA)
2) Caught by hypervisor/VMM
3) Parse guest page tables (GVA -> GPA)
4) Maybe inject page fault into guest and parse again
5) Translate guest page table entry to shadow page table entry (GPA -> HVA -> HPA)
6) Create mapping in shadow page table and resume
Costly; recent x86 processors come with hardware support.
Legend: GVA = guest virtual address, GPA = guest physical address, HVA = host virtual address, HPA = host physical address
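The six steps above can be sketched with dictionaries standing in for page tables. All class and field names are hypothetical; real shadow paging also has to write-protect and trace the guest's page tables, which this sketch leaves out.

```python
PAGE = 4096


class GuestPageFault(Exception):
    """The guest's own page table has no mapping; the fault goes to the guest."""


def page(addr):
    return addr & ~(PAGE - 1)


def offset(addr):
    return addr & (PAGE - 1)


class ShadowMMU:
    def __init__(self, guest_pt, gpa_to_hva, host_pt):
        self.guest_pt = guest_pt      # GVA page -> GPA page, owned by the guest
        self.gpa_to_hva = gpa_to_hva  # GPA page -> HVA page, VMM memory layout
        self.host_pt = host_pt        # HVA page -> HPA page, host page table
        self.shadow_pt = {}           # GVA page -> HPA page, used by the hardware
        self.faults = 0

    def handle_fault(self, gva):
        self.faults += 1                            # steps 1-2: fault reaches the VMM
        gpa = self.guest_pt.get(page(gva))          # step 3: GVA -> GPA
        if gpa is None:
            raise GuestPageFault(hex(gva))          # step 4: inject into the guest
        hpa = self.host_pt[self.gpa_to_hva[gpa]]    # step 5: GPA -> HVA -> HPA
        self.shadow_pt[page(gva)] = hpa             # step 6: fill shadow entry

    def access(self, gva):
        """Translate a guest virtual address the way the hardware would."""
        if page(gva) not in self.shadow_pt:
            self.handle_fault(gva)
        return self.shadow_pt[page(gva)] | offset(gva)
```

Only the first access to a page pays for the fault; later accesses hit the shadow entry directly.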
MMU Virtualization with HW support
- Hardware can walk two layers of page tables
- VM page table constructed by the VMM maps GPA to HPA
- Guest manages its own GVA to GPA tables
- No shadow paging in software required
- Guest page faults can be resolved without mode switching
- AMD: nested paging, Intel: EPT
- Significant performance increase for VMs
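A minimal sketch of the two-stage translation, with flat dictionaries standing in for the guest's table (GVA to GPA) and the VMM's table (GPA to HPA). The function names are hypothetical. `walk_references` gives the worst-case number of memory references for a two-dimensional walk without any caching, assuming multi-level tables on both sides: every guest table pointer is itself a guest-physical address that needs a full host walk.

```python
PAGE = 4096


def translate(table, addr):
    """Single-stage lookup: page base from the table, offset kept as is."""
    return table[addr & ~(PAGE - 1)] | (addr & (PAGE - 1))


def nested_translate(guest_pt, nested_pt, gva):
    """Two-stage lookup done by hardware: GVA -> GPA -> HPA."""
    gpa = translate(guest_pt, gva)      # table managed by the guest
    return translate(nested_pt, gpa)    # table constructed by the VMM


def walk_references(guest_levels, host_levels):
    """Worst-case memory references of an uncached two-dimensional walk."""
    return (guest_levels + 1) * (host_levels + 1) - 1
```

For four levels on each side, `walk_references(4, 4)` evaluates to 24, compared with 4 for a native walk, which is why TLBs and page-walk caches matter so much with nested paging.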
Paravirtualization
Paravirtualization
- Modify the guest OS to integrate it into the runtime environment of another OS
- Advantages:
  - no hardware support required
  - cooperation from guests possible
- Disadvantages:
  - source code required
  - high development cost
- Examples: L4Linux, Xen, User Mode Linux, coLinux
- Afterburner (Karlsruhe): modifies binary code
- Paravirtualized drivers: VMware, KVM (virtio)
XEN
Examples from TUDOS group
L4Linux
L4Linux: history
- Presented at SOSP '97
- Based on x86 Linux 2.0, on top of the first L4 kernel
- (L4)Linux has evolved over the years:
  - 2.2: supported MIPS and x86
  - 2.4: first version to run on L4Env
  - 2.6: uses 'paravirtualization' L4 kernel features
- Latest L4Linux release: 2.6.31
  - x86 and ARM support
  - SMP
Linux Architecture
[Figure: applications in user mode on top of the system-call interface; the Linux kernel consists of an architecture-independent part (file systems: VFS, file system implementations; device drivers; networking: sockets, protocols; processes: scheduling, IPC; memory management: page allocation, address spaces, swapping) between architecture-dependent layers for the system-call interface and for hardware access; hardware: CPU, memory, PCI, devices]
Linux Architecture
- Architecture-dependent part: small, for x86 about 2% of the kernel
- System call interface:
  - kernel entry
  - signal delivery
  - copy from/to user space
- Hardware access:
  - CPU state and features
  - MMU
  - interrupts
  - memory-mapped I/O, I/O ports
- The architecture-dependent part implements a generic interface used by the architecture-independent part
L4Linux Architecture
[Figure: the Linux kernel (architecture-dependent system-call interface and hardware access layers around the architecture-independent part) runs in an L4 task in user mode; each Linux application runs in an L4 task of its own; sigma0, moe, L4IO and Console run alongside; FiascoOC runs in kernel mode on the hardware]
L4Linux Architecture
- The Linux kernel and each Linux user process run in a separate L4 task
- The L4/L4RE-specific part is implemented as a separate architecture:
  - arch/l4
  - include/asm-l4
- The L4/L4RE architecture-dependent part is itself divided into x86- and ARM-specific parts
- Most code is reused from the x86 or ARM architecture-specific code
Linux address space layout
- 0x0 to TASK_SIZE: user address space (application, libraries, ...); the user part changes on every context switch
- TASK_SIZE to 0xF...: kernel address space (kernel image, physical memory, vmalloc, kmap); the kernel part is constant in all address spaces
- PAGE_OFFSET = 0xC0000000
- Physical memory is mapped beginning at PAGE_OFFSET
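The address arithmetic implied by this layout can be sketched as follows. The helper names are hypothetical, and the sketch assumes the classic 32-bit x86 split in which TASK_SIZE equals PAGE_OFFSET; it ignores that only part of physical memory fits into the direct mapping.

```python
PAGE_OFFSET = 0xC0000000  # start of the kernel part, as on the slide
TASK_SIZE = 0xC0000000    # assumption: user part ends where the kernel part begins


def phys_to_virt(paddr):
    """Kernel virtual address of a physical address in the direct mapping."""
    return paddr + PAGE_OFFSET


def virt_to_phys(vaddr):
    """Inverse of phys_to_virt; only valid for direct-mapped kernel addresses."""
    if vaddr < PAGE_OFFSET:
        raise ValueError("not a direct-mapped kernel address")
    return vaddr - PAGE_OFFSET


def is_user_address(vaddr):
    """True if the address lies in the per-process user part."""
    return 0 <= vaddr < TASK_SIZE
```

Because the kernel part is identical in every address space, these conversions work regardless of which process is currently running.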
L4Linux address space layout
[Figure: left, the native Linux layout: user address space (application, libraries, ...) from 0x00000000 to TASK_SIZE and kernel address space (kernel image, physical memory, vmalloc, kmap) from PAGE_OFFSET = 0xC0000000 to 0xFFFFFFFF. Right, two separate L4 address spaces, each with the FiascoOC microkernel from 0xC0000000 to 0xFFFFFFFF: the L4Linux user process (application, libraries, ...) and the L4Linux server (kernel image, guest-physical memory, vmalloc, kmap) with its own PAGE_OFFSET]
L4Linux: problems to be solved
- The L4Linux server has to:
  - have some basic resources (memory, I/O)
  - manage page tables of its user processes
  - handle exceptions from user processes
  - schedule its tasks
- L4Linux user processes have to:
  - 'enter' the L4Linux kernel (now in a different address space)
- The kernel needs information from user processes that was formerly accessible in the same address space, e.g. syscall arguments
Linux address space management
- Architecture-independent part:
  - general page table management
  - implements allocator strategies
  - page replacement strategies
  - assumes a 4-level page table provided by the architecture-dependent part
- Architecture-dependent part:
  - set, remove and test entries
  - TLB handling
  - Linux for x86 uses 2-level page tables
[Figure: Linux kernel memory management (page allocation, address spaces, swapping) on top of the architecture-dependent part (i386) and the hardware; application with thread_info]
L4Linux address space management
- L4Linux user processes are actually L4 tasks
- The L4Linux server is their pager
- Hardware page tables are managed by the L4 kernel
- L4Linux page tables are mirrored
- L4Linux uses map/unmap operations
- Page table entries are added lazily (when a page fault occurs)
[Figure: Linux kernel memory management (page allocation, address spaces, swapping) on top of the architecture-dependent part (i386), the Fiasco kernel and the hardware; application with thread_info]
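The lazy mapping scheme can be sketched as a pager that keeps Linux's page table as bookkeeping only and issues a map operation when a fault arrives. All class and method names here are hypothetical; the dictionaries stand in for the Linux page table and for the hardware page table that only the L4 kernel may modify.

```python
class L4Task:
    """Stand-in for an L4 task; hw_mappings models the kernel-owned page table."""

    def __init__(self, name):
        self.name = name
        self.hw_mappings = {}  # virtual page -> page in the server's memory


class L4LinuxServer:
    """Pager for its user processes (heavily reduced sketch)."""

    def __init__(self):
        self.linux_pt = {}  # (task name, virtual page) -> server page
        self.map_ops = 0
        self.unmap_ops = 0

    def set_pte(self, task, vpage, server_page):
        # Only Linux's own page table changes; nothing is mapped yet (lazy).
        self.linux_pt[(task.name, vpage)] = server_page

    def clear_pte(self, task, vpage):
        # Removing an entry must be mirrored immediately with an unmap.
        self.linux_pt.pop((task.name, vpage), None)
        if task.hw_mappings.pop(vpage, None) is not None:
            self.unmap_ops += 1

    def handle_page_fault(self, task, vpage):
        """Called for a page-fault IPC; returns True if a mapping was sent."""
        server_page = self.linux_pt.get((task.name, vpage))
        if server_page is None:
            return False  # Linux has to treat this as a genuine fault
        task.hw_mappings[vpage] = server_page  # map operation
        self.map_ops += 1
        return True
```

Setting an entry is cheap because no L4 operation is needed until the page is actually touched, while removal cannot be deferred without leaving stale mappings in the task.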
General exception handling
- If an L4 task raises an exception, the kernel sends an exception IPC to the handler (feature in FiascoOC and L4.X2)
- The exception IPC contains the CPU state of the client
- The exception handler can reply with a new state, for instance another instruction pointer
- Exception IPC can be used to recognize Linux system calls: INT 0x80 will trigger an exception
- The L4Linux server acts as exception handler for its user processes
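A sketch of how an exception handler could turn an exception IPC into a Linux system call. The `ExceptionState` layout, its `vector` field and the function name are hypothetical and do not reflect the actual Fiasco message format. The sketch assumes the reported instruction pointer still points at the `int 0x80` instruction, so the reply has to skip it.

```python
from dataclasses import dataclass, replace

INT80_LEN = 2    # "int 0x80" is encoded in two bytes (CD 80)
SYS_GETPID = 20  # getpid on Linux i386
ENOSYS = 38


@dataclass(frozen=True)
class ExceptionState:
    """CPU state carried in the exception IPC (hypothetical, reduced)."""
    vector: int  # software interrupt number that caused the exception
    eip: int
    eax: int     # system call number on entry, return value in the reply
    ebx: int     # first system call argument


def handle_exception_ipc(state, pid):
    """Return the reply state, or None if this was not a Linux system call."""
    if state.vector != 0x80:
        return None
    if state.eax == SYS_GETPID:
        result = pid
    else:
        result = -ENOSYS  # everything else is unimplemented in this sketch
    # Reply with a new state: result in eax, instruction pointer past the INT.
    return replace(state, eax=result, eip=state.eip + INT80_LEN)
```

The user process never notices the detour: after the reply it resumes behind the `int 0x80` with the return value in eax, exactly as on native Linux.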
L4Linux kernel entry
- System call costs:
  - 2x kernel entry/exit (exception and reply)
  - 2x address space switch
[Figure: numbered path of a system call (steps 1 to 4) between the L4Linux user process (INT 0x80), the Fiasco microkernel and the L4Linux server (architecture-dependent and architecture-independent parts)]
Interrupt handling
- Interrupt messages are received in separate threads
- Interrupt threads run at a higher priority than other Linux threads (Linux semantics)
- An interrupt thread wakes up the idle thread or forces the running user process to enter the Linux server
- Plain Linux disables interrupts for synchronization; L4Linux uses a lock instead of CLI/STI
[Figure: L4Linux server with device driver, main thread and interrupt threads, registered via request_irq(irq_no, handler, ...), on top of L4IO, the Fiasco kernel and the hardware]
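Replacing CLI/STI by a lock can be sketched with ordinary threads. The class and function names are hypothetical, and thread priorities are not modelled; the sketch only shows that "interrupts disabled" becomes "lock held", so interrupt threads cannot run their handlers inside a critical section of the main thread.

```python
import threading


class VirtualInterruptFlag:
    """Stands in for CLI/STI: disabling interrupts means taking a lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def cli(self):
        self._lock.acquire()

    def sti(self):
        self._lock.release()


def run(irq_threads=4, events=1000, rounds=1000):
    """Let interrupt threads and the main thread update shared kernel state."""
    flag = VirtualInterruptFlag()
    state = {"counter": 0}

    def critical_update():
        flag.cli()
        try:
            value = state["counter"]      # read-modify-write that must not
            state["counter"] = value + 1  # interleave with other threads
        finally:
            flag.sti()

    def interrupt_thread():
        for _ in range(events):
            critical_update()  # handler runs only while "interrupts are enabled"

    threads = [threading.Thread(target=interrupt_thread) for _ in range(irq_threads)]
    for t in threads:
        t.start()
    for _ in range(rounds):
        critical_update()  # main thread: Linux server critical section
    for t in threads:
        t.join()
    return state["counter"]
```

With the lock in place no update is lost, so the final counter equals the total number of updates from all threads.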
Not covered in detail here...
- The Linux kernel needs to access the address space of user processes (e.g. syscall arguments)
  - walk the page tables of the user process
- Security problems with DMA
  - move device drivers out of L4Linux
  - I/O MMU
- L4Linux scheduling
  - only one L4Linux process is active at a time
  - other processes are waiting in IPC (exception or page fault)
Hybrid applications
- Linux applications that are 'L4 aware'
- Need to be detected by the Linux server
  - the Linux server puts them in UNINTERRUPTIBLE state in its own data structures
  - this will not disturb ongoing IPC in the hybrid task
- L4Linux user processes run as Aliens
  - special alien flag used when creating a task
  - Aliens trap when making L4 system calls
  - the exception handler monitors the system call
  - Fiasco-only feature
L4Linux use cases
Real-time video player
- L4Linux user processes might use L4 services
[Figure: an MPlayer frontend on L4Linux controls the RT-MPEG player; further components: Loader, Roottask, moe, DOpE, on top of the FiascoOC kernel]
Multiple L4Linux instances
- Using multiple instances concurrently, e.g. one for each security domain
- Devices need to be multiplexed (see resource management lesson: ORe, nitpicker, windhoek, ...)
- Communication through network, special IPC monitors...
[Figure: two L4Linux servers, each with its own applications, on top of the virtualization infrastructure (Loader, Roottask, console, moe) and the FiascoOC kernel]
Use L4Linux as a toolbox
- L4Linux instances can provide access to various complex software stacks, e.g.:
  - network stacks
  - drivers
  - file systems
[Figure: an L4 application uses a file system inside L4Linux through an alien file system wrapper; Loader, Roottask, moe; Fiasco kernel]
Faithful Virtualization
NOVA μ-hypervisor approach
- NOVA: NOVA OS Virtualization Architecture
- Separate hypervisor and VMM(s)
[Figure: guest OSes run in non-root mode; in root mode, one VMM per guest and a server run in user mode, and the hypervisor runs in kernel mode]
NOVA
- The hypervisor manages protection domains: address spaces and virtual machines
- Each virtual machine has an associated virtualization handler: the VMM (codename: Vancouver)
- VMMs handle virtualization faults and implement virtual devices
- Splitting functionality between hypervisor and VMM reduces the complexity of the hypervisor, which also runs security-sensitive applications beside the VMs
FiascoOC and KVM-L4
- FiascoOC provides AMD SVM support
- KVM can be reused with little modification
[Figure: qemu-kvm processes on an L4Linux server extended by KVM-L4, each with one guest OS; Loader, Roottask, DMPhys, Names; Fiasco kernel; axes: guest/host and user/kernel]
FiascoOC and KVM-L4
- FiascoOC supports AMD SVM
  - memory is mapped to VMs using the map/unmap mechanism
  - invoke the VM capability to enter guest mode
- An existing VMM can be reused
  - KVM with little modification
  - low development cost
- Virtual machines next to secure applications
Summary
- Virtualization flavours
  - API or ABI emulation
  - Emulation
  - Full virtualization: hardware (especially x86) or OS
  - Paravirtualization
- L4Linux: paravirtualization in detail
  - Address space layout & management
  - Taming Linux (interrupts, I/O memory)
- Faithful virtualization
  - NOVA: minimal hypervisor + VMM from scratch
  - KVM-L4: reusing a VMM
References
- Tom Van Vleck: 'The IBM 360/67 and CP/CMS', http://www.multicians.org/thvv/360-67.html
- Keith Adams and Ole Agesen: 'A Comparison of Software and Hardware Techniques for x86 Virtualization', ASPLOS 2006, http://www.vmware.com/pdf/asplos235_adams.pdf
- Intel Virtualization Technology, http://www.intel.com/technology/itj/2006/v10i3/1-hardware/1-abstract.htm
- H. Härtig, M. Roitzsch, A. Lackorzynski, B. Döbel and A. Böttcher: 'L4 Virtualization and Beyond'
References
- Udo Steinberg: 'NOVA Hypervisor Architecture Whitepaper', Internal Report, 2007
- L4Linux Webpage, http://os.inf.tu-dresden.de/l4/linuxonl4
- Adam Lackorzynski: 'L4Linux Porting Optimizations', Diploma Thesis, 2004, http://os.inf.tu-dresden.de/papers_ps/adam-diplom.pdf
Outlook
- Now, paper reading: Singularity - Rethinking the Software Stack
- Next weeks:
  - legacy containers
  - OS personalities