I/O and virtualization
CSE-C3200 Operating Systems, Autumn 2015 (I), Lecture 8
Vesa Hirvisalo
Today
- I/O management
  - Control of I/O
  - Data transfers, DMA (Direct Memory Access)
- Buffering
  - Single buffering
  - Double buffering
- Virtualization
  - The process abstraction defines a virtualization: why do we need something else?
  - Compared to emulation and simulation
  - Different classes of virtualization
  - Virtual machines, virtualization engines, etc., and their support
I/O management
Introduction
- Reminder: the main tasks of an OS are resource management and abstraction of hardware
- Peripheral devices
  - There are often many of them, of different types
  - They must be managed and abstracted
  - They appear as the filesystem, the network, etc.
- Usually a significant proportion of a large OS is driver code
  - The driver code is typically also the I/O code
I/O and devices
- Often there is plenty of structure
  - Busses, bridges, controllers, etc. (below a classical PC)
  - Accessing the structure and its parts is not trivial
- From old mainframes to newer devices, computer hardware is evolving rapidly
  - There is no universal computer
  - Stand-alone computers are (more or less) dead
- Huge variations
  - The traditional computer architecture (the PC)
  - The novel computer architectures
Devices (T. Lilja)
- Devices come in a wide variety
  - Human readable: display, keyboard, mouse, etc.
  - Machine readable: hard disks, USB keys, sensors
  - Communication: modems, ...
- Key differences
  - Data rate: orders-of-magnitude differences
  - Application: e.g., a hard disk vs. a keyboard
  - Complexity of control: e.g., disk vs. printer
  - Unit of transfer: block vs. character (stream) oriented
  - Data representation: data encoding differs from device to device
  - Error conditions: how errors are handled and reported back
Control of I/O
- Programmed I/O (typically polling)
  - The OS (CPU) issues I/O commands on behalf of a process
  - The process waits until the operation is complete
- Interrupt-driven I/O
  - If the instruction is non-blocking, the process continues
  - If the instruction is blocking, the process is moved to the blocked state
  - Once the I/O is finished, an interrupt is issued
- Direct Memory Access (DMA)
  - The processor initiates the data transfer
  - The DMA module transfers the data independently
  - The DMA module issues an interrupt once the transfer is completed
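The programmed (polled) I/O scheme can be illustrated with a small simulation. The device model, its register names, and the number of busy polls are all invented for illustration; the point is only that the CPU busy-waits on a status register instead of doing useful work:

```python
class FakeDevice:
    """Simulated device: reports 'busy' for a few polls, then data is ready."""
    def __init__(self, data, busy_polls=3):
        self.data = data
        self.busy_polls = busy_polls

    def status_ready(self):
        if self.busy_polls > 0:
            self.busy_polls -= 1      # device still transferring
            return False
        return True

    def read_register(self):
        return self.data

def programmed_io_read(dev):
    """CPU busy-waits (polls) until the device reports ready."""
    polls = 0
    while not dev.status_ready():     # the CPU does nothing useful here
        polls += 1
    return dev.read_register(), polls

data, polls = programmed_io_read(FakeDevice(b"hello", busy_polls=3))
```

Every poll in the loop is a wasted CPU cycle, which is exactly what interrupt-driven I/O and DMA avoid.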
Data transfers
- Small amounts can be transferred by the processor
  - Big amounts call for HW support; otherwise the CPU becomes a bottleneck
- A DMA transfer specifies
  - The operation type: read or write
  - The address of the I/O device involved (e.g., a network card)
  - The memory address where to start the operation
  - The number of words to be read or written
- The PCI architecture includes arbitration
  - PCI devices can request control of the bus and issue memory read/write operations
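The four parameters of a DMA transfer listed above can be captured in a descriptor. This is a sketch with invented field names, plus a simulated read transfer that ends by "raising" the completion interrupt:

```python
from dataclasses import dataclass

@dataclass
class DMADescriptor:
    op: str        # operation type: "read" or "write"
    device: int    # address of the I/O device involved
    mem_addr: int  # memory address where the operation starts
    count: int     # number of words to be transferred

def dma_read(desc, memory, device_words):
    """Simulate a DMA read: copy words device -> memory without the CPU,
    then signal completion with an interrupt."""
    assert desc.op == "read"
    for i, word in enumerate(device_words[:desc.count]):
        memory[desc.mem_addr + i] = word
    return "interrupt"                # completion signalled to the CPU

memory = {}
desc = DMADescriptor(op="read", device=1, mem_addr=100, count=2)
result = dma_read(desc, memory, [0xAA, 0xBB, 0xCC])
```

The processor only fills in the descriptor and starts the transfer; it is free until the interrupt arrives.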
Organization
- Sharing the system bus with memory and the CPU
  - The DMA module performs programmed I/O to transfer data
  - Inefficient: the DMA must issue a transfer request and a transfer, which both go to the same system bus
- Explicit paths from the DMA to the I/O modules
  - Saves bus cycles: avoids the transfer request
  - I/O systems might have their own DMA modules or share them
- Dedicated I/O bus
  - Allows sharing the DMA module among I/O devices
  - Easily expandable
  - Communication among I/O devices is possible without going through memory
Layered I/O handling
- Layered architecture
  - The lower the level, the closer to the HW
  - Layers should communicate through well-defined APIs
- Logical I/O
  - Provides the interface for the processes: open, close, read, write
- Device I/O
  - Data is converted to I/O instructions, channel and control commands
  - Buffering techniques may be used
- Scheduling and control
  - I/O requests are scheduled and executed
  - Handles interrupts, memory transfers, status updates
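A minimal sketch of the layering, assuming invented class and method names: the logical I/O layer offers read-style calls to processes and talks to the device layer only through its narrow interface, while the device layer turns each request into a (here, merely logged) controller command:

```python
class DeviceIO:
    """Device layer: converts requests into device/controller commands."""
    def __init__(self):
        self.log = []

    def read_block(self, block_no):
        # In a real driver this would become an I/O instruction or
        # channel command; here we just record it.
        self.log.append(("READ_BLOCK", block_no))
        return b"\x00" * 4            # pretend 4-byte block

class LogicalIO:
    """Logical layer: the open/read-style interface seen by processes."""
    def __init__(self, dev):
        self.dev = dev                # only talks to the layer below
        self.pos = 0

    def read(self, nblocks):
        data = b"".join(self.dev.read_block(self.pos + i)
                        for i in range(nblocks))
        self.pos += nblocks
        return data

dev = DeviceIO()
f = LogicalIO(dev)
data = f.read(2)
```

Each layer can be replaced independently as long as the interface between them stays fixed, which is the point of the well-defined APIs.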
Buffering
I/O Buffering
- Typical issues in I/O
  - There are significant speed differences
  - Efficient transfers must be large bursts of data
  - There can be latencies: no one at the other end is listening
- Therefore
  - Considering control: requests are deferred (asynchronous handling)
  - Considering data: buffers (i.e., reserved memory areas) are used
- There typically are several buffers: user, kernel, device, read, write, ...
I/O Device Types
- Different classifications per OS; we use the UNIX ones
- Stream or character oriented
  - Data is transferred one unit at a time
  - Random access is not usually supported
  - E.g., keyboard, mouse
- Block oriented
  - Data is moved in blocks; the block size for an HDD is, e.g., 4096 bytes
  - Random accessing of data is possible (or even fast)
  - Usually linearly addressed
Single Buffer: Block-Oriented Devices
- Reading
  - I/O device input is first transferred to the kernel-space buffer
  - Once completed, the buffer is moved to user space and another request for the I/O is issued (read-ahead)
  - At the end, one unnecessary read is done
- Writing
  - Process data is first copied to a kernel buffer
  - The user process is free to continue its run
  - The data is written to the I/O device at a later time
- The process need not hang waiting for I/O to be completed
- Swapping the userland side of the I/O buffer is possible
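The read-ahead behavior, including the one unnecessary read at the end, can be sketched as a simulation (the function and variable names are invented for illustration):

```python
def single_buffered_reads(device_blocks, nreads):
    """Simulate single buffering with read-ahead: while block i is handed
    to the user process, the request for block i+1 is already issued."""
    kernel_buf = device_blocks[0]       # initial read into the kernel buffer
    delivered = []
    issued = 1                          # requests issued to the device so far
    for _ in range(nreads):
        delivered.append(kernel_buf)    # buffer is moved to user space
        if issued < len(device_blocks): # read-ahead: request the next block
            kernel_buf = device_blocks[issued]
        issued += 1
    return delivered, issued

delivered, issued = single_buffered_reads(["b0", "b1", "b2", "b3"], nreads=3)
```

For 3 user reads the kernel issues 4 device requests: the last read-ahead fetches a block nobody asked for.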
Single Buffer: Stream-Oriented Devices
- Line-at-a-time buffering
  - The kernel buffer is filled until a line termination character is found in the input stream
  - E.g., a terminal window
- Byte-at-a-time buffering
  - A single byte is read to the kernel and moved to the user
  - Avoids writing the data directly to the user address space
  - E.g., a mouse in a GUI
- Reading
  - The user process is suspended until a line of input is read
  - Avoids context switches between the user process and I/O handling
- Writing
  - The user process can write a line and continue
  - It must suspend only if another line of output is produced before the previous one gets written to the I/O device (flushed)
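Line-at-a-time buffering can be sketched in a few lines: bytes accumulate in a kernel-side buffer and a whole line is handed over only when the terminator is seen (function name invented for illustration):

```python
def line_buffer(byte_stream):
    """Fill a kernel buffer until the line termination character,
    then hand the whole line to the user process at once."""
    buf = bytearray()
    for b in byte_stream:
        buf.append(b)
        if b == ord("\n"):          # line termination character found
            yield bytes(buf)        # deliver the completed line
            buf.clear()

lines = list(line_buffer(b"ls -l\ncat\n"))
```

The user process sees two deliveries instead of ten single-byte transfers, which is why terminals use line buffering while a GUI mouse needs byte-at-a-time delivery.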
Double Buffering
- We can assign two kernel buffers to the I/O operation
  - One is used by the process for reading/writing
  - The other is used by the kernel/device driver for I/O
  - When the operation(s) complete, the buffers are swapped
- Allows simultaneous access for both the OS and the process
  - With single buffering, the process must block if the device driver is currently updating the buffer
- Using more than two buffers (circular buffering)
  - Needed if the process performs rapid bursts of complex I/O
- Note: a buffer is an area with concurrency control
  - The point here is the use of locks, semaphores, etc.
  - Smaller granularity gives more opportunities for parallelism
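The swap step can be sketched sequentially (in reality the driver and the process run concurrently, with locks or semaphores guarding each buffer; the names here are invented for illustration):

```python
def double_buffered_copy(blocks):
    """Two buffers: the driver fills one while the process drains the
    other; after each round, the buffers swap roles."""
    fill = []          # buffer the device driver is currently filling
    drain = None       # buffer the process is currently draining
    consumed = []
    for blk in blocks:
        fill.append(blk)            # driver side: I/O completes into 'fill'
        if drain is not None:
            consumed.extend(drain)  # process side: consume the other buffer
        fill, drain = [], fill      # operations done: swap the buffers
    if drain:
        consumed.extend(drain)      # drain the final buffer
    return consumed

out = double_buffered_copy([1, 2, 3])
```

Because the process never touches the buffer the driver is filling, neither side has to block on the other except at the swap point.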
Virtualization
Introduction
- The process abstraction is a virtualization (roughly: a program in execution)
  - But not a very good one
- Process descriptors are large
  - They refer to a whole lot of data: complex memory structures (incl. sharing), complex open files, etc.
  - The data binds processes tightly onto the platform
  - Migration of processes is heavy (sometimes almost impossible)
- Programs need to co-operate
  - Process context switches are heavy
  - Therefore IPC (Inter-Process Communication) is slow
- Threads are no good either
  - Parallel processing with threads is a nightmare
  - Difficult to fix (we get back to this issue in the next lecture)
  - HW threads are hard to replace
Virtualization
- Virtualization: using resources that do not match the real resources
  - Virtualization is based on real resources
  - A virtual printer may be implemented by using several physical printers to do its job
  - A virtual machine is usually based on the computational resources of another (physically existing) machine
- Emulation
  - A form of virtualization where there exists a physical device whose behavior we mimic (without having the physical device)
- Simulation
  - We imitate the operation of a system based on a model
  - E.g., we have an abstract model of a memory system and simulate the operation of the memory system by using the model
Virtual computers (1/2)
- CPU
  - Emulates the CPU instruction set, registers and other internal state
  - E.g., a MIPS simulator run on x86 would allow executing MIPS binaries, provided that the binary does no system calls or external device accesses
- Peripherals
  - Individual device emulation, like memory, hard disks, networks
  - E.g., a distributed storage system can provide the illusion of a single hard disk even though the data is spread across a set of networked disks
- Full system virtualization
  - Models all parts of a real or fictive system, e.g., CPU, memory, PCI bus, network interface
  - If such hardware exists and is supported by an SDK, we can run real binaries unmodified
Virtual computers (2/2)
- Operating system virtualization
  - Several isolated user-space instances share the same kernel
  - User-space instances have their own independent file system hierarchy
  - Resources can be allocated on a per-instance basis
  - E.g., Solaris containers
- Application level virtualization
  - Emulates some run-time requirements of applications, e.g., the system calls of a kernel
  - FreeBSD's Linux system call emulation allows running unmodified Linux binaries on a FreeBSD host
  - WINE allows running Windows binaries on Linux
- Programming language virtual machines
  - The Java Virtual Machine provides runtime support for Java byte code
  - For these, usually there is no real hardware counterpart
Classification of virtualization (1/2)
- Separation of guest and host (and their OSes)
  - The host operating system runs on real hardware
  - The guest operating system runs in a virtualized environment
- Bare metal architecture
  - The hypervisor runs on real hardware
  - The guest runs on the hypervisor
- Hosted architecture
  - The host operating system runs on real hardware
  - The hypervisor runs on the operating system
- In both cases, the guest runs on the hypervisor
  - Basically, multiple different guest operating systems on top of real hardware
Classification of virtualization (2/2)
- Full virtualization
  - The whole system is completely modelled (CPU, disk, NIC, ...)
  - Allows running unmodified guest operating systems
  - E.g., system-mode QEMU
- Partial virtualization
  - Parts of the hardware are simulated, allowing some code to run unmodified
  - Not a full-blown kernel, but some user-land binaries
  - E.g., the virtual 8086 mode of the x86 architecture
- Paravirtualization
  - The hardware is not necessarily emulated at all
  - The guest OS is modified to be able to run in the paravirtualized environment
Hardware virtualization (1/3)
- A Virtual Machine Monitor (VMM)/hypervisor is capable of virtualizing the full set of hardware resources when the following criteria are met
  - Equivalence: a program running under the VMM behaves essentially identically to one running on the equivalent (real) machine
  - Safety: the hypervisor/VMM must be in complete control of the virtualized resources
  - Performance: most of the instructions should be executed without VMM intervention
- If the safety criterion is broken
  - A guest program can take control of the virtualized resources without ever giving control back to the VMM
- If the performance criterion is broken
  - The VMM may be too slow to provide any useful service
Hardware virtualization (2/3)
- To derive the conditions for virtualizing a hardware architecture, we classify the ISA of a CPU into three categories
- Privileged instructions
  - Cause a trap or exception when run in user mode
  - Do not cause any exception when run in kernel mode
- Control sensitive instructions
  - Change the configuration or state of a resource
  - E.g., the processor execution mode
- Behavior sensitive instructions
  - The result depends on the configuration of a resource
  - E.g., on the content of the relocation register or the processor mode
Hardware virtualization (3/3)
- An effective VM can be constructed if the set of sensitive instructions is a subset of the set of privileged instructions
- Why?
  - All instructions that can affect the functioning of the VMM (i.e., sensitive instructions) must pass control to the VMM: guarantees the safety criterion
  - Non-privileged instructions are executed natively: guarantees the performance criterion
- Classic or trap-and-emulate virtualization
  - The VMM must trap and emulate every sensitive instruction
  - Run non-sensitive instructions natively and, for the sensitive instructions, install a trap handler that is run instead of the OS trap handler
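Trap-and-emulate can be sketched as a dispatch loop. The instruction sets below are a tiny illustrative subset (chosen so that sensitive is a subset of privileged, the condition stated above); a real VMM of course installs hardware trap handlers rather than testing set membership:

```python
PRIVILEGED = {"lidt", "mov_cr3", "hlt"}   # trap when executed in user mode
SENSITIVE  = {"lidt", "mov_cr3", "hlt"}   # touch or observe machine state

def run_guest(instructions):
    """Classic trap-and-emulate: the guest runs deprivileged, so every
    sensitive instruction traps into the VMM and is emulated there;
    everything else executes natively on the CPU."""
    assert SENSITIVE <= PRIVILEGED        # condition for an effective VM
    emulated, native = [], []
    for ins in instructions:
        if ins in PRIVILEGED:             # instruction traps ...
            emulated.append(ins)          # ... and the VMM emulates its effect
        else:
            native.append(ins)            # executed directly, full speed
    return emulated, native

emulated, native = run_guest(["add", "lidt", "mov", "hlt"])
```

If some sensitive instruction were missing from PRIVILEGED, it would silently run natively and the safety criterion would break, which is exactly the classical x86 problem discussed next.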
Virtualization with Intel x86 (1/2)
- The classical x86 architecture had some sensitive instructions that did not produce traps (critical instructions)
  - E.g., critical instructions change the processor or resource state without allowing the virtual machine monitor to intervene
- This causes problems when a VMM runs multiple OSes
  - OS #1 issues an SIDT (Store Interrupt Descriptor Table Register) instruction and installs its interrupt handler vector
  - OS #2 issues SIDT
  - OS #1 invokes an interrupt and ends up in the OS #2 interrupt handler
  - Trap-and-emulate would not work
- Classical x86 can be virtualized by binary translation
  - Replace the sensitive instructions not producing traps with instructions that transfer control to the virtual machine monitor
  - But the performance for critical instructions is poor
Virtualization with Intel x86 (2/2)
- AMD released a virtualization extension called AMD-V in 2005
- Intel followed in 2006, releasing an extension called VT-x, which modifies x86 behavior when running a VMM
  - Two operation modes: VMM mode and guest mode
  - Own address space for the VMM and the guest OSes
  - Control is transferred to the VMM when an OS uses sensitive instructions
  - Virtualized interrupt vectors for the guest OS
  - A Virtual Machine Control Structure is used for context switching between the guest OS and the VMM
- This provided the basic HW virtualization of the CPU, but peripheral device virtualization was still not very efficient
QEMU
- An emulator using Dynamic Binary Translation (DBT)
  - Full software virtualization (no specific HW support required)
- User mode
  - User code is natively executed after DBT (functional emulation of the processor ISA)
  - The OS and the rest of the system are emulated by QEMU
- System mode
  - All code is natively executed after DBT
  - All HW is emulated by QEMU
- Supports various CPUs: x86, PowerPC, ARM, SPARC, ...
  - And a number of peripherals: PCI and ISA bridges, network cards, audio cards, USB controllers, hard disks, ...
- QEMU lacks proper multicore support
  - QEMU can emulate a multicore guest, but using a unicore host (or one core of a multicore)
  - Memory sharing and coherency are issues here
Xen (1/2)
- The Xen hypervisor is run when the system boots
- dom0: runs a modified version of the Linux kernel (the host OS)
  - The guest is aware that it is a virtual machine
  - Makes hypercalls directly, rather than issuing privileged instructions
  - Provides device drivers for all guests
  - Uses the Xend daemon to control the execution of the guest OSes
  - XenStore provides statistics collection
- domU: runs the guest operating systems
  - An unmodified OS if hardware-assisted virtualization is supported
  - Otherwise, guests must be paravirtualized (critical instructions translated and device access remapped)
Xen (2/2) [figure: Xen architecture]
I/O virtualization
Virtual I/O devices
- Similarly as for computing, I/O can be virtualized
  - I/O operations are done with virtual devices
  - The underlying HW implementation may differ significantly
  - A memory copy may be realized as network operations
  - A network transfer may be realized as a memory copy
- Programmability
  - Code portability, migrations, etc. are hard
  - Do such things under the hood
- Performance
  - I/O operations are typically very slow
  - (Parallel) hardware acceleration is often the answer
  - Forget about "hand-coded assembler is faster": this is obsolete, modern systems are far too complex
KVM
- KVM consists of a loadable generic kernel module (kvm.ko) and specific modules for AMD/Intel
  - Guest OSes are run under a modified version of the QEMU emulator
  - Part of the Linux kernel; uses its scheduler and memory management to do the resource division
  - Easy to set up (no reboot needed)
  - No paravirtualization for the CPU, but may support it for I/O
- Compared to QEMU
  - QEMU is purely software-based and somewhat slow
- Compared to Xen
  - Xen is an external hypervisor
  - The host OS needs to be specifically compiled
  - Supports paravirtualization
Docker
- Applications in software containers
  - Abstracts the platform structure away
- Operating-system-level virtualization
  - Not actually a virtual machine; basically a virtualization engine
  - Uses Linux container mechanisms
- Process isolation and co-operation
  - By using the kernel mechanisms
- Toward distributed systems
  - Abstracts the network connection
  - Multiple processes, apps, tasks, etc. run on single or multiple hosts
  - Allows for lightweight communication
- Docker uses the kernel directly
  - Allows for using quotas
  - Does not support live migration
I/O virtualization with Intel x86
- Extended Page Tables (EPT)
  - Translate guest-physical to host-physical addresses
  - The guest OS can modify its own page tables without the VMM
- IOMMU
  - Allows a single guest OS direct access to I/O devices
  - Techniques: DMA and interrupt remapping
  - AMD-Vi and Intel VT-d
  - Toward the CPU: Intel CMT and CAT
- Network virtualization
  - The network card hardware must support this
  - Allows sharing a single network device among multiple guest OSes
  - Allows hardware-accelerated I/O operations
  - Intel VT-c, SR-IOV, MR-IOV
- PCI-SIG I/O virtualization
  - PCIe-standardized, non-x86-specific I/O virtualization methods
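The two-stage translation that EPT enables can be sketched as follows. The page size, the dict-based table shapes, and all names are invented for illustration: the guest's own page table maps guest-virtual to guest-physical, and the EPT (maintained by the VMM) maps guest-physical to host-physical, so the guest can edit its own tables without any VMM intervention:

```python
PAGE = 4096

def translate(gva, guest_pt, ept):
    """Two-stage address translation: guest-virtual -> guest-physical
    via the guest's page table, then guest-physical -> host-physical
    via the EPT. Tables map page numbers to frame numbers."""
    gpn, offset = divmod(gva, PAGE)
    gpa = guest_pt[gpn] * PAGE + offset   # stage 1: guest's own page table
    hpn = gpa // PAGE
    return ept[hpn] * PAGE + (gpa % PAGE) # stage 2: EPT, managed by the VMM

# Guest maps virtual page 0 to guest-physical frame 5;
# the VMM maps guest-physical frame 5 to host frame 9.
hpa = translate(16, guest_pt={0: 5}, ept={5: 9})
```

Without EPT, the VMM must intercept every guest page-table update to keep shadow page tables consistent; with EPT, the hardware walks both tables itself.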
Resource management tools (1/2)
- Virtualization needs basic tools
  - Virtualization is an abstraction by definition
  - But there must be resource management, too
  - And security, dependability, etc. (remember the OS basic tasks)
- Linux provides a wide range of tools, e.g.
  - cgroups (control groups)
    - An evolution of various mechanisms (check the 2.6.x history)
    - A unified interface to many different use cases
    - E.g., a memory usage limit for a subsystem
  - Namespaces
    - Isolating the namespace of a subsystem from the others: pid, mount, NIC, hostnames, etc.
    - E.g., virtual machine isolation
  - Governors
    - E.g., in the CPUfreq subsystem: Performance, Powersave, On-demand, and Conservative governors (what is available depends on the system)
Resource management tools (2/2)