Kernel Support for Paravirtualized Guest OS


Shibin (Jack) Xu
University of Washington
shibix@cs.washington.edu

ABSTRACT
Flexibility at the operating system level is one of the most important factors for highly optimized applications. For example, a database server may perform better with a file-system optimized for its data. xkvisor [2] is a kernel that provides flexibility at the operating system level through an approach different from a microkernel [4] or a virtual machine monitor [5]. xkvisor allows an application to run on a guest OS with a specialized file-system, implemented on top of the interface provided by xkvisor. This paper presents the integration of virtual memory and the file-system in xkvisor. The paper focuses on two parts. The first is a set of modifications to the original VM interface in xk [1]. The second is a user level file-system.

1. INTRODUCTION
Modern operating systems provide a fixed set of abstractions that hide details about the hardware. On one hand, these abstractions make software development easier by freeing the software developer from worrying about hardware management. On the other hand, they prevent the software developer from optimizing the software for specific hardware. xkvisor exposes hardware to user level guest OSes and allows software developers to deploy customized guest OSes. Unlike microkernels, xkvisor allows multiple user level guest OSes to run at the same time. Each guest OS has its own VM and file-system implemented on top of the interface provided by xkvisor. In this paper, hereafter, guest OS refers to a user level operating system implemented on top of the interface provided by xkvisor, and guest Application refers to a user level application that runs on some guest OS. Section 2 provides background for this paper. Section 3 covers VM design and implementation. Section 4 covers FS design and implementation. Section 5 covers testing. Section 6 discusses future work. Section 7 concludes.

2. BACKGROUND
We use xk [1], a monolithic teaching operating system, as the starting point for building xkvisor. xk supports a single-core processor with single-threaded, multi-process execution. Furthermore, xk provides LRU page management, copy-on-write fork, exec, and a POSIX-like file-system. Moreover, the entire operating system is implemented inside the kernel. For example, the open() system call causes the user level application to trap into the kernel; the kernel then executes in the same address space as the user level application but in kernel mode, and finally executes open() on the user level application's behalf. The rest of this section describes how xkvisor differs from a microkernel and a virtual machine monitor (i.e., VMM).

2.1 Microkernel
A microkernel [4] removes the limitation imposed by fixed OS abstractions by allowing users to implement these abstractions at user level. For example, a user process can implement the user level file-system. However, microkernels do not allow multiple user level OSes to run at the same time. This means there can be only a single operating system running on a microkernel at runtime, and user level applications must run on that single operating system. xkvisor, in contrast, allows multiple guest OSes to run on the same kernel at the same time, and therefore allows user level applications to run on different guest OSes at the same time.
2.2 Virtual Machine Monitor
A VMM [5] removes the limitation imposed by fixed OS abstractions by allowing the user to boot an instance of an operating system to run the user level application. For example, a user can ask the VMM to boot an instance of an OS optimized for database servers and then run a database server on that instance. Xen [5] provides a kernel that can run one hundred instances of commodity OSes with little performance cost, and it requires very little modification to the commodity OSes. In xkvisor, the hypervisor (i.e., the kernel) and the guest OS are co-designed for efficiency and simplicity. Hence, xkvisor only supports guest OSes that are implemented on top of the interface provided by the kernel.

3. VM DESIGN AND IMPLEMENTATION
In xk, resource management is handled by the kernel; thus, there is no defined interface for a user-level process to manage another user-level process's resources (e.g., pages). Furthermore, in xkvisor, a guest OS is a user-level process. Thus, to support page management within a guest OS, we have introduced privileged system calls [2] that allow a guest OS to manage hardware resources through the kernel. For security purposes, only a guest OS can make privileged system calls, and each guest OS may use privileged system calls only to manage its own resources. To distinguish between a guest OS and a guest Application, both of which are user-level processes, xkvisor keeps track of a flag denoting whether a process is a guest OS. This flag is set on creation of a user-level process and stays unchanged throughout the lifetime of the user-level process. In order to give a guest Application the flexibility to use any guest OS for system calls, xkvisor keeps track of the process id (i.e., PID) of the guest OS for each guest Application, and system calls from the guest Application are redirected to the guest OS associated with that PID. This PID is set on creation of a user-level process and stays unchanged throughout the lifetime of the user-level process. Note that for a guest OS this PID is unset and not used.

    grequest_proc        Allocates a user level process
    gload_program        Loads the executable ELF into the specified user level process's code region
    gdeploy_program      Sets up arguments for the specified user level process
    gquery_user_pages    Returns a copy of the physical page mapping (i.e., whether the guest OS owns each physical page)
    gaddmap              Maps a physical page at the specified virtual address in the specified user level process's address space
    gremovemap           Removes the mapping of the specified virtual address in the specified user level process's address space
    gbread               Reads the specified virtual disk block
    gbwrite              Writes the specified virtual disk block
    guest_syscall        Fetches the next system call to handle
    gresume              Resumes the specified guest Application
    get_ppn()            Returns the physical page number of the physical page mapped at the specified virtual address

Figure 1: Privileged System Calls

3.1 Kernel Level Page Management
To implement virtual memory management at user level, xkvisor needs to provide an interface for a guest OS to share pages with a guest Application. The memory sharing model in xk has several limitations. First, page sharing in xk must be done through copy-on-write fork. Second, copy-on-write fork in xk requires the kernel to manage page sharing for user level processes. Last, copy-on-write fork in xk requires shared pages to be mapped at the same virtual address in all involved user level processes. A guest OS in xkvisor, however, needs a much more flexible interface to share pages with a guest Application. To provide the desired page sharing flexibility, xkvisor provides a set of privileged system calls (Figure 1) to the guest OS and keeps track of a page map for each guest OS. The page map is an integer array whose size equals the number of physical pages. Entry page_map[n] of a guest OS denotes that guest OS's ownership of the physical page with physical page number (i.e., PPN) n: 0 means not owned, 1 means owned, and 2 means shared with some guest Application.
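As a concrete illustration, the following minimal C sketch shows such a page map and the ownership check the kernel might perform before honoring a call like gaddmap(). The identifiers (NPHYSPAGES, PM_OWNED, guest_owns_page, and the struct name) are assumptions for illustration, not xkvisor's actual code.

    #include <stdint.h>

    #define NPHYSPAGES 1024      /* assumed total number of physical pages */

    /* Assumed encoding of one page map entry, following Section 3.1. */
    enum { PM_NOT_OWNED = 0, PM_OWNED = 1, PM_SHARED = 2 };

    /* One page map per guest OS; entry n describes the physical page with PPN n. */
    struct guest_pagemap {
        int entries[NPHYSPAGES];
    };

    /* Check the kernel might perform before honoring gaddmap(): a guest OS may
     * only hand out pages it owns or already shares with a guest Application. */
    static int guest_owns_page(const struct guest_pagemap *pm, uint64_t ppn)
    {
        return ppn < NPHYSPAGES &&
               (pm->entries[ppn] == PM_OWNED || pm->entries[ppn] == PM_SHARED);
    }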

Figure 2: Guest OS and guest Application address spaces

Although each guest OS is a user level process, a guest OS is not trusted to manipulate all of its virtual memory. For example, a guest OS should not be able to change its code region at runtime. Hence, the page map of a guest OS marks all memory that is managed by the kernel (e.g., the stack, heap, and code region of the guest OS) as 0 (i.e., not owned by the guest OS). In fact, only physical pages mapped in the Guest OS Free Pages region (Figure 2) are marked 1 or 2 in the guest OS page map.

3.2 Guest OS Creation
When xkvisor prepares a user level process to run some guest OS, xkvisor allocates a fixed number of aligned physical pages for the user level process. Those pages are marked as owned in the guest OS page map but have not yet been mapped at any virtual address in the guest OS address space. Then, xkvisor loads in the code for the guest OS, yields the CPU, and eventually the kernel level scheduler schedules the guest OS to run.
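A hypothetical kernel-side sketch of this allocation step follows, using the 0/1/2 page map encoding from Section 3.1; GUEST_OS_NPAGES, the array argument, and grant_free_pages are illustrative assumptions rather than xkvisor's real interface.

    #include <stdint.h>

    #define GUEST_OS_NPAGES 64   /* assumed size of the fixed per-guest-OS allocation */

    /* Hypothetical kernel-side step of guest OS creation: mark a contiguous,
     * aligned run of physical pages as owned in the new guest OS's page map.
     * Nothing is mapped into the guest OS address space yet; the guest OS maps
     * these pages itself once it starts running. */
    static void grant_free_pages(int *page_map, uint64_t first_ppn)
    {
        for (uint64_t i = 0; i < GUEST_OS_NPAGES; i++)
            page_map[first_ppn + i] = 1;   /* 1 = owned, as in Section 3.1 */
    }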

When the guest OS starts running, it uses gquery_user_pages() to obtain a copy of its page map. The guest OS expects a fixed number of aligned physical pages to be marked as owned. Then, the guest OS uses map_physical_pages() to map those aligned physical pages at virtual address SZ_5G in its address space. For example, if the guest OS owns 10 physical pages starting at PPN = 0, then the guest OS owns physical pages with PPN = {0, 1, ..., 9}; physical page 0 is mapped at virtual address SZ_5G, and physical page 9 is mapped at virtual address SZ_5G + PGSZ*9. Finally, the guest OS can share those aligned physical pages with some guest Application through gaddmap, and the guest OS can directly access those aligned physical pages through virtual addresses starting at SZ_5G. The guest OS execution flow is as follows:

    for (;;) {
        gnext_syscall(&syscall);   // puts the guest OS to sleep if no syscalls are ready
        guest_syscall(syscall);    // handles the system call
        gresume(app_pid);          // resumes the guest Application that made the system call
    }

The guest OS keeps fetching the next system call to handle. If there is a system call for the guest OS to handle, gnext_syscall returns immediately with the next system call. If not, gnext_syscall causes the kernel to put the guest OS to sleep and wake it when the next system call arrives. More details on system call redirection and handling appear in Section 3.4.

3.3 Guest Application Creation
A guest Application is a user level process allocated by a guest OS through several privileged system calls. When an end user signals a guest OS to run some application as a guest Application (see the TESTING section for more details), the guest OS uses grequest_proc() to ask xkvisor to allocate a user level process as a guest Application whose associated guest OS PID equals the guest OS's PID. The difference between a kernel-allocated process (e.g., a guest OS) and a grequest_proc()-allocated process (e.g., a guest Application) is that xkvisor sets up virtual memory, including the stack, heap, and code region, for the former, but does not set up any virtual memory for the latter. In other words, grequest_proc() gives the guest OS a generic user level process that can be used to run any user level application with a configurable virtual memory layout (but still containing only three virtual memory regions: stack, heap, and code). For simplicity, the guest OS in xkvisor chooses to manage only the guest Application's stack (leaving the kernel to manage the guest Application's code region and disabling the guest Application's heap region). After a user level process P is set up by grequest_proc(), the guest OS uses gload_program() to ask xkvisor to set up P's code region. In gload_program(), the kernel allocates kernel pages for P's code region and loads in the code for P; thus, the kernel is responsible for deallocating P's code region. This design decision prevents the guest OS from changing the guest Application's code at runtime. Next, the guest OS searches for ten free pages in its page map and uses gaddmap() to map those ten pages into P's stack region, giving P both read and write permissions on those ten pages. Finally, the guest OS uses gdeploy_program() to ask xkvisor to set up arguments for P and mark P as runnable. Eventually, the kernel level scheduler schedules P and P starts running (the full sequence is sketched below).

3.4 System Call Redirection
One of the challenges in supporting file-system related system calls is argument fetching. For instance, a file name can be arbitrarily long, and it lives within the guest Application's address space.
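To make the sequence of Section 3.3 concrete, here is a minimal guest OS-side sketch. The prototypes, permission constant, stack layout, and helper name are assumptions for illustration; the real xkvisor signatures may differ.

    #include <stdint.h>

    /* Hypothetical prototypes for the privileged system calls of Figure 1. */
    int grequest_proc(void);                                    /* returns the new process's PID */
    int gload_program(int pid, const char *path);               /* kernel loads the ELF code region */
    int gaddmap(int pid, uint64_t ppn, uint64_t va, int perm);  /* maps one owned page into pid */
    int gdeploy_program(int pid, int argc, char **argv);        /* sets up arguments, marks runnable */

    #define PERM_RW    3                 /* assumed read+write permission encoding */
    #define PGSIZE     4096ULL
    #define STACK_TOP  0x80000000ULL     /* assumed top of the guest Application stack region */

    /* Sketch of Section 3.3: build a guest Application from a generic process,
     * backing its stack with ten pages the guest OS owns. */
    static int launch_guest_app(const char *path, const uint64_t free_ppns[10],
                                int argc, char **argv)
    {
        int pid = grequest_proc();
        if (pid < 0 || gload_program(pid, path) < 0)
            return -1;
        for (int i = 0; i < 10; i++) {
            uint64_t va = STACK_TOP - (uint64_t)(i + 1) * PGSIZE;   /* stack grows downward */
            if (gaddmap(pid, free_ppns[i], va, PERM_RW) < 0)
                return -1;
        }
        return gdeploy_program(pid, argc, argv);
    }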
xkvisor supports argument fetching by mapping the guest Application's stack into the guest OS address space in the way described in Sections 3.1 and 3.2.

3.4.1 Guest Application Address Translation
In xk, there is no notion of guest Application address translation, because when two user level processes share pages through copy-on-write fork, the shared pages are guaranteed to be mapped at the same virtual address. Since in xkvisor only the guest Application's stack is mapped into the guest OS virtual address space, we use the term address translation only for the case in which a guest OS needs to translate a guest Application virtual address, within the guest Application's stack region, into a guest OS virtual address.

In xkvisor, however, virtual address translation is non-trivial, because a guest OS can be associated with a set of guest Applications, and a guest Application's stack can be backed by any page mapped at SZ_5G in the guest OS address space. With the goal of maximizing kernel bypass, address translation is handled in the guest OS in the following way:
1. The guest OS uses get_ppn() to ask the kernel to translate the guest Application's virtual address into a PPN.
2. The guest OS translates the PPN into a virtual address in its own address space starting at SZ_5G (this virtual address is the start of the physical page).
3. The guest OS calculates the offset of the virtual address within the physical page and adds it to that starting address.

3.4.2 Argument Fetching
In xk, argument fetching is easy because when the kernel handles a system call for some user level process P, the kernel is mapped into P's address space. Thus, the kernel can directly access P's arguments without any address translation. In xkvisor, when a guest Application makes a system call, the kernel passes the trap frame (which contains the register values and the guest Application's PID) into a guest OS buffer. The register values contain the arguments for each sys_call, with the constraint that the first argument must be the sys_call number. Thus, the guest OS can determine what arguments to expect from the sys_call number and use the address translation discussed above to fetch the appropriate arguments. The limitation of this design is that the arguments for a sys_call must not exceed the number of registers, and arguments must be stored on the guest Application's stack (otherwise they will not be accessible to the guest OS). However, the argument-count limitation can easily be resolved by combining arguments into a struct and passing a pointer to that struct. Currently, none of the sys_calls hit this limitation.

4. FS DESIGN AND IMPLEMENTATION
In xk, the kernel implements a fault tolerant POSIX-like file system. All user level processes in xk share the same file system with no isolation. In xkvisor, however, the kernel provides only two basic disk I/O privileged system calls to the guest OS, for block reading and writing, and each guest OS is expected to implement its own file system in isolation. For example, changes made to one guest OS's disk blocks are not visible to another guest OS. The disk is assumed to be formatted before xkvisor boots. xkvisor reads its super block to learn which contiguous physical disk blocks are allocated to which guest OS, and a guest OS accesses its disk blocks using virtual disk block numbers through gbread() and gbwrite(). When xkvisor handles gbread() or gbwrite() from a guest OS, xkvisor translates the virtual block number passed by the guest OS into a physical block number.
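The block number translation just described might look like the following kernel-side sketch; the struct layout, field names, and function name are assumptions, not xkvisor's actual super block format.

    #include <stdint.h>

    /* Assumed shape of the per-guest-OS disk metadata recorded in the kernel's
     * super block; fields are illustrative only. */
    struct guest_disk_extent {
        uint32_t start_block;   /* first physical block allocated to this guest OS */
        uint32_t nblocks;       /* number of contiguous blocks allocated */
    };

    /* Kernel-side translation performed while handling gbread()/gbwrite():
     * the guest OS names blocks by virtual block number, and the kernel maps
     * that onto the physical block number, rejecting out-of-range requests. */
    static int vblock_to_pblock(const struct guest_disk_extent *ext,
                                uint32_t vblock, uint32_t *pblock)
    {
        if (vblock >= ext->nblocks)
            return -1;          /* outside this guest OS's disk space */
        *pblock = ext->start_block + vblock;
        return 0;
    }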

Figure 3: Disk Layout

4.1 Guest OS Disk Allocation
In xkvisor, the disk is formatted by mkfs.c before the kernel boots. As shown in Figure 3, the entire disk is considered Kernel Disk Space and is managed by xkvisor. A portion of the disk is allocated to guest OSes; this portion is considered Guest OS Disk Space, and each Guest OS Disk is managed by its owner guest OS.

4.2 Guest OS File System
In xkvisor, each running guest OS has its own file system, and a guest Application can choose which guest OS to use and hence which file system to use. xkvisor makes sure each guest OS file system is isolated at the disk level by forcing guest OSes to use virtual disk blocks. However, there is no isolation within each guest OS file system; that is, all guest Applications relying on the same guest OS share the same file system provided by that guest OS.

Currently, we have implemented only one file system for guest OSes; that is, every running guest OS has a copy of the same file system implementation. This is for simplicity; xkvisor allows different guest OSes to have different file system implementations. In xkvisor, the guest OS file system is the same as the file system in xk but without crash safety (see the FUTURE WORK section for more details). The guest OS file system involves two parts: the in-memory representation of the FS and the on-disk representation of the FS.

4.2.1 Guest OS FS Bootup
In xkvisor, after a guest OS starts running, the guest OS initializes its file system by reading in its super block through gbread(). The super block contains information about the guest OS disk layout, including the existing files on the guest OS disk.

4.2.2 Guest OS In-memory FS
In xkvisor, the guest OS in-memory file system is made of an icache and an OFT (i.e., open file table). The icache contains an array of inodes [1] and an inodefile [1]. The inodefile is the first inode on each guest OS disk and contains information about the other inodes (e.g., size, starting block). The OFT is an array of files [1] inside each guest OS address space and is accessed and managed only by that guest OS. File-system related data structures and abstractions such as the inode, inodefile, and OFT originate from xk. The difference is that in xk the kernel manages these abstractions and all user level processes share the same file system, while in xkvisor each guest OS manages its own copy of these abstractions and only the guest Applications associated with that guest OS share the same file system.

4.2.3 Guest OS On-disk FS
In xkvisor, the physical disk has two regions. The first region is the Kernel Disk Space (i.e., the entire disk), which holds the kernel's file system. The second region is the Guest OS Disk Space, which holds the guest OS file systems. Inside the Guest OS Disk Space, the kernel allocates a fixed number of contiguous disk blocks to each guest OS. The metadata describing each guest OS disk is stored inside the kernel's super block.

4.3 Guest Application FS System Calls
In xk, the kernel provides a set of POSIX-like file-system system calls; hence, user level processes do not have an alternative set of file-system system calls to use. In xkvisor, each guest OS implements its own file system; hence, a guest Application can freely choose any guest OS with the desired file system interface. For simplicity, xkvisor reuses the same POSIX-like interface that originates from xk. In xkvisor, when a guest Application makes a file-system system call (e.g., open), the kernel and the guest OS use the procedure described in Section 3.4.2 (Argument Fetching). After the guest OS has successfully fetched the arguments passed in by the guest Application, the guest OS can use the sys_call number to decide which system call handler to invoke and what arguments to pass to that handler. In xk, a system call handler executes in kernel mode. In xkvisor, however, a guest OS system call handler executes in user mode and uses privileged system calls only when necessary. For example, for the guest OS read() handler (hereafter read) to read content into the guest Application's buffer, read uses gbread to read in the content and uses the procedure described in Section 3.4.1 (Guest Application Address Translation) to write the content directly into the guest Application's buffer (a sketch appears below).

5. TESTING
The test cases focus on whether two guest Applications can each use a different guest OS for file-system system calls, and whether the two guest Applications interact with two separate file systems in isolation.
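The sketch referenced in Section 4.3 follows: a guest OS read() handler that combines the address translation of Section 3.4.1 with gbread(). The prototypes, SZ_5G arithmetic with base_ppn, block size, and the single-page assumption are illustrative assumptions, not xkvisor's actual code.

    #include <stdint.h>
    #include <string.h>

    #define PGSIZE 4096ULL
    #define SZ_5G  (5ULL << 30)    /* guest OS view of its owned pages (Section 3.2) */
    #define BSIZE  512             /* assumed disk block size */

    /* Privileged system calls from Figure 1; prototypes are assumed. */
    uint64_t get_ppn(int app_pid, uint64_t app_va);
    int      gbread(uint32_t vblock, void *dst);

    static uint64_t base_ppn;      /* first PPN owned by this guest OS (0 in the paper's example) */

    /* Section 3.4.1: translate a guest Application virtual address (which must
     * lie in its stack region, backed by pages the guest OS mapped at SZ_5G)
     * into an address the guest OS can dereference directly. */
    static void *app_va_to_guest_va(int app_pid, uint64_t app_va)
    {
        uint64_t ppn  = get_ppn(app_pid, app_va);               /* step 1 */
        uint64_t page = SZ_5G + (ppn - base_ppn) * PGSIZE;      /* step 2 */
        return (void *)(uintptr_t)(page + app_va % PGSIZE);     /* step 3 */
    }

    /* Sketch of a guest OS read() handler: read one virtual disk block with
     * gbread() and copy it straight into the guest Application's buffer,
     * assuming for brevity that the buffer does not cross a page boundary. */
    static int handle_read(int app_pid, uint32_t vblock, uint64_t app_buf)
    {
        char block[BSIZE];
        if (gbread(vblock, block) < 0)
            return -1;
        memcpy(app_va_to_guest_va(app_pid, app_buf), block, BSIZE);
        return BSIZE;
    }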

5.1 Roadmap
Before xkvisor boots, mkfs.c formats the disk for xkvisor. After xkvisor boots and the kernel has started running, the kernel creates the first user level process, init. Next, init asks the kernel to create three user level processes: shell, guest OS 1, and guest OS 2. Finally, the shell prompts the end user for input to run testing programs.

5.2 Testing Details
Before the end user enters any input, the shell prints out guest OS 1's and guest OS 2's PIDs (hereafter PID1 and PID2, respectively). The end user can use PID1 or PID2 to indicate which guest OS a guest Application should execute on. The testing guest Application exercises file read/write and file creation. To test file system isolation, the end user can first run the testing guest Application with PID1 and then with PID2 to verify that file modifications made in the first run are not visible in the second run.

6. FUTURE WORK
6.1 VM Design
Currently in xkvisor, the guest OS does not provide the guest Application a heap, but this can be done using the same mechanism the guest OS uses to manage the guest Application's stack. The modification would include adding system calls supporting dynamic heap allocation and deallocation. Moreover, in xkvisor, the guest Application relies on the kernel to pass the trap frame to the guest OS for system calls. Thus, every system call made by a guest Application causes a trap into the kernel, which is expensive. A design alternative is to use user-level RPC [3] for system calls, which would bypass the kernel more often.

6.2 File-System
Currently in xkvisor, the guest OS implements only one file system, which is not crash safe. After user level locking is added to the code base, crash safety should be added to the guest OS file system. To show that xkvisor can support multiple guest OSes with different file systems running at the same time, xkvisor should include a new guest OS implementation with a different file system.

7. CONCLUSIONS
In this paper, we presented xkvisor, a kernel that supports multiple paravirtualized guest OSes. The VM interface and the guest OS file-system implementation demonstrate an approach to kernel support for paravirtualized guest OSes in physical resource allocation and management.

REFERENCES
[1] T. Anderson, D. Zhuo, M. Rockett, et al. xk (Experimental Kernel). University of Washington, 2017-present.
[2] B. Yue, O. Magness, S. Samantaray, E. Matsushima, and B. Kim. Xkvisor. University of Washington.
[3] T. Anderson, E. D. Lazowska, H. M. Levy, and B. N. Bershad. User-level interprocess communication for shared memory multiprocessors. University of Washington.
[4] J. Liedtke. On micro-kernel construction. In SOSP, 1995.
[5] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In SOSP, 2003.