Bridging the Gap between Software and Hardware Techniques for I/O Virtualization


Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, Ian Pratt
HP Laboratories HPL

Keyword(s): Virtualization, Networking, Operating System, I/O, Performance Analysis, Xen

Abstract: The paravirtualized I/O driver domain model, used in Xen, provides several advantages including device driver isolation in a safe execution environment, support for guest VM transparent services including live migration, and hardware independence for guests. However, these advantages currently come at the cost of high CPU overhead which can lead to low throughput for high bandwidth links such as 10 gigabit Ethernet. Direct I/O has been proposed as the solution to this performance problem, but at the cost of removing the benefits of the driver domain model. In this paper we show how to significantly narrow the performance gap by improving the performance of the driver domain model. In particular, we reduce execution costs for conventional NICs by 56% on the receive path, and we achieve close to direct I/O performance for network devices supporting multiple hardware receive queues. These results make the Xen driver domain model an attractive solution for I/O virtualization for a wider range of scenarios.

External Posting Date: June 21, 2008 [Fulltext]  Approved for External Publication
Internal Posting Date: June 21, 2008 [Fulltext]
To be published and presented at the 2008 USENIX Annual Technical Conference, June 22-27, 2008, Boston, Massachusetts.
Copyright 2008 USENIX Annual Technical Conference

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, Ian Pratt
Hewlett Packard Laboratories, Palo Alto, CA; University of Cambridge, Cambridge, UK
Currently at Skytap.

Abstract

The paravirtualized I/O driver domain model, used in Xen, provides several advantages including device driver isolation in a safe execution environment, support for guest VM transparent services including live migration, and hardware independence for guests. However, these advantages currently come at the cost of high CPU overhead which can lead to low throughput for high bandwidth links such as 10 gigabit Ethernet. Direct I/O has been proposed as the solution to this performance problem, but at the cost of removing the benefits of the driver domain model. In this paper we show how to significantly narrow the performance gap by improving the performance of the driver domain model. In particular, we reduce execution costs for conventional NICs by 56% on the receive path, and we achieve close to direct I/O performance for network devices supporting multiple hardware receive queues. These results make the Xen driver domain model an attractive solution for I/O virtualization for a wider range of scenarios.

1 Introduction

In virtual machine environments like VMware [12], Xen [7], and KVM [23], a major source of performance degradation is the cost of virtualizing I/O devices to allow multiple guest VMs to securely share a single device. While the techniques used for virtualizing CPU and memory resources have very low overhead, leading to near native performance [28][7][4], it is challenging to efficiently virtualize most current I/O devices. Each interaction between a guest OS and an I/O device needs to undergo costly interception and validation by the virtualization layer for security isolation and for data multiplexing and demultiplexing [27]. This problem is particularly acute when virtualizing high-bandwidth network interface devices, because frequent software interactions with the device are needed to handle the high rate of packet arrivals.

Paravirtualization [29] has been proposed and used (e.g., in Xen [7][13]) to significantly shrink the cost and complexity of I/O device virtualization compared to using full device emulation. In this approach, the guest OS executes a paravirtualized (PV) driver that operates on a simplified abstract device model exported to the guest. The real device driver that actually accesses the hardware can reside in the hypervisor, or in a separate device driver domain which has privileged access to the device hardware. Using device driver domains is attractive because they allow the use of legacy OS device drivers for portability, and because they provide a safe execution environment isolated from the hypervisor and other guest VMs [16][13]. Even with the use of PV drivers, there remains very high CPU overhead (e.g., a factor of four) compared to running in non-virtualized environments [18][17], leading to throughput degradation for high bandwidth links (e.g., 10 gigabits/second Ethernet). While there has been significant recent progress making the transmit path more efficient for paravirtualized I/O [17], little has been done to streamline the receive path, the focus of this paper.
To avoid the high performance overheads of software-based I/O device virtualization, efforts in academia and industry are adding, to varying degrees, hardware support for virtualization into I/O devices and platforms [1][24][31][22][3][6]. These approaches present a tradeoff between efficiency and transparency of I/O device virtualization. In particular, using hardware support for direct I/O [31][24][22], in which a device presents multiple logical interfaces which can be securely accessed by guest VMs bypassing the virtualization layer, results in the best possible performance, with CPU cost close to native performance. However, direct I/O lacks key advantages of a dedicated driver domain model: device driver isolation in a safe execution environment avoiding guest domain corruption by buggy drivers,

and full support for guest VM transparent services including live migration [26][21][11] and traffic monitoring. Restoring these services would require either additional support in the devices or breaking the transparency of the services to the guest VM. In addition, it is difficult to exploit direct I/O in emerging virtual appliance models of software distribution, which rely on the ability to execute on arbitrary hardware platforms. To use direct I/O, virtual appliances would have to include device drivers for a large variety of devices, increasing their complexity, size, and maintenance burden.

In this paper, we significantly bridge the performance gap between the driver domain model and direct I/O in the Xen virtual machine environment, making the driver domain model a competitive and attractive approach in a wider range of scenarios. We first present a detailed analysis of the CPU costs of I/O operations for devices without hardware support for virtualization. Based on this analysis we present implementation and configuration optimizations, and we propose changes to the software architecture to reduce the remaining large costs identified by the analysis: per-packet overheads for memory management and protection, and per-byte data copy overheads. Our experimental results show that the proposed modifications reduce the CPU cost by 56% for a streaming receive microbenchmark.

In addition to improving the virtualization of conventional network devices, we propose extensions to the Xen driver domain model to take advantage of emerging network interface devices that provide multiple hardware receive queues, with packet demultiplexing performed by the device based on packet MAC address and VLAN ID [1]. The combination of multi-queue devices with our software architecture extensions provides a solution that retains all the advantages of the driver domain model and preserves all the benefits of virtualization, including guest-transparent migration and other services. Our results show that this approach has low overhead, incurring CPU cost close to that of direct I/O for a streaming receive microbenchmark.

The rest of the paper is organized as follows. Section 2 reviews the Xen driver domain model. Section 3 presents a detailed analysis and breakdown of the costs of I/O virtualization. Section 4 presents our implementation and architectural changes that significantly improve performance for the driver domain model. Section 5 discusses how configuration settings affect I/O virtualization performance, and Section 6 presents our conclusions.

2 Xen Network I/O Architecture

Figure 1 shows the architecture of Xen paravirtualized (PV) networking.

Figure 1: Xen PV driver architecture

Guest domains (i.e., running virtual machines) host a paravirtualized device driver, netfront, which interacts with the device indirectly through a separate device driver domain, which has privileged access to the hardware. Driver domains directly access hardware devices that they own; however, interrupts from these devices are first handled by the hypervisor, which then notifies the corresponding driver domain through virtual interrupts. Netfront communicates with a counterpart backend driver called netback in the driver domain, using shared memory I/O channels.
The driver domain uses a software bridge to route packets between the physical device and multiple guests through their netback interfaces. Each I/O channel comprises an event notification mechanism and a bidirectional ring of asynchronous requests carrying I/O buffer descriptors from netfront to netback, and the corresponding responses. The event notification mechanism enables netfront and netback to trigger a virtual interrupt on the other domain to indicate that new requests or responses have been posted. To enable driver domains to access I/O buffers in guest memory, Xen provides a page grant mechanism. A guest creates a grant reference providing access to a page and forwards the reference as part of the I/O buffer descriptor. By invoking a hypercall, the driver domain uses the grant reference to access the guest page. For transmit (TX) requests, the driver domain uses a hypercall to map the guest page into its address space before sending the request through the bridge. When the physical device driver frees the page, a callback function is automatically invoked to return a response to netfront, which then revokes the grant. For RX requests, netfront posts I/O buffer page grants to the RX I/O channel. When netback receives a packet from the bridge, it retrieves a posted grant from the I/O channel and issues a grant copy hypercall to copy the packet into the guest page. Finally, netback sends a response to the guest via the RX channel indicating that a packet is available.
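To make the receive path above concrete, the following is a minimal sketch of how a netback-style backend might build a grant copy request for one received packet and hand it to the hypervisor. It is based on the Xen grant-table interface as found in Linux/Xen trees of this era; the ring fetch and response steps are omitted, and the parameters (guest id, grant reference, destination offset) are assumed to have been extracted from the posted RX request.

    /*
     * Sketch of the netback RX step described above: copy a packet that
     * arrived in driver-domain memory into the guest page named by a grant
     * reference that netfront posted on the RX I/O channel.  Header and
     * field names may differ in detail across kernel trees.
     */
    #include <linux/skbuff.h>
    #include <linux/string.h>
    #include <linux/errno.h>
    #include <xen/interface/grant_table.h>
    #include <asm/xen/hypercall.h>
    #include <asm/xen/page.h>

    static int copy_packet_to_guest(struct sk_buff *skb, domid_t guest,
                                    grant_ref_t gref, uint16_t dest_offset)
    {
        struct gnttab_copy op;

        memset(&op, 0, sizeof(op));
        /* Source: the packet data sitting in a local driver-domain page. */
        op.source.u.gmfn = virt_to_mfn(skb->data);
        op.source.domid  = DOMID_SELF;
        op.source.offset = offset_in_page(skb->data);
        /* Destination: the guest page named by the posted grant reference. */
        op.dest.u.ref    = gref;
        op.dest.domid    = guest;
        op.dest.offset   = dest_offset;
        op.len           = skb_headlen(skb);
        op.flags         = GNTCOPY_dest_gref;

        /* A real backend batches many ops per hypercall; one is shown here. */
        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);

        /* On success a response carrying the packet length is pushed onto
         * the RX ring and the guest is notified via the event channel.     */
        return op.status == GNTST_okay ? 0 : -EIO;
    }

In the architecture described in this section, this copy runs on the driver domain's CPU; Section 4.1 revisits where it should run.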

Table 1: Classes grouping Linux functions

  Class      Description
  driver     network device driver and netfront
  network    general network functions
  bridge     network bridge
  netfilter  network filter
  netback    netback
  mem        memory management
  interrupt  interrupt, softirq, & Xen events
  schedule   process scheduling & idle loop
  syscall    system call
  time       time functions
  dma        dma interface
  hypercall  call into Xen
  grant      issuing & revoking grants

Table 2: Classes grouping Xen functions

  Class      Description
  grant      grant map, unmap, or copy operation
  schedule   domain scheduling
  hypercall  hypercall handling
  time       time functions
  event      Xen events
  mem        memory
  interrupt  interrupt
  entry      enter/exit Xen (hypercall, interrupt, fault)
  traps      fault handling (also system call intercept)

3 Xen Networking Performance Analysis

This section presents a detailed performance analysis of Xen network I/O virtualization. Our analysis focuses on the receive path, which has higher virtualization overhead than the transmit path and has received less attention in the literature. We quantify the cost of processing network packets in Xen and the distribution of cost among the various components of the system software. Our analysis compares the Xen driver domain model, the Xen direct I/O model, and native Linux. The analysis provides insight into the main sources of I/O virtualization overhead and guides the design changes and optimizations we present in Section 4.

3.1 Experimental Setup

We run our experiments on two HP c-Class BL460c blade servers connected through a gigabit Cisco Catalyst Blade Switch 3020. Each server has two 3GHz Intel Xeon 5160 CPUs (two dual-core CPUs) with 4MB of L2 cache each, 8GB of memory, and two Broadcom NetXtreme II BCM5785 gigabit Ethernet Network Interface Cards (NICs). Although each server had two NICs, only one was used in each experiment presented in this paper.

Table 3: Global function grouping

  Class      Description
  xen0       Xen functions in domain 0 context
  kernel0    kernel functions in domain 0
  grantcopy  data copy in grant code
  xen        Xen functions in guest context
  kernel     kernel functions in guest
  usercopy   copy from kernel to user buffer

To generate network traffic we used the netperf [1] UDP_STREAM microbenchmark. It would be difficult to separate the CPU costs on the transmit and receive paths using TCP, which generates ACK packets in the opposite direction of data packets. Therefore, we used unidirectional UDP instead of TCP traffic. Although all results are based on UDP traffic, we expect TCP to have similar behavior at the I/O virtualization level.

We used a recent xen-unstable distribution with paravirtualized Linux domains (i.e., a modified Linux kernel that does not require CPU virtualization support), using the linux-xen kernel. The system was configured with one guest domain and one privileged domain (domain 0), which was also used as the driver domain. For direct I/O evaluation the application was executed directly in domain 0. Both domain 0 and the guest domain were configured with 512MB of memory and a single virtual CPU each. The virtual CPUs of the guest and domain 0 were pinned to different cores on different CPU sockets.

We use OProfile [2][18] to determine the number of CPU cycles used in each Linux and Xen function when processing network packets. Given the large number of kernel and hypervisor functions, we group them into a small number of classes based on their high-level purpose, as described in Tables 1 and 2. In addition, when presenting overall results we group the functions into the global classes described in Table 3.
To provide safe direct I/O access to guests, an IOMMU [8] is required to prevent device DMA operations from accessing other guests' memory. However, we were not able to obtain a server with IOMMU support. Thus, our results for direct I/O are optimistic, since they do not include IOMMU overheads. Evaluations of IOMMU overheads are provided in [9][3].

3.2 Overall Cost of I/O Virtualization

Figure 2 compares the CPU cycles consumed for processing received UDP packets in Linux, in Xen with direct I/O access from the guest, and in Xen with the paravirtualized (PV) driver. The graph shows results for three different sizes of UDP messages: 52, 1500, and 48000 bytes.

Figure 2: CPU usage to receive UDP packets
Figure 3: Kernel CPU cost for direct I/O
Figure 4: Hypervisor CPU cost for direct I/O

A message of 52 bytes corresponds to a typical TCP ACK, while a message of 1500 bytes corresponds to the maximum Ethernet packet size. Using messages of 48000 bytes also generates maximum-size packets, but reduces the overhead of system calls at the kernel and application interface by delivering data in larger chunks. The experiments with maximum-size packets in Figure 2, and in the remaining results presented in this paper, were able to saturate the gigabit link. Experiments with 52-byte packets were not able to saturate the network link, as the receive CPU becomes the bottleneck. To avoid having an overloaded CPU with a high number of dropped packets, we throttled the sender rate for small packets to have the same packet rate as with the large packet sizes.

The results show that in the current Xen implementation, the PV driver model consumes significantly more CPU cycles to process received packets than Linux. Direct I/O has much better performance than the PV driver, but it still has significant overhead when compared to Linux, especially for small message sizes. In the rest of this paper we analyze the sources of I/O virtualization overheads in detail, and based on this analysis we propose several changes in the design and implementation of network I/O virtualization in Xen.

3.3 System Call Overhead

We start by looking at the overheads when running guests with direct I/O access. It is surprising that for small message sizes Xen with direct I/O uses twice the number of CPU cycles to process received packets compared to non-virtualized Linux. This is a consequence of memory protection limitations of the 64-bit x86 architecture that are not present in the 32-bit x86 architecture. For paravirtualized guests, the entire hypervisor memory is mapped into a small region at the end of every guest virtual address space. In the 32-bit x86 architecture this memory is protected from guest accesses using segments. Segments containing hypervisor memory are only accessible in privilege level 0. The 3 additional privilege levels (rings) in the x86 architecture are available to the guest and enable the kernel to protect its memory from user-level code. The 64-bit x86 architecture does not have support for segmentation, forcing Xen to rely on page table protection mechanisms to protect its memory from guest accesses. However, page table entries use only a single bit to distinguish supervisor and user mode, effectively limiting the processor to just two privilege levels. Since Xen uses one privilege level to protect its memory, kernel and user-level applications have to be executed at the same privilege level and therefore cannot share the same set of page tables. This requires Xen to intercept every system call to switch between user and kernel level page tables. In addition to the overhead of the interception, TLB flushes associated with the page table switches reduce the performance of the kernel and application code.
TLB flushes have negligible effect on hypervisor performance, since Xen pages use global mappings and are not flushed from the TLB on system calls. Figure 3 shows the cost of TLB flushes on the guest kernel. The graph for Xen-TLB shows the performance improvement when the TLB flush on system call interception is disabled. This is not a safe optimization, as it exposes guest kernel memory to user-level processes, but it is used here for performance evaluation purposes. The results in Figure 3 show that the cost of TLB flushes is approximately 30% of the Linux cost when we have one system call per packet (message = 1500 bytes). In contrast, with large message sizes of 48000 bytes the overhead due to TLB flushes is negligible.

Figure 4 shows the cost of the Xen code used to intercept system calls. The cost difference between the two message sizes corresponds approximately to the system call interception cost. Note that the cost of intercepting system calls and of TLB flushes affects guests using the PV driver in the same way as it affects guests using direct I/O. We note that system call interception is an artifact of current hardware limitations which will be eliminated with time. CPU hardware support for virtualization [2][5] provides different partitions for hypervisor and guests and eliminates the need for system call interception. Over time, as the performance of hardware virtualization mechanisms improves, it is expected that paravirtualized guests will also use this mechanism (currently used only by fully virtualized guests) and eliminate the need for system call interception. Given that the overhead for system call interception is temporary and not specific to I/O virtualization, we avoid its effect in the remaining results presented in this paper. In the following graphs we only present results for large message sizes (48000 bytes), where the overhead due to system call interception is negligible.

3.4 Analysis of Direct I/O Performance

Ignoring system call interception, we observe that the kernel CPU cost for Xen with direct I/O is similar to Linux, as illustrated by the large message results in Figure 3. Xen with direct I/O consumes approximately 900 more CPU cycles per packet than native Linux (i.e., 31% overhead) on the receive path. Of these 900 cycles, 250 cycles are due to paravirtualization changes in the kernel code. In particular, direct I/O in Xen uses more CPU cycles than Linux in DMA functions. This is due to the use of a different implementation of the Linux DMA interface. The DMA interface is a kernel service used by device drivers to translate a virtual address to a bus address used in device DMA operations. Native Linux uses the default (pci-nommu.c) implementation, which simply returns the physical memory address associated with the virtual address. Paravirtualized Linux uses the software-emulated I/O TLB code (swiotlb.c), which implements the Linux bounce buffer mechanism. This is needed in Xen because guest I/O buffers spanning multiple pages may not be contiguous in physical memory. The I/O bounce buffer mechanism uses an intermediary contiguous buffer and copies the data to/from its original memory location after/before the DMA operation, in case the original buffer is not contiguous in physical memory. However, the I/O buffers used for regular Ethernet packets do not span multiple pages and thus are not copied into a bounce buffer. The different number of observed CPU cycles is due to the different logic and extra checks needed to verify whether the buffer is contiguous, and not due to an extra data copy. Of the total 900 cycles/packet overhead for direct I/O, the hypervisor accounts for 650 cycles/packet, as shown in Figure 7. Most of these cycles are in timer related functions, interrupt processing, and entering and exiting (entry) the hypervisor due to interrupts and hypercalls.

3.5 Analysis of PV Driver Performance

The Xen PV driver consumes significantly more CPU cycles than direct I/O, as shown in Figure 2. This is expected, since the Xen PV driver model runs an additional domain to host the device driver. Extra CPU cycles are consumed both by the driver domain kernel and by the hypervisor, which is now executed in the context of two domains compared to one domain with direct I/O.
Additional CPU cycles are consumed to copy I/O data across domains using the Xen grant mechanism. The end result is that the CPU cost to process received network packets for the Xen PV driver is approximately 18,200 cycles per packet, which corresponds to 4.5 times the cost of native Linux. However, as we show in this paper, implementation and design optimizations can significantly reduce this CPU cost. Of the total 18,200 CPU cycles consumed, approximately 1,700 cycles are for copying data from kernel to user buffer (usercopy), 4,300 for guest kernel functions (kernel), 900 for Xen functions in guest context (xen), 3,000 for copying data from the driver domain to the guest domain (grantcopy), 5,400 for driver domain kernel functions (kernel0), and 2,900 for Xen functions in driver domain context (xen0). In contrast, native Linux consumes approximately 4,000 CPU cycles for each packet, where 1,100 cycles are used to copy data from kernel to user space and 2,900 cycles are consumed in other kernel functions. In the following subsections we examine each component of the CPU cost for the Xen PV driver to identify the specific causes of high overhead.

3.5.1 Copy overhead

We note in Figure 2 that the two data copies (usercopy and grantcopy) in the Xen PV driver model consume a significantly higher number of CPU cycles than the single data copy in native Linux (usercopy). This is a consequence of using different memory address alignments for the source and destination buffers. Intel processor manuals [15] indicate that the processor is more efficient when copying data between memory locations that have the same 64-bit word alignment, and even more efficient when they have the same cache line alignment. But packets received from the network are not aligned, in order to align IP headers following the typical 14-byte Ethernet header.

Figure 5: Alignment effect on data copy cost
Figure 6: Kernel CPU cost
Figure 7: Hypervisor CPU cost
Figure 8: Driver domain CPU cost (kernel)

Netback copies these non-aligned packets to the beginning of the granted page, which by definition is aligned. In addition, since the packet now starts at a 64-bit word boundary in the guest, the packet payload will start at a non-word boundary in the destination buffer, due to the Ethernet header. This causes the second copy, from the kernel to the user buffer in the guest, to also be misaligned. The two unaligned data copies consume significantly more CPU cycles than aligned copies. To evaluate this overhead, we modified netback to copy the packet into the guest buffer with an offset that makes the destination of the copy have the same alignment as the source. This is possible because we use a Linux guest, which permits changing the socket buffer boundaries after it is allocated. Figure 5 shows the CPU cost of the grant copy for different copy alignments. The first bar shows the number of CPU cycles used to copy a 1500-byte packet when the source and destination have different word alignments. The second bar shows the number of CPU cycles when source and destination have the same 64-bit word alignment but different cache line alignment (128 bytes), while the third bar shows the number of CPU cycles when source and destination have the same cache line alignment. These results show that proper alignment can reduce the cost of the copy by a factor of two.

3.5.2 Kernel overhead

Figure 6 compares the kernel cost for the Xen PV driver model and for the direct I/O model. The Xen PV driver uses significantly more CPU cycles than direct I/O, especially in device driver functions. While the device drivers are different (i.e., the physical device driver for direct I/O and netfront for the PV driver), we would expect the physical device driver in direct I/O to be more expensive, as it has to interact with a real device, while netfront in the PV driver model only accesses the I/O channels hosted in main memory. After careful investigation we determined that the main source of increased CPU cycles in the Xen PV driver is the use of fragments in guest socket buffers. Xen supports TCP Segmentation Offloading (TSO) [17] and Large Receive Offloading (LRO) [14][19] in its virtual interfaces, enabling the use of large packets spanning multiple pages for efficient packet processing. For this reason netfront posts full-page buffers which are used as fragments in socket buffers, even for normal-size Ethernet frames. Netfront copies the first 200 bytes of the first fragment into the main linear socket buffer area, since the network stack requires the packet headers in this area. Any remaining bytes past the first 200 are kept in the fragment.
Use of fragments thereby introduces copy overheads and socket buffer memory allocation overheads. To measure the cost of using fragments we modified netfront to avoid using fragments. This was

accomplished by pre-allocating socket buffers and posting those buffers directly into the I/O channel instead of posting fragment pages. This modification assumes packets will not use multiple pages and thus will not require fragments, since the posted buffers are regular socket buffers and not full-page fragments. Since our NIC does not support LRO [14] and is not configured with jumbo packets, all packets received from the external network use single-page buffers in our experiments. Of course this modification cannot be used for guest-to-guest communication, since a guest can always send large fragmented packets. The third bar in Figure 6 shows the performance results using the modified netfront. The result shows that using socket buffers with fragments is responsible for most of the additional kernel cycles in the PV driver case when compared to direct I/O. Without the overhead of fragments, the kernel CPU cost for the PV driver is very close to that of direct I/O, except for small variations in the distribution of cycles among the different kernel functions. In particular, netfront in the Xen PV driver model now has lower cost than the physical device driver in direct I/O, as expected. Also, since netfront does not access a physical device, it does not need to use the software I/O TLB implementation of the DMA interface used by direct I/O. On the other hand, PV drivers have the cost of issuing and revoking grants, which are not used with direct I/O. Most of this grant cost is due to the use of expensive atomic compare-and-swap instructions to revoke grant privileges in the guest grant table. This is necessary because Xen uses different bits of the same grant table field to store the status of the grant (in use or not by the driver domain), updated by Xen, and the grant access permission (enable or revoke grant access), updated by the issuing domain. The guest must revoke the grant and check that it is no longer in use by the driver domain using an atomic operation, to ensure that a driver domain does not race with grant revoking and keep an undesired reference to the page after the grant is revoked. The use of atomic compare-and-swap instructions to revoke a grant adds a significant number of CPU cycles to the guest kernel cost. If, however, these two different grant bits are stored in different words, we can ensure atomicity using less expensive operations such as memory barriers. The results with implementation optimizations presented later in Section 4 include this and other grant optimizations discussed in the following section.

3.5.3 Hypervisor overhead

I/O processing consumes CPU cycles executing hypervisor functions in addition to guest kernel code. Figure 7 shows the CPU cost in Xen hypervisor functions for receiving network packets with the PV driver. The graph shows the CPU cost when executing Xen functions in the context of the guest and in the context of the driver domain (i.e., domain 0), and compares them with the CPU cost for direct I/O. The graph shows that most of the CPU cost for the hypervisor is due to code executing in the context of the driver domain, and that the larger cost components are due to grant operations and schedule functions. The hypervisor CPU cost in guest context for the PV driver is similar to the hypervisor cost for direct I/O, except for a higher cost in schedule functions. The higher cost in scheduling functions for the PV driver is due to increased cache misses when accessing Xen data structures for domain scheduling purposes.
With the PV driver, the guest and the driver domain run on different CPUs, causing domain data structures used by scheduling functions to bounce between the two CPUs. We identified the most expensive operations in the grant code by selectively removing code from the grant functions. They are the following: 1) acquiring and releasing spin locks, 2) pinning pages, and 3) the use of expensive atomic swap operations to update grant status, as previously described. All of these operations can be optimized. The number of spinlock operations can be significantly reduced by combining the operations of multiple grants into a single critical section. The use of atomic swap operations can be avoided by separating the grant fields into different words, as described in Section 3.5.2. Note that this optimization reduces overhead in Xen and in the guest, since both of them have to access the same grant field atomically. The cost of pinning pages can also be optimized when using grants for data copy. During a grant copy operation, the hypervisor creates temporary mappings into the hypervisor address space for both the source and destination of the copy. The hypervisor also pins (i.e., increments a reference counter on) both pages to prevent the pages from being freed while the grant is active. However, usually one of the pages is already pinned and mapped in the address space of the current domain, which issued the grant operation hypercall. Thus we can avoid mapping and pinning the domain-local page and just pin the foreign page referred to by the grant. It turns out that pinning a page for writing in Xen is significantly more expensive than pinning it for reading, as it requires incrementing an additional reference counter using an expensive atomic instruction. Therefore, this optimization has a higher performance impact when the grant is used to copy data from a granted page to a local page (as we propose below in Section 4.1) instead of the other way around.
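As an illustration of the word-splitting optimization just described, the sketch below contrasts a revoke on a single shared flags word, which needs an atomic compare-and-swap to avoid racing with Xen setting the in-use bit, with a hypothetical layout in which the guest-written permission and the Xen-written status live in separate words and a memory barrier suffices. The structure layouts and flag names are illustrative stand-ins, not the actual Xen grant table definitions.

    /* Illustrative only: layouts and flag names are hypothetical stand-ins
     * for the single-word vs. split-word grant entries discussed above.   */
    #include <stdint.h>

    #define GTF_PERMIT_ACCESS 0x1   /* written by the granting guest          */
    #define GTF_IN_USE        0x2   /* written by Xen while the grant is used */

    /* Current scheme: permission and status share one word, so the guest
     * must clear its bit with a compare-and-swap that fails if Xen has
     * started using the grant in the meantime.                            */
    struct grant_entry_shared {
        volatile uint16_t flags;
        uint16_t domid;
        uint32_t frame;
    };

    static int revoke_shared(struct grant_entry_shared *g)
    {
        uint16_t old = g->flags;
        if (old & GTF_IN_USE)
            return 0;                     /* still in use: cannot revoke   */
        return __sync_bool_compare_and_swap(&g->flags, old,
                                            old & ~GTF_PERMIT_ACCESS);
    }

    /* Proposed scheme: the two writers get separate words, so an ordinary
     * store ordered by a memory barrier replaces the atomic swap.         */
    struct grant_entry_split {
        volatile uint16_t access;         /* guest-written permission bit  */
        volatile uint16_t status;         /* Xen-written in-use bit        */
        uint16_t domid;
        uint32_t frame;
    };

    static int revoke_split(struct grant_entry_split *g)
    {
        g->access = 0;                    /* withdraw permission           */
        __sync_synchronize();             /* order the store before check  */
        return !(g->status & GTF_IN_USE); /* revoked iff no longer in use  */
    }

Both writers benefit from the split, since neither the guest revoking access nor Xen updating the in-use status has to contend on a single atomically updated word.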

3.5.4 Driver domain overhead

Figure 8 shows the CPU cycles consumed by the driver domain kernel (second bar; domain 0) and compares them with the kernel cost in native Linux (first bar). The results show that the kernel cost in the driver domain is almost twice the cost of the Linux kernel. This is somewhat surprising, since in the driver domain the packet is only processed by the lower level of the network stack (Ethernet), although it is handled by two device drivers: the native device driver and netback. We observe that a large number of the CPU cycles in the driver domain kernel are due to bridge and netfilter functions. Note that although the kernel has netfilter support enabled, no netfilter rule or filter is used in our experiments. The cost shown in the graph is the cost of the netfilter hooks in the bridge code, which are always executed to test whether a filter needs to be applied. The third bar in the figure shows the performance of a driver domain whose kernel is compiled with the bridge netfilter disabled. The results show that most of the bridge cost and all of the netfilter cost can be eliminated if the kernel is configured appropriately when netfilter rules are not needed.

4 Xen Network Design Changes

In the previous section we identified several sources of overhead for Xen PV network drivers. In this section we propose architectural changes to the Xen PV driver model that significantly improve performance. These changes modify the behavior of netfront and netback, and the I/O channel protocol. Some inefficiencies identified in Section 3 can be reduced through implementation optimizations that do not constitute architectural or protocol changes. Although some of the implementation optimizations are easier to implement in the new architecture, we evaluated their performance impact in the current architecture in order to separate their performance benefits from those of the architectural changes. The implementation optimizations include: disabling the bridge netfilter, using aligned data copies, avoiding socket buffer fragments, and the various grant optimizations discussed in Sections 3.5.2 and 3.5.3. The second bar in Figure 9 shows the cumulative performance impact of all these implementation optimizations and is used as a reference point for the performance improvements of the architectural changes presented in this section. In summary, the implementation optimizations reduce the CPU cost of the Xen PV driver by approximately 4,950 CPU cycles per packet. Of these, 1,550 cycles are due to disabling the bridge netfilter, 1,850 cycles due to using cache-aligned data copies, 900 cycles due to avoiding socket buffer fragments, and 650 cycles due to grant optimizations.

4.1 Move Data Copy to Guest

As described in Section 3, a primary source of overhead for the Xen PV driver is the additional data copy between the driver domain and the guest. In native Linux there is only one data copy, as the received packet is placed directly into a kernel socket buffer by the NIC and later copied from there to the application buffer. In the Xen PV driver model there are two data copies, as the received packet is first placed in kernel memory of the driver domain by the NIC, and then copied to kernel memory of the guest before it can be delivered to the application. This extra cost could be avoided if we could transfer ownership of the page containing the received packet from the driver domain to the guest, instead of copying the data. In fact this was the original approach used in previous versions of Xen [13].
The first versions of the Xen PV network driver used a page flipping mechanism which swapped the page containing the received packet with a free guest page, avoiding the data copy. The original page flip mechanism was replaced by the data copy mechanism in later versions of Xen for performance reasons. The cost of mapping and unmapping pages in both the guest and the driver domain was equivalent to the copy cost for large 1500-byte packets, which means that page flipping was less efficient than copying for small packet sizes [25]. In addition, the page flip mechanism increases memory fragmentation and prevents the use of super-page mappings with Xen. Menon et al. [17] have shown that super-pages provide superior performance in Xen, making page flipping unattractive. One problem with the current data copy mechanism is that the two data copies per packet are usually performed by different CPUs, leading to poor data cache behavior. On an SMP machine, it is expected that an I/O intensive guest and the driver domain will be executing on different CPUs, especially if there is high I/O demand. The overhead introduced by the extra data copy can be reduced if both copies are performed by the same CPU and benefit from cache locality. The CPU will bring the packet into its cache during the first data copy. If the data is still present in the CPU cache during the second copy, this copy will be significantly faster, using fewer CPU cycles. Of course there is no guarantee that the data will not be evicted from the cache before the second copy, which can be delayed arbitrarily depending on the application and overall system workload behavior. At high I/O rates, however, it is expected that the data will be delivered to the application as soon as it is received and will benefit from L2 cache locality. We modified the architecture of the Xen PV driver and moved the grant copy operation from the driver domain to the guest domain, improving cache locality for the second data copy. In the new design, when a packet is

received, netback issues a grant for the packet page to the guest and notifies the guest of the packet arrival through the I/O channel. When netfront receives the I/O channel notification, it issues a grant copy operation to copy the packet from a driver domain page to a local socket buffer, and then delivers the packet to the kernel network stack. Moving the grant copy to the guest has benefits beyond speeding up the second data copy. It avoids polluting the cache of the driver domain CPU with data that will not be used there, and thus should improve cache behavior in the driver domain as well. It also provides better CPU usage accounting. The CPU cycles used to copy the packet are now accounted to the guest instead of to the driver domain, increasing fairness when accounting for CPU usage in Xen scheduling. Another benefit is improved scalability for multiple guests. For high speed networks (e.g., 10GigE), the driver domain can become the bottleneck and reduce I/O throughput. Offloading some of the CPU cycles to the guests' CPUs prevents the driver domain from becoming a bottleneck, improving I/O scalability. Finally, some implementation optimizations described earlier are easier to implement when the copy is done by the guest. For example, it is easier to avoid the extra socket buffer fragment discussed in Section 3.5.2. Moving the copy to the guest allows the guest to allocate the buffer after the packet notification is received from netback at netfront. Having knowledge of the received packet allows netfront to allocate the appropriate buffers with the right sizes and alignments. Netfront can allocate one socket buffer for the first page of each packet and additional fragment pages only if the packet spans multiple pages. For the same reason, it is also easier to make aligned data copies, as the right socket buffer alignment can be selected at buffer allocation time.

Figure 9: Xen PV driver optimizations

The third bar in Figure 9 shows the performance benefit of moving the grant copy from the driver domain to the guest. The cost of the data copy is reduced because of better use of the L2 cache. In addition, the number of cycles consumed by Xen in guest context (xen) increases while it decreases for Xen in driver domain context (xen0). This is because the cycles used by the grant operation are shifted from the driver domain to the guest. We observe that the cost decrease in xen0 is higher than the cost increase in xen, leading to an overall reduction in the CPU cost of Xen. This is because the grant optimizations described in Section 3.5.3 are more effective when the grant operations are performed by the guest, as previously discussed. In summary, moving the copy from the driver domain to the guest reduces the CPU cost of the Xen PV driver by approximately 2,250 cycles per packet. Of these, 1,400 cycles are due to better cache locality for guest copies (usercopy, grantcopy, and also kernel, due to faster accesses to packet headers in protocol processing), 600 cycles are due to the grant optimizations being more effective (xen + xen0), and 250 cycles are due to less cache pollution in the driver domain (kernel0).
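A compact sketch of this reversed copy direction follows, reusing the grant-copy structure from the earlier netback sketch: after the backend's notification tells netfront the packet length and its offset in the granted page, netfront allocates a local socket buffer, shifts its data pointer so the destination shares the source's cache-line alignment, and issues the copy itself. The device pointer netfront_dev, the 128-byte alignment constant, and the helper name are assumptions for illustration.

    /* Guest-side copy sketch: netfront copies the received packet out of
     * the page granted by the driver domain into a local skb whose data
     * pointer is offset to match the source's cache-line alignment.       */
    #define RX_COPY_ALIGN 128                    /* assumed cache line size */

    extern struct net_device *netfront_dev;      /* assumed per-device state */

    static struct sk_buff *rx_copy_from_backend(domid_t backend,
                                                grant_ref_t gref,
                                                uint16_t src_offset,
                                                uint16_t len)
    {
        struct sk_buff *skb;
        struct gnttab_copy op;

        skb = netdev_alloc_skb(netfront_dev, len + RX_COPY_ALIGN);
        if (!skb)
            return NULL;
        /* Make skb->data congruent to src_offset modulo the cache line. */
        skb_reserve(skb, (src_offset - (unsigned long)skb->data)
                             & (RX_COPY_ALIGN - 1));

        memset(&op, 0, sizeof(op));
        op.source.u.ref  = gref;            /* backend page with the packet   */
        op.source.domid  = backend;
        op.source.offset = src_offset;
        op.dest.u.gmfn   = virt_to_mfn(skb->data);
        op.dest.domid    = DOMID_SELF;
        op.dest.offset   = offset_in_page(skb->data);
        op.len           = len;             /* must not cross a page boundary */
        op.flags         = GNTCOPY_source_gref;

        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);
        if (op.status != GNTST_okay) {
            dev_kfree_skb(skb);
            return NULL;
        }
        skb_put(skb, len);
        return skb;                         /* handed to the network stack    */
    }

Compared with the earlier backend-side sketch, only the copy direction flag and the choice of destination offset change; the hypercall and structures are the same.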
4.2 Extending Grant Mechanism

Moving the grant copy to the guest requires a couple of extensions to the Xen grant mechanism. We have not yet implemented these extensions, but we discuss them here. To support the copy in the receiving guest, the driver domain has to issue a grant to the memory page containing the received packet. This works fine for packets received on the physical device, since they are placed in driver domain memory. In contrast, the memory buffers with packets received from other guests are not owned by the driver domain and cannot be granted to the receiving guest using the current Xen grant mechanism. Thus, this mechanism needs to be extended to provide grant transitivity, allowing a domain that was granted access to another domain's page to transfer this right to a third domain. We must also ensure memory isolation and prevent guests from accessing packets destined to other guests. The main problem is that the same I/O buffer can be reused to receive new packets destined to different guests. When a small packet is received into a buffer previously used to receive a large packet for another domain, the buffer may still contain data from the old packet. Scrubbing the I/O pages after or before every I/O to remove old data would have a high overhead, wiping out the benefits of moving the copy to the guest. Instead, we can extend the Xen grant copy mechanism with offset and size fields to constrain the receiving guest to access only the region of the granted page containing the received packet data.

4.3 Support for Multi-queue Devices

Although moving the data copy to the guest reduces the CPU cost, eliminating the extra copy altogether should provide even better performance.

Figure 10: Multi-queue device support

The extra copy can be avoided if the NIC can place received packets directly into the guest kernel buffer. This is only possible if the NIC can identify the destination guest for each packet and select a buffer from the respective guest's memory in which to place the packet. As discussed in Section 1, NICs are now becoming available that have multiple receive (RX) queues which can be used to directly place received packets in guest memory [1]. These NICs can demultiplex incoming traffic into the multiple RX queues based on the packet destination MAC address and VLAN IDs. Each individual RX queue can be dedicated to a particular guest and programmed with the guest's MAC address. If the buffer descriptors posted at each RX queue point to kernel buffers of the respective guest, the device can place incoming packets directly into guest buffers, avoiding the extra data copy.

Figure 10 illustrates how the Xen PV network driver model can be modified to support multi-queue devices. Netfront posts grants to I/O buffers for use by the multi-queue device driver using the I/O channel. For multi-queue devices the driver domain must validate that the page belongs to the (untrusted) guest and must pin the page for the duration of the I/O to prevent the page from being reassigned to the hypervisor or other guests. The grant map and unmap operations accomplish these tasks in addition to mapping the page in the driver domain. Mapping the page is needed for guest-to-guest traffic, which traverses the driver domain network stack (bridge). Experimental results not presented here due to space limitations show that the additional cost of mapping the page is small compared to the overall cost of the grant operation. Netfront allocates two different types of I/O buffers which are posted to the I/O channel: regular socket buffers with the right alignments required by the network stack, and full pages for use as fragments in non-linear socket buffers. Posting fragment pages is optional and only needed if the device can receive large packets spanning multiple pages, either because it is configured with jumbo frames or because it supports Large Receive Offload (LRO). Netback uses these grants to map the guest pages into driver domain address space buffers and keeps them in two pools for each guest, one for each type of page. These buffers are provided to the physical device driver on demand when the driver needs to post RX descriptors to the RX queue associated with the guest. This requires that the device driver use new kernel functions to request I/O buffers from guest memory, which can be accomplished by providing two new I/O buffer allocation functions in the driver domain kernel, vmq_alloc_skb() and vmq_alloc_page(). These are equivalent to the traditional Linux functions netdev_alloc_skb() and alloc_page(), except that they take an additional parameter specifying the RX queue for which the buffer is being allocated. These functions return a buffer from the respective pool of guest I/O buffers; a sketch of this interface is shown below. When a packet is received, the NIC consumes one (or more, in the case of LRO or jumbo frames) of the posted RX buffers and places the data directly into the corresponding guest memory.
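The sketch below gives one plausible shape for this interface; only the three function names come from the text, while the parameter lists, return conventions, and the refill loop are assumptions.

    /* Hypothetical prototypes for the proposed multi-queue buffer interface.
     * Each call names the RX queue (and therefore the guest) the buffer
     * belongs to, so netback can hand out guest-owned memory to the driver. */
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    #define RX_BUF_LEN 1536                 /* illustrative buffer size */

    /* Like netdev_alloc_skb(), but backed by the guest bound to 'queue'. */
    struct sk_buff *vmq_alloc_skb(struct net_device *dev, int queue,
                                  unsigned int length);

    /* Like alloc_page(), for fragment pages (jumbo frames or LRO). */
    struct page *vmq_alloc_page(struct net_device *dev, int queue);

    /* Like netif_rx(), but tells netback which RX queue (guest) the packet
     * arrived on, so the bridge can be bypassed (described next). */
    int vmq_netif_rx(struct sk_buff *skb, int queue);

    /* Hypothetical NIC-specific helper that writes an RX descriptor. */
    void post_rx_descriptor(struct net_device *dev, int queue,
                            struct sk_buff *skb);

    /* Example refill path in a multi-queue driver: keep a guest's RX queue
     * stocked with buffers drawn from that guest's posted grants. */
    static void refill_rx_queue(struct net_device *dev, int queue, int budget)
    {
        while (budget-- > 0) {
            struct sk_buff *skb = vmq_alloc_skb(dev, queue, RX_BUF_LEN);
            if (!skb)
                break;          /* guest has not posted enough buffers yet */
            post_rx_descriptor(dev, queue, skb);
        }
    }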
The device driver is notified by the NIC that a packet has arrived and then forwards the packet directly to netback. Since the NIC already demultiplexes packets, there is no need to use the bridge in the driver domain. The device driver can instead send the packet directly to netback using a new kernel function, vmq_netif_rx(). This function is equivalent to the traditional Linux netif_rx() typically used by network drivers to deliver received packets, except that the new function takes an additional parameter specifying which RX queue received the packet. Note that these new kernel functions are not specific to Xen but enable the use of multi-queue devices with any virtualization technology based on the Linux kernel, such as KVM [23]. These new functions only need to be used in new device drivers for modern devices that have multi-queue support. They extend the current interface between Linux and network devices, enabling the use of multi-queue devices for virtualization. This means that the same device driver for a multi-queue device can be used with Xen, KVM, or any other Linux-based virtualization technology. We observe that the optimization that moves the copy to the guest is still useful even when using multi-queue devices. One reason is that the number of guests may exceed the number of RX queues, causing some guests to share the same RX queue. In this case, guests that share a queue should still use the grant copy mechanism to copy packets from driver domain memory to the guest. Also, grant copy is still needed to deliver local guest-to-guest traffic. When a guest sends a packet to another local guest, the data needs to be copied from

one guest's memory to the other's, instead of being sent to the physical device. In these cases moving the copy to the guest still provides performance benefits. To support receiving packets from both the multi-queue device and other guests, netfront receives both types of packets on the RX I/O channel shown in Figure 10. Packets from other guests arrive with copy grants that are used by netfront, whereas packets from the device use pre-posted buffers. We have implemented a PV driver prototype with multi-queue support to evaluate its performance impact. However, we did not have a NIC with multi-queue support available for our prototype. We instead used a traditional single-queue NIC to estimate the benefits of a multi-queue device. We modified the device driver to dedicate the single RX queue of the NIC to the guest, and we modified netback to forward guest buffers posted by netfront to the physical device driver, so that the single-queue device could accurately emulate the behavior of a multi-queue device.

The fourth bar in Figure 9 shows the performance impact of using multi-queue devices. As expected, the cost of the grant copy is eliminated, as the packet is now placed directly in guest kernel memory. The number of CPU cycles in xen for guest context is also reduced, since there is no grant copy operation being performed. On the other hand, the driver domain has to use two grant operations per packet to map and unmap the guest page, as opposed to one operation for the grant copy, increasing the number of cycles in xen0 and reducing the benefits of removing the copy cost. However, the number of CPU cycles consumed in the driver domain kernel is also reduced when using multi-queue devices, for two reasons. First, packets are forwarded directly from the device driver to the guest, avoiding the forwarding costs in the bridge. Second, since netback is now involved in both allocating and deallocating socket buffer structures for the driver domain, it can avoid the costs of their allocation and deallocation. Instead, netback recycles the same set of socket buffer structures across multiple I/O operations. It only has to change their memory mappings for every new I/O, using the grant mechanism to make the socket buffers point to the physical pages containing the guest I/O buffers. Surprisingly, the simplifications in the driver domain have a higher impact on the CPU cost than the elimination of the extra data copy. In summary, direct placement of data in guest memory reduces the PV driver cost by 700 CPU cycles per packet, and simplified socket buffer allocation and simpler packet routing in the driver domain reduce the cost by an additional 1,150 cycles per packet. On the other hand, the higher cost of grant mapping compared to grant copying increases the cost by 650 cycles per packet, providing a net cost reduction of 1,200 CPU cycles per packet. However, the benefit of multi-queue devices can actually be larger when we avoid the costs associated with grants, as discussed in the next section.

Figure 11: Caching and reusing grants

4.4 Caching and Reusing Grants

As shown in Section 3.5.3, a large fraction of the Xen CPU cycles consumed during I/O operations is due to grant operation functions.
In this section we describe a grant reuse mechanism that can eliminate most of this cost. The number of grant operations performed in the driver domain can be reduced if we relax the memory isolation property slightly and allow the driver domain to keep guest I/O buffers mapped in its address space even after the I/O is completed. If the guest recycles I/O memory and reuses previously used I/O pages for new I/O operations, the cost of mapping the guest pages using the grant mechanism is amortized over multiple I/O operations. Fortunately, most operating systems tend to recycle I/O buffers. For example, the Linux slab allocator used to allocate socket buffers keeps previously used buffers in a cache, which is then used to allocate new I/O buffers. In practice, keeping I/O buffer mappings for longer times does not compromise the fault isolation properties of driver domains, as the driver domain still can only access the same set of I/O pages and no pages containing any other guest data or code. In order to reuse grants, the Xen PV driver needs to be modified as illustrated in Figure 11. Netback keeps a cache of currently mapped grants for every guest. On every RX buffer posted by the guest (when using a multi-queue device) and on every TX request, netback checks whether the granted page is already mapped in its address space, mapping it only if necessary. When the I/O completes, the mapping is not removed, allowing it to be reused in future I/O operations. It is important, though, to enable the
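A minimal sketch of the per-guest grant cache described above follows, assuming a hypothetical fixed-size table keyed by grant reference and simple wrappers (map_grant, unmap_grant) around the grant map/unmap hypercalls; the eviction policy and the guest-side invalidation path are left out.

    /* Sketch of netback's grant cache: map a granted guest page only the
     * first time it is seen and reuse the mapping on later I/Os.  Types
     * and helpers are illustrative; only the caching idea is from the text. */
    #include <linux/hash.h>
    #include <xen/grant_table.h>

    #define GRANT_CACHE_BITS 8
    #define GRANT_CACHE_SIZE (1 << GRANT_CACHE_BITS)

    struct grant_cache_entry {
        grant_ref_t    gref;        /* guest grant reference               */
        void          *vaddr;       /* driver-domain mapping, NULL if none */
        grant_handle_t handle;      /* needed to unmap later               */
    };

    struct guest_grant_cache {
        domid_t guest;
        struct grant_cache_entry slot[GRANT_CACHE_SIZE];
    };

    /* Hypothetical wrappers around the grant map/unmap hypercalls. */
    void *map_grant(domid_t guest, grant_ref_t gref, grant_handle_t *handle);
    void  unmap_grant(grant_handle_t handle, void *vaddr);

    /* Return a driver-domain address for the page behind 'gref', issuing a
     * grant map only on a cache miss (or when evicting a stale entry).    */
    static void *grant_cache_lookup(struct guest_grant_cache *cache,
                                    grant_ref_t gref)
    {
        struct grant_cache_entry *e =
            &cache->slot[hash_32(gref, GRANT_CACHE_BITS)];

        if (e->vaddr && e->gref == gref)
            return e->vaddr;              /* hit: reuse mapping, no hypercall */

        if (e->vaddr)
            unmap_grant(e->handle, e->vaddr);  /* evict the old mapping       */

        e->gref  = gref;
        e->vaddr = map_grant(cache->guest, gref, &e->handle);
        return e->vaddr;
    }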


More information

Utilizing the IOMMU scalably

Utilizing the IOMMU scalably Utilizing the IOMMU scalably Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir USENIX ATC 15 2017711456 Shin Seok Ha 1 Introduction What is an IOMMU? Provides the translation between IO addresses

More information

Virtualization with XEN. Trusted Computing CS599 Spring 2007 Arun Viswanathan University of Southern California

Virtualization with XEN. Trusted Computing CS599 Spring 2007 Arun Viswanathan University of Southern California Virtualization with XEN Trusted Computing CS599 Spring 2007 Arun Viswanathan University of Southern California A g e n d a Introduction Virtualization approaches Basic XEN Architecture Setting up XEN Bootstrapping

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Tolerating Malicious Drivers in Linux. Silas Boyd-Wickizer and Nickolai Zeldovich

Tolerating Malicious Drivers in Linux. Silas Boyd-Wickizer and Nickolai Zeldovich XXX Tolerating Malicious Drivers in Linux Silas Boyd-Wickizer and Nickolai Zeldovich How could a device driver be malicious? Today's device drivers are highly privileged Write kernel memory, allocate memory,...

More information

Keeping up with the hardware

Keeping up with the hardware Keeping up with the hardware Challenges in scaling I/O performance Jonathan Davies XenServer System Performance Lead XenServer Engineering, Citrix Cambridge, UK 18 Aug 2015 Jonathan Davies (Citrix) Keeping

More information

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Today s Papers Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Edouard

More information

Part 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS

Part 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS Some statistics 70% of OS code is in device s 3,448,000 out of 4,997,000 loc in Linux 2.6.27 A typical Linux laptop runs ~240,000 lines of kernel code, including ~72,000 loc in 36 different device s s

More information

CSC 5930/9010 Cloud S & P: Virtualization

CSC 5930/9010 Cloud S & P: Virtualization CSC 5930/9010 Cloud S & P: Virtualization Professor Henry Carter Fall 2016 Recap Network traffic can be encrypted at different layers depending on application needs TLS: transport layer IPsec: network

More information

The Price of Safety: Evaluating IOMMU Performance

The Price of Safety: Evaluating IOMMU Performance The Price of Safety: Evaluating IOMMU Performance Muli Ben-Yehuda 1 Jimi Xenidis 2 Michal Ostrowski 2 Karl Rister 3 Alexis Bruemmer 3 Leendert Van Doorn 4 1 muli@il.ibm.com 2 {jimix,mostrows}@watson.ibm.com

More information

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems Lecture 7 Xen and the Art of Virtualization Paul Braham, Boris Dragovic, Keir Fraser et al. Advanced Operating Systems 16 November, 2011 SOA/OS Lecture 7, Xen 1/38 Contents Virtualization Xen Memory CPU

More information

Software Routers: NetMap

Software Routers: NetMap Software Routers: NetMap Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking October 8, 2014 Slides from the NetMap: A Novel Framework for

More information

A Novel Approach to Gain High Throughput and Low Latency through SR- IOV

A Novel Approach to Gain High Throughput and Low Latency through SR- IOV A Novel Approach to Gain High Throughput and Low Latency through SR- IOV Usha Devi G #1, Kasthuri Theja Peduru #2, Mallikarjuna Reddy B #3 School of Information Technology, VIT University, Vellore 632014,

More information

Virtual Machine Monitors (VMMs) are a hot topic in

Virtual Machine Monitors (VMMs) are a hot topic in CSE 120 Principles of Operating Systems Winter 2007 Lecture 16: Virtual Machine Monitors Keith Marzullo and Geoffrey M. Voelker Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot topic

More information

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System Abstract In this paper, we address the latency issue in RT- XEN virtual machines that are available in Xen 4.5. Despite the advantages of applying virtualization to systems, the default credit scheduler

More information

Linux and Xen. Andrea Sarro. andrea.sarro(at)quadrics.it. Linux Kernel Hacking Free Course IV Edition

Linux and Xen. Andrea Sarro. andrea.sarro(at)quadrics.it. Linux Kernel Hacking Free Course IV Edition Linux and Xen Andrea Sarro andrea.sarro(at)quadrics.it Linux Kernel Hacking Free Course IV Edition Andrea Sarro (andrea.sarro(at)quadrics.it) Linux and Xen 07/05/2008 1 / 37 Introduction Xen and Virtualization

More information

OS Virtualization. Why Virtualize? Introduction. Virtualization Basics 12/10/2012. Motivation. Types of Virtualization.

OS Virtualization. Why Virtualize? Introduction. Virtualization Basics 12/10/2012. Motivation. Types of Virtualization. Virtualization Basics Motivation OS Virtualization CSC 456 Final Presentation Brandon D. Shroyer Types of Virtualization Process virtualization (Java) System virtualization (classic, hosted) Emulation

More information

Enabling Efficient and Scalable Zero-Trust Security

Enabling Efficient and Scalable Zero-Trust Security WHITE PAPER Enabling Efficient and Scalable Zero-Trust Security FOR CLOUD DATA CENTERS WITH AGILIO SMARTNICS THE NEED FOR ZERO-TRUST SECURITY The rapid evolution of cloud-based data centers to support

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

The Convergence of Storage and Server Virtualization Solarflare Communications, Inc.

The Convergence of Storage and Server Virtualization Solarflare Communications, Inc. The Convergence of Storage and Server Virtualization 2007 Solarflare Communications, Inc. About Solarflare Communications Privately-held, fabless semiconductor company. Founded 2001 Top tier investors:

More information

references Virtualization services Topics Virtualization

references Virtualization services Topics Virtualization references Virtualization services Virtual machines Intel Virtualization technology IEEE xplorer, May 2005 Comparison of software and hardware techniques for x86 virtualization ASPLOS 2006 Memory resource

More information

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate NIC-PCIE-1SFP+-PLU PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate Flexibility and Scalability in Virtual

More information

Xen and the Art of Virtualization

Xen and the Art of Virtualization Xen and the Art of Virtualization Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield Presented by Thomas DuBuisson Outline Motivation

More information

QuickSpecs. HP Z 10GbE Dual Port Module. Models

QuickSpecs. HP Z 10GbE Dual Port Module. Models Overview Models Part Number: 1Ql49AA Introduction The is a 10GBASE-T adapter utilizing the Intel X722 MAC and X557-AT2 PHY pairing to deliver full line-rate performance, utilizing CAT 6A UTP cabling (or

More information

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD Virtual Leverage: Server Consolidation in Open Source Environments Margaret Lewis Commercial Software Strategist AMD What Is Virtualization? Abstraction of Hardware Components Virtual Memory Virtual Volume

More information

Large Receive Offload implementation in Neterion 10GbE Ethernet driver

Large Receive Offload implementation in Neterion 10GbE Ethernet driver Large Receive Offload implementation in Neterion 10GbE Ethernet driver Leonid Grossman Neterion, Inc. leonid@neterion.com Abstract 1 Introduction The benefits of TSO (Transmit Side Offload) implementation

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班 I/O virtualization Jiang, Yunhong Yang, Xiaowei 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008

vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008 vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008 Virtual Datacenter OS from VMware Infrastructure vservices and Cloud vservices Existing New - roadmap Virtual Datacenter OS from VMware Agenda

More information

Intel Virtualization Technology Roadmap and VT-d Support in Xen

Intel Virtualization Technology Roadmap and VT-d Support in Xen Intel Virtualization Technology Roadmap and VT-d Support in Xen Jun Nakajima Intel Open Source Technology Center Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.

More information

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization X86 operating systems are designed to run directly on the bare-metal hardware,

More information

Protection Strategies for Direct Access to Virtualized I/O Devices

Protection Strategies for Direct Access to Virtualized I/O Devices Protection Strategies for Direct Access to Virtualized I/O Devices Paul Willmann, Scott Rixner, and Alan L. Cox Rice University {willmann, rixner, alc}@rice.edu Abstract Commodity virtual machine monitors

More information

Introduction to Oracle VM (Xen) Networking

Introduction to Oracle VM (Xen) Networking Introduction to Oracle VM (Xen) Networking Dongli Zhang Oracle Asia Research and Development Centers (Beijing) dongli.zhang@oracle.com May 30, 2017 Dongli Zhang (Oracle) Introduction to Oracle VM (Xen)

More information

Much Faster Networking

Much Faster Networking Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path

More information

Xen and the Art of Virtualization. Nikola Gvozdiev Georgian Mihaila

Xen and the Art of Virtualization. Nikola Gvozdiev Georgian Mihaila Xen and the Art of Virtualization Nikola Gvozdiev Georgian Mihaila Outline Xen and the Art of Virtualization Ian Pratt et al. I. The Art of Virtualization II. Xen, goals and design III. Xen evaluation

More information

Disco: Running Commodity Operating Systems on Scalable Multiprocessors

Disco: Running Commodity Operating Systems on Scalable Multiprocessors Disco: Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine, Kinskuk Govil and Mendel Rosenblum Stanford University Presented by : Long Zhang Overiew Background

More information

Virtualization. Dr. Yingwu Zhu

Virtualization. Dr. Yingwu Zhu Virtualization Dr. Yingwu Zhu Virtualization Definition Framework or methodology of dividing the resources of a computer into multiple execution environments. Types Platform Virtualization: Simulate a

More information

Virtual Machines. Part 2: starting 19 years ago. Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

Virtual Machines. Part 2: starting 19 years ago. Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved. Virtual Machines Part 2: starting 19 years ago Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved. Operating Systems In Depth IX 2 Copyright 2018 Thomas W. Doeppner.

More information

Virtual Virtual Memory

Virtual Virtual Memory Virtual Virtual Memory Jason Power 3/20/2015 With contributions from Jayneel Gandhi and Lena Olson 4/17/2015 UNIVERSITY OF WISCONSIN 1 Virtual Machine History 1970 s: VMMs 1997: Disco 1999: VMWare (binary

More information

Dynamic Translator-Based Virtualization

Dynamic Translator-Based Virtualization Dynamic Translator-Based Virtualization Yuki Kinebuchi 1,HidenariKoshimae 1,ShuichiOikawa 2, and Tatsuo Nakajima 1 1 Department of Computer Science, Waseda University {yukikine, hide, tatsuo}@dcl.info.waseda.ac.jp

More information

Lecture 5: February 3

Lecture 5: February 3 CMPSCI 677 Operating Systems Spring 2014 Lecture 5: February 3 Lecturer: Prashant Shenoy Scribe: Aditya Sundarrajan 5.1 Virtualization Virtualization is a technique that extends or replaces an existing

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Enabling Fast, Dynamic Network Processing with ClickOS

Enabling Fast, Dynamic Network Processing with ClickOS Enabling Fast, Dynamic Network Processing with ClickOS Joao Martins*, Mohamed Ahmed*, Costin Raiciu, Roberto Bifulco*, Vladimir Olteanu, Michio Honda*, Felipe Huici* * NEC Labs Europe, Heidelberg, Germany

More information

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1 What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................

More information

System Virtual Machines

System Virtual Machines System Virtual Machines Outline Need and genesis of system Virtual Machines Basic concepts User Interface and Appearance State Management Resource Control Bare Metal and Hosted Virtual Machines Co-designed

More information

Got Loss? Get zovn! Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat. ACM SIGCOMM 2013, August, Hong Kong, China

Got Loss? Get zovn! Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat. ACM SIGCOMM 2013, August, Hong Kong, China Got Loss? Get zovn! Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat ACM SIGCOMM 2013, 12-16 August, Hong Kong, China Virtualized Server 1 Application Performance in Virtualized

More information

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand Introduction to Virtual Machines Nima Honarmand Virtual Machines & Hypervisors Virtual Machine: an abstraction of a complete compute environment through the combined virtualization of the processor, memory,

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

FAQ. Release rc2

FAQ. Release rc2 FAQ Release 19.02.0-rc2 January 15, 2019 CONTENTS 1 What does EAL: map_all_hugepages(): open failed: Permission denied Cannot init memory mean? 2 2 If I want to change the number of hugepages allocated,

More information

Virtualization. Starting Point: A Physical Machine. What is a Virtual Machine? Virtualization Properties. Types of Virtualization

Virtualization. Starting Point: A Physical Machine. What is a Virtual Machine? Virtualization Properties. Types of Virtualization Starting Point: A Physical Machine Virtualization Based on materials from: Introduction to Virtual Machines by Carl Waldspurger Understanding Intel Virtualization Technology (VT) by N. B. Sahgal and D.

More information

Virtualization. ! Physical Hardware Processors, memory, chipset, I/O devices, etc. Resources often grossly underutilized

Virtualization. ! Physical Hardware Processors, memory, chipset, I/O devices, etc. Resources often grossly underutilized Starting Point: A Physical Machine Virtualization Based on materials from: Introduction to Virtual Machines by Carl Waldspurger Understanding Intel Virtualization Technology (VT) by N. B. Sahgal and D.

More information

High Performance Packet Processing with FlexNIC

High Performance Packet Processing with FlexNIC High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet

More information

Server Virtualization Approaches

Server Virtualization Approaches Server Virtualization Approaches Virtual Machine Applications Emulation Replication Composition Emulation: Mix-and-match cross-platform portability Replication: Multiple VMs on single platform Composition:

More information

Virtualization. Pradipta De

Virtualization. Pradipta De Virtualization Pradipta De pradipta.de@sunykorea.ac.kr Today s Topic Virtualization Basics System Virtualization Techniques CSE506: Ext Filesystem 2 Virtualization? A virtual machine (VM) is an emulation

More information

W H I T E P A P E R. What s New in VMware vsphere 4: Performance Enhancements

W H I T E P A P E R. What s New in VMware vsphere 4: Performance Enhancements W H I T E P A P E R What s New in VMware vsphere 4: Performance Enhancements Scalability Enhancements...................................................... 3 CPU Enhancements............................................................

More information

CS-580K/480K Advanced Topics in Cloud Computing. VM Virtualization II

CS-580K/480K Advanced Topics in Cloud Computing. VM Virtualization II CS-580K/480K Advanced Topics in Cloud Computing VM Virtualization II 1 How to Build a Virtual Machine? 2 How to Run a Program Compiling Source Program Loading Instruction Instruction Instruction Instruction

More information

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks LINUX-KVM The need for KVM x86 originally virtualization unfriendly No hardware provisions Instructions behave differently depending on privilege context(popf) Performance suffered on trap-and-emulate

More information

RiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner

RiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner RiceNIC Prototyping Network Interfaces Jeffrey Shafer Scott Rixner RiceNIC Overview Gigabit Ethernet Network Interface Card RiceNIC - Prototyping Network Interfaces 2 RiceNIC Overview Reconfigurable and

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Knut Omang Ifi/Oracle 20 Oct, Introduction to virtualization (Virtual machines) Aspects of network virtualization:

Knut Omang Ifi/Oracle 20 Oct, Introduction to virtualization (Virtual machines) Aspects of network virtualization: Software and hardware support for Network Virtualization part 2 Knut Omang Ifi/Oracle 20 Oct, 2015 32 Overview Introduction to virtualization (Virtual machines) Aspects of network virtualization: Virtual

More information

Xen and the Art of Virtualization

Xen and the Art of Virtualization Xen and the Art of Virtualization Paul Barham,, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer,, Ian Pratt, Andrew Warfield University of Cambridge Computer Laboratory Presented

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

440GX Application Note

440GX Application Note Overview of TCP/IP Acceleration Hardware January 22, 2008 Introduction Modern interconnect technology offers Gigabit/second (Gb/s) speed that has shifted the bottleneck in communication from the physical

More information

RDMA-like VirtIO Network Device for Palacios Virtual Machines

RDMA-like VirtIO Network Device for Palacios Virtual Machines RDMA-like VirtIO Network Device for Palacios Virtual Machines Kevin Pedretti UNM ID: 101511969 CS-591 Special Topics in Virtualization May 10, 2012 Abstract This project developed an RDMA-like VirtIO network

More information

Virtualization and memory hierarchy

Virtualization and memory hierarchy Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

Operating Systems 4/27/2015

Operating Systems 4/27/2015 Virtualization inside the OS Operating Systems 24. Virtualization Memory virtualization Process feels like it has its own address space Created by MMU, configured by OS Storage virtualization Logical view

More information

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances Technology Brief Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances The world

More information

by Brian Hausauer, Chief Architect, NetEffect, Inc

by Brian Hausauer, Chief Architect, NetEffect, Inc iwarp Ethernet: Eliminating Overhead In Data Center Designs Latest extensions to Ethernet virtually eliminate the overhead associated with transport processing, intermediate buffer copies, and application

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Architecture and Performance Implications

Architecture and Performance Implications VMWARE WHITE PAPER VMware ESX Server 2 Architecture and Performance Implications ESX Server 2 is scalable, high-performance virtualization software that allows consolidation of multiple applications in

More information

Knut Omang Ifi/Oracle 6 Nov, 2017

Knut Omang Ifi/Oracle 6 Nov, 2017 Software and hardware support for Network Virtualization part 1 Knut Omang Ifi/Oracle 6 Nov, 2017 1 Motivation Goal: Introduction to challenges in providing fast networking to virtual machines Prerequisites:

More information

Data Path acceleration techniques in a NFV world

Data Path acceleration techniques in a NFV world Data Path acceleration techniques in a NFV world Mohanraj Venkatachalam, Purnendu Ghosh Abstract NFV is a revolutionary approach offering greater flexibility and scalability in the deployment of virtual

More information

Cloud Computing Virtualization

Cloud Computing Virtualization Cloud Computing Virtualization Anil Madhavapeddy anil@recoil.org Contents Virtualization. Layering and virtualization. Virtual machine monitor. Virtual machine. x86 support for virtualization. Full and

More information

A Userspace Packet Switch for Virtual Machines

A Userspace Packet Switch for Virtual Machines SHRINKING THE HYPERVISOR ONE SUBSYSTEM AT A TIME A Userspace Packet Switch for Virtual Machines Julian Stecklina OS Group, TU Dresden jsteckli@os.inf.tu-dresden.de VEE 2014, Salt Lake City 1 Motivation

More information

Video capture using GigE Vision with MIL. What is GigE Vision

Video capture using GigE Vision with MIL. What is GigE Vision What is GigE Vision GigE Vision is fundamentally a standard for transmitting video from a camera (see Figure 1) or similar device over Ethernet and is primarily intended for industrial imaging applications.

More information

Optimizing Performance: Intel Network Adapters User Guide

Optimizing Performance: Intel Network Adapters User Guide Optimizing Performance: Intel Network Adapters User Guide Network Optimization Types When optimizing network adapter parameters (NIC), the user typically considers one of the following three conditions

More information

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

Chapter 5 C. Virtual machines

Chapter 5 C. Virtual machines Chapter 5 C Virtual machines Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple guests Avoids security and reliability problems Aids sharing

More information

An Intelligent NIC Design Xin Song

An Intelligent NIC Design Xin Song 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 What is an Operating System? What is

More information

Distributed Systems Operation System Support

Distributed Systems Operation System Support Hajussüsteemid MTAT.08.009 Distributed Systems Operation System Support slides are adopted from: lecture: Operating System(OS) support (years 2016, 2017) book: Distributed Systems: Concepts and Design,

More information