Bridging the Gap between Software and Hardware Techniques for I/O Virtualization


Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, Ian Pratt
HP Laboratories HPL

Keyword(s): Virtualization, Networking, Operating System, I/O, Performance Analysis, Xen

Abstract: The paravirtualized I/O driver domain model, used in Xen, provides several advantages including device driver isolation in a safe execution environment, support for guest VM transparent services including live migration, and hardware independence for guests. However, these advantages currently come at the cost of high CPU overhead which can lead to low throughput for high bandwidth links such as 10 gigabit Ethernet. Direct I/O has been proposed as the solution to this performance problem, but at the cost of removing the benefits of the driver domain model. In this paper we show how to significantly narrow the performance gap by improving the performance of the driver domain model. In particular, we reduce execution costs for conventional NICs by 56% on the receive path, and we achieve close to direct I/O performance for network devices supporting multiple hardware receive queues. These results make the Xen driver domain model an attractive solution for I/O virtualization for a wider range of scenarios.

External Posting Date: June 21, 2008 [Fulltext]  Approved for External Publication
Internal Posting Date: June 21, 2008 [Fulltext]
To be published and presented at the 2008 USENIX Annual Technical Conference, June 22-27, 2008, Boston, Massachusetts.
Copyright 2008 USENIX Annual Technical Conference

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, Ian Pratt
Hewlett Packard Laboratories, Palo Alto, CA; University of Cambridge, Cambridge, UK
Currently at Skytap.

Abstract

The paravirtualized I/O driver domain model, used in Xen, provides several advantages including device driver isolation in a safe execution environment, support for guest VM transparent services including live migration, and hardware independence for guests. However, these advantages currently come at the cost of high CPU overhead which can lead to low throughput for high bandwidth links such as 10 gigabit Ethernet. Direct I/O has been proposed as the solution to this performance problem, but at the cost of removing the benefits of the driver domain model. In this paper we show how to significantly narrow the performance gap by improving the performance of the driver domain model. In particular, we reduce execution costs for conventional NICs by 56% on the receive path, and we achieve close to direct I/O performance for network devices supporting multiple hardware receive queues. These results make the Xen driver domain model an attractive solution for I/O virtualization for a wider range of scenarios.

1 Introduction

In virtual machine environments like VMware [12], Xen [7], and KVM [23], a major source of performance degradation is the cost of virtualizing I/O devices to allow multiple guest VMs to securely share a single device. While the techniques used for virtualizing CPU and memory resources have very low overhead, leading to near native performance [28][7][4], it is challenging to efficiently virtualize most current I/O devices. Each interaction between a guest OS and an I/O device needs to undergo costly interception and validation by the virtualization layer for security isolation and for data multiplexing and demultiplexing [27]. This problem is particularly acute when virtualizing high-bandwidth network interface devices, because frequent software interactions with the device are needed to handle the high rate of packet arrivals.

Paravirtualization [29] has been proposed and used (e.g., in Xen [7][13]) to significantly shrink the cost and complexity of I/O device virtualization compared to using full device emulation. In this approach, the guest OS executes a paravirtualized (PV) driver that operates on a simplified abstract device model exported to the guest. The real device driver that actually accesses the hardware can reside in the hypervisor, or in a separate device driver domain which has privileged access to the device hardware. Using device driver domains is attractive because they allow the use of legacy OS device drivers for portability, and because they provide a safe execution environment isolated from the hypervisor and other guest VMs [16][13]. Even with the use of PV drivers, there remains very high CPU overhead (e.g., a factor of four) compared to running in non-virtualized environments [18][17], leading to throughput degradation for high bandwidth links (e.g., 10 gigabits/second Ethernet). While there has been significant recent progress making the transmit path more efficient for paravirtualized I/O [17], little has been done to streamline the receive path, the focus of this paper.
To avoid the high performance overheads of software-based I/O device virtualization, efforts in academia and industry are adding, to varying degrees, hardware support for virtualization into I/O devices and platforms [1][24][31][22][3][6]. These approaches present a tradeoff between efficiency and transparency of I/O device virtualization. In particular, using hardware support for direct I/O [31][24][22], in which a device presents multiple logical interfaces which can be securely accessed by guest VMs bypassing the virtualization layer, results in the best possible performance, with CPU cost close to native performance. However, direct I/O lacks key advantages of a dedicated driver domain model: device driver isolation in a safe execution environment avoiding guest domain corruption by buggy drivers,

and full support for guest VM transparent services including live migration [26][21][11] and traffic monitoring. Restoring these services would require either additional support in the devices or breaking the transparency of the services to the guest VM. In addition, it is difficult to exploit direct I/O in emerging virtual appliance models of software distribution, which rely on the ability to execute on arbitrary hardware platforms. To use direct I/O, virtual appliances would have to include device drivers for a large variety of devices, increasing their complexity, size, and maintenance burden.

In this paper, we significantly bridge the performance gap between the driver domain model and direct I/O in the Xen virtual machine environment, making the driver domain model a competitive and attractive approach in a wider range of scenarios. We first present a detailed analysis of the CPU costs of I/O operations for devices without hardware support for virtualization. Based on this analysis we present implementation and configuration optimizations, and we propose changes to the software architecture to reduce the remaining large costs identified by the analysis: per-packet overheads for memory management and protection, and per-byte data copy overheads. Our experimental results show that the proposed modifications reduce the CPU cost by 56% for a streaming receive microbenchmark.

In addition to improving the virtualization of conventional network devices, we propose extensions to the Xen driver domain model to take advantage of emerging network interface devices that provide multiple hardware receive queues, with packet demultiplexing performed by the device based on packet MAC address and VLAN ID [1]. The combination of multi-queue devices with our software architecture extensions provides a solution that retains all the advantages of the driver domain model and preserves all the benefits of virtualization, including guest-transparent migration and other services. Our results show that this approach has low overhead, incurring CPU cost close to that of direct I/O for a streaming receive microbenchmark.

The rest of the paper is organized as follows. Section 2 reviews the Xen driver domain model. Section 3 presents a detailed analysis and breakdown of the costs of I/O virtualization. Section 4 presents our implementation and architectural changes that significantly improve performance for the driver domain model. Section 5 discusses how configuration settings affect I/O virtualization performance, and Section 6 presents our conclusions.

2 Xen Network I/O Architecture

Figure 1 shows the architecture of Xen paravirtualized (PV) networking.

Figure 1: Xen PV driver architecture

Guest domains (i.e., running virtual machines) host a paravirtualized device driver, netfront, which interacts with the device indirectly through a separate device driver domain, which has privileged access to the hardware. Driver domains directly access hardware devices that they own; however, interrupts from these devices are first handled by the hypervisor, which then notifies the corresponding driver domain through virtual interrupts. Netfront communicates with a counterpart backend driver called netback in the driver domain, using shared memory I/O channels.
The driver domain uses a software bridge to route packets between the physical device and multiple guests through their netback interfaces. Each I/O channel comprises an event notification mechanism and a bidirectional ring of asynchronous requests carrying I/O buffer descriptors from netfront to netback, and the corresponding responses. The event notification mechanism enables netfront and netback to trigger a virtual interrupt on the other domain to indicate that new requests or responses have been posted. To enable driver domains to access I/O buffers in guest memory, Xen provides a page grant mechanism. A guest creates a grant reference providing access to a page and forwards the reference as part of the I/O buffer descriptor. By invoking a hypercall, the driver domain uses the grant reference to access the guest page. For transmit (TX) requests, the driver domain uses a hypercall to map the guest page into its address space before sending the request through the bridge. When the physical device driver frees the page, a callback function is automatically invoked to return a response to netfront, which then revokes the grant. For RX requests, netfront posts I/O buffer page grants to the RX I/O channel. When netback receives a packet from the bridge, it retrieves a posted grant from the I/O channel and issues a grant copy hypercall to copy the packet into the guest page. Finally, netback sends a response to the guest via the RX channel indicating that a packet is available.
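To make the receive path above concrete, the following is a minimal sketch of how a netback-style backend might build a grant copy request for one received packet and hand it to the hypervisor. It is based on the Xen grant-table interface as found in Linux/Xen trees of this era; the ring fetch and response steps are omitted, and the parameters (guest id, grant reference, destination offset) are assumed to have been extracted from the posted RX request.

    /*
     * Sketch of the netback RX step described above: copy a packet that
     * arrived in driver-domain memory into the guest page named by a grant
     * reference that netfront posted on the RX I/O channel.  Header and
     * field names may differ in detail across kernel trees.
     */
    #include <linux/skbuff.h>
    #include <linux/string.h>
    #include <linux/errno.h>
    #include <xen/interface/grant_table.h>
    #include <asm/xen/hypercall.h>
    #include <asm/xen/page.h>

    static int copy_packet_to_guest(struct sk_buff *skb, domid_t guest,
                                    grant_ref_t gref, uint16_t dest_offset)
    {
        struct gnttab_copy op;

        memset(&op, 0, sizeof(op));
        /* Source: the packet data sitting in a local driver-domain page. */
        op.source.u.gmfn = virt_to_mfn(skb->data);
        op.source.domid  = DOMID_SELF;
        op.source.offset = offset_in_page(skb->data);
        /* Destination: the guest page named by the posted grant reference. */
        op.dest.u.ref    = gref;
        op.dest.domid    = guest;
        op.dest.offset   = dest_offset;
        op.len           = skb_headlen(skb);
        op.flags         = GNTCOPY_dest_gref;

        /* A real backend batches many ops per hypercall; one is shown here. */
        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);

        /* On success a response carrying the packet length is pushed onto
         * the RX ring and the guest is notified via the event channel.     */
        return op.status == GNTST_okay ? 0 : -EIO;
    }

In the architecture described in this section, this copy runs on the driver domain's CPU; Section 4.1 revisits where it should run.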

Table 1: Classes grouping Linux functions

  Class      Description
  driver     network device driver and netfront
  network    general network functions
  bridge     network bridge
  netfilter  network filter
  netback    netback
  mem        memory management
  interrupt  interrupt, softirq, & Xen events
  schedule   process scheduling & idle loop
  syscall    system call
  time       time functions
  dma        dma interface
  hypercall  call into Xen
  grant      issuing & revoking grants

Table 2: Classes grouping Xen functions

  Class      Description
  grant      grant map, unmap, or copy operation
  schedule   domain scheduling
  hypercall  hypercall handling
  time       time functions
  event      Xen events
  mem        memory
  interrupt  interrupt
  entry      enter/exit Xen (hypercall, interrupt, fault)
  traps      fault handling (also system call intercept)

3 Xen Networking Performance Analysis

This section presents a detailed performance analysis of Xen network I/O virtualization. Our analysis focuses on the receive path, which has higher virtualization overhead than the transmit path and has received less attention in the literature. We quantify the cost of processing network packets in Xen and the distribution of cost among the various components of the system software. Our analysis compares the Xen driver domain model, the Xen direct I/O model, and native Linux. The analysis provides insight into the main sources of I/O virtualization overhead and guides the design changes and optimizations we present in Section 4.

3.1 Experimental Setup

We run our experiments on two HP c-Class BL460c blade servers connected through a gigabit Cisco Catalyst Blade Switch 3020. Each server has two 3GHz Intel Xeon 5160 CPUs (two dual-core CPUs) with 4MB of L2 cache each, 8GB of memory, and two Broadcom NetXtreme II BCM5785 gigabit Ethernet Network Interface Cards (NICs). Although each server had two NICs, only one was used in each experiment presented in this paper.

Table 3: Global function grouping

  Class      Description
  xen0       Xen functions in domain 0 context
  kernel0    kernel functions in domain 0
  grantcopy  data copy in grant code
  xen        Xen functions in guest context
  kernel     kernel functions in guest
  usercopy   copy from kernel to user buffer

To generate network traffic we used the netperf [1] UDP_STREAM microbenchmark. It would be difficult to separate the CPU costs on the transmit and receive paths using TCP, which generates ACK packets in the opposite direction of data packets. Therefore, we used unidirectional UDP instead of TCP traffic. Although all results are based on UDP traffic, we expect TCP to have similar behavior at the I/O virtualization level.

We used a recent xen-unstable distribution with paravirtualized Linux domains (i.e., a modified Linux kernel that does not require CPU virtualization support), using the linux-xen kernel. The system was configured with one guest domain and one privileged domain (domain 0), which was also used as the driver domain. For direct I/O evaluation the application was executed directly in domain 0. Both domain 0 and the guest domain were configured with 512MB of memory and a single virtual CPU each. The virtual CPUs of the guest and domain 0 were pinned to different cores on different CPU sockets.

We use OProfile [2][18] to determine the number of CPU cycles used in each Linux and Xen function when processing network packets. Given the large number of kernel and hypervisor functions, we group them into a small number of classes based on their high-level purpose, as described in Tables 1 and 2. In addition, when presenting overall results we group the functions into the global classes described in Table 3.
To provide safe direct I/O access to guests, an IOMMU [8] is required to prevent device DMA operations from accessing other guests' memory. However, we were not able to obtain a server with IOMMU support. Thus, our results for direct I/O are optimistic, since they do not include IOMMU overheads. Evaluations of IOMMU overheads are provided in [9][3].

3.2 Overall Cost of I/O Virtualization

Figure 2 compares the CPU cycles consumed for processing received UDP packets in Linux, in Xen with direct I/O access from the guest, and in Xen with the paravirtualized (PV) driver. The graph shows results for three different sizes of UDP messages: 52, 1500, and 48000 bytes.

Figure 2: CPU usage to receive UDP packets
Figure 3: Kernel CPU cost for direct I/O
Figure 4: Hypervisor CPU cost for direct I/O

A message of 52 bytes corresponds to a typical TCP ACK, while a message of 1500 bytes corresponds to the maximum Ethernet packet size. Using messages of 48000 bytes also generates maximum-size packets, but reduces the overhead of system calls at the kernel and application interface by delivering data in larger chunks. The experiments with maximum-size packets in Figure 2, and in the remaining results presented in this paper, were able to saturate the gigabit link. Experiments with 52-byte packets were not able to saturate the network link, as the receive CPU becomes the bottleneck. To avoid having an overloaded CPU with a high number of dropped packets, we throttled the sender rate for small packets to have the same packet rate as with the large packet sizes.

The results show that in the current Xen implementation, the PV driver model consumes significantly more CPU cycles to process received packets than Linux. Direct I/O has much better performance than the PV driver, but it still has significant overhead when compared to Linux, especially for small message sizes. In the rest of this paper we analyze the sources of I/O virtualization overheads in detail, and based on this analysis we propose several changes in the design and implementation of network I/O virtualization in Xen.

3.3 System Call Overhead

We start by looking at the overheads when running guests with direct I/O access. It is surprising that for small message sizes Xen with direct I/O uses twice the number of CPU cycles to process received packets compared to non-virtualized Linux. This is a consequence of memory protection limitations of the 64-bit x86 architecture that are not present in the 32-bit x86 architecture. For paravirtualized guests, the entire hypervisor memory is mapped into a small region at the end of every guest virtual address space. In the 32-bit x86 architecture this memory is protected from guest accesses using segments. Segments containing hypervisor memory are only accessible in privilege level 0. The 3 additional privilege levels (rings) in the x86 architecture are available to the guest and enable the kernel to protect its memory from user-level code. The 64-bit x86 architecture does not have support for segmentation, forcing Xen to rely on page table protection mechanisms to protect its memory from guest accesses. However, page table entries use only a single bit to distinguish supervisor and user mode, effectively limiting the processor to just two privilege levels. Since Xen uses one privilege level to protect its memory, kernel and user-level applications have to be executed at the same privilege level and therefore cannot share the same set of page tables. This requires Xen to intercept every system call to switch between user and kernel level page tables. In addition to the overhead of the interception, TLB flushes associated with the page table switches reduce the performance of the kernel and application code.
TLB flushes have negligible effect on hypervisor performance, since Xen pages use global mappings and are not flushed from the TLB on system calls. Figure 3 shows the cost of TLB flushes on the guest kernel. The graph for Xen-TLB shows the performance improvement when the TLB flush on system call interception is disabled. This is not a safe optimization, as it exposes guest kernel memory to user-level processes, but it is used here for performance evaluation purposes. The results in Figure 3 show that the cost of TLB flushes is approximately 30% of the Linux cost when we have one system call per packet (message = 1500 bytes). In contrast, with large message sizes of 48000 bytes the overhead due to TLB flushes is negligible.

Figure 4 shows the cost of the Xen code used to intercept system calls. The cost difference between the two message sizes corresponds approximately to the system call interception cost. Note that the cost of intercepting system calls and of TLB flushes affects guests using the PV driver in the same way as it affects guests using direct I/O. We note that system call interception is an artifact of current hardware limitations which will be eliminated with time. CPU hardware support for virtualization [2][5] provides different partitions for hypervisor and guests and eliminates the need for system call interception. Over time, as the performance of hardware virtualization mechanisms improves, it is expected that paravirtualized guests will also use this mechanism (currently used only by fully virtualized guests) and eliminate the need for system call interception. Given that the overhead for system call interception is temporary and not specific to I/O virtualization, we avoid its effect in the remaining results presented in this paper. In the following graphs we only present results for large message sizes (48000 bytes), where the overhead due to system call interception is negligible.

3.4 Analysis of Direct I/O Performance

Ignoring system call interception, we observe that the kernel CPU cost for Xen with direct I/O is similar to Linux, as illustrated by the large message results in Figure 3. Xen with direct I/O consumes approximately 900 more CPU cycles per packet than native Linux (i.e., 31% overhead) on the receive path. Of these 900 cycles, 250 cycles are due to paravirtualization changes in the kernel code. In particular, direct I/O in Xen uses more CPU cycles than Linux in DMA functions. This is due to the use of a different implementation of the Linux DMA interface. The DMA interface is a kernel service used by device drivers to translate a virtual address to a bus address used in device DMA operations. Native Linux uses the default (pci-nommu.c) implementation, which simply returns the physical memory address associated with the virtual address. Paravirtualized Linux uses the software-emulated I/O TLB code (swiotlb.c), which implements the Linux bounce buffer mechanism. This is needed in Xen because guest I/O buffers spanning multiple pages may not be contiguous in physical memory. The I/O bounce buffer mechanism uses an intermediary contiguous buffer and copies the data to/from its original memory location after/before the DMA operation, in case the original buffer is not contiguous in physical memory. However, the I/O buffers used for regular Ethernet packets do not span multiple pages and thus are not copied into a bounce buffer. The different number of observed CPU cycles is due to the different logic and extra checks needed to verify whether the buffer is contiguous, and not due to an extra data copy. Of the total 900 cycles/packet overhead for direct I/O, the hypervisor accounts for 650 cycles/packet, as shown in Figure 7. Most of these cycles are in timer related functions, interrupt processing, and entering and exiting (entry) the hypervisor due to interrupts and hypercalls.

3.5 Analysis of PV Driver Performance

The Xen PV driver consumes significantly more CPU cycles than direct I/O, as shown in Figure 2. This is expected, since the Xen PV driver model runs an additional domain to host the device driver. Extra CPU cycles are consumed both by the driver domain kernel and by the hypervisor, which is now executed in the context of two domains compared to one domain with direct I/O.
Additional CPU cycles are consumed to copy I/O data across domains using the Xen grant mechanism. The end result is that the CPU cost to process received network packets for the Xen PV driver is approximately 18,200 cycles per packet, which corresponds to 4.5 times the cost of native Linux. However, as we show in this paper, implementation and design optimizations can significantly reduce this CPU cost. Of the total 18,200 CPU cycles consumed, approximately 1,700 cycles are for copying data from kernel to user buffer (usercopy), 4,300 for guest kernel functions (kernel), 900 for Xen functions in guest context (xen), 3,000 for copying data from the driver domain to the guest domain (grantcopy), 5,400 for driver domain kernel functions (kernel0), and 2,900 for Xen functions in driver domain context (xen0). In contrast, native Linux consumes approximately 4,000 CPU cycles for each packet, where 1,100 cycles are used to copy data from kernel to user space and 2,900 cycles are consumed in other kernel functions. In the following subsections we examine each component of the CPU cost for the Xen PV driver to identify the specific causes of high overhead.

3.5.1 Copy overhead

We note in Figure 2 that the two data copies (usercopy and grantcopy) in the Xen PV driver model consume a significantly higher number of CPU cycles than the single data copy in native Linux (usercopy). This is a consequence of using different memory address alignments for the source and destination buffers. Intel processor manuals [15] indicate that the processor is more efficient when copying data between memory locations that have the same 64-bit word alignment, and even more efficient when they have the same cache line alignment. But packets received from the network are not aligned, in order to align IP headers following the typical 14-byte Ethernet header.

Figure 5: Alignment effect on data copy cost
Figure 6: Kernel CPU cost
Figure 7: Hypervisor CPU cost
Figure 8: Driver domain CPU cost (kernel)

Netback copies these non-aligned packets to the beginning of the granted page, which by definition is aligned. In addition, since the packet now starts at a 64-bit word boundary in the guest, the packet payload will start at a non-word boundary in the destination buffer, due to the Ethernet header. This causes the second copy, from the kernel to the user buffer in the guest, to also be misaligned. The two unaligned data copies consume significantly more CPU cycles than aligned copies. To evaluate this overhead, we modified netback to copy the packet into the guest buffer with an offset that makes the destination of the copy have the same alignment as the source. This is possible because we use a Linux guest, which permits changing the socket buffer boundaries after it is allocated. Figure 5 shows the CPU cost of the grant copy for different copy alignments. The first bar shows the number of CPU cycles used to copy a 1500-byte packet when the source and destination have different word alignments. The second bar shows the number of CPU cycles when source and destination have the same 64-bit word alignment but different cache line alignment (128 bytes), while the third bar shows the number of CPU cycles when source and destination have the same cache line alignment. These results show that proper alignment can reduce the cost of the copy by a factor of two.

3.5.2 Kernel overhead

Figure 6 compares the kernel cost for the Xen PV driver model and for the direct I/O model. The Xen PV driver uses significantly more CPU cycles than direct I/O, especially in device driver functions. While the device drivers are different (i.e., the physical device driver for direct I/O and netfront for the PV driver), we would expect the physical device driver in direct I/O to be more expensive, as it has to interact with a real device, while netfront in the PV driver model only accesses the I/O channels hosted in main memory. After careful investigation we determined that the main source of increased CPU cycles in the Xen PV driver is the use of fragments in guest socket buffers. Xen supports TCP Segmentation Offloading (TSO) [17] and Large Receive Offloading (LRO) [14][19] in its virtual interfaces, enabling the use of large packets spanning multiple pages for efficient packet processing. For this reason netfront posts full-page buffers which are used as fragments in socket buffers, even for normal-size Ethernet frames. Netfront copies the first 200 bytes of the first fragment into the main linear socket buffer area, since the network stack requires the packet headers in this area. Any remaining bytes past the first 200 are kept in the fragment.
Use of fragments thereby introduces copy overheads and socket buffer memory allocation overheads. To measure the cost of using fragments we modified netfront to avoid using fragments. This was

accomplished by pre-allocating socket buffers and posting those buffers directly into the I/O channel instead of posting fragment pages. This modification assumes packets will not use multiple pages and thus will not require fragments, since the posted buffers are regular socket buffers and not full-page fragments. Since our NIC does not support LRO [14] and is not configured with jumbo packets, all packets received from the external network use single-page buffers in our experiments. Of course this modification cannot be used for guest-to-guest communication, since a guest can always send large fragmented packets. The third bar in Figure 6 shows the performance results using the modified netfront. The result shows that using socket buffers with fragments is responsible for most of the additional kernel cycles in the PV driver case when compared to direct I/O. Without the overhead of fragments, the kernel CPU cost for the PV driver is very close to that of direct I/O, except for small variations in the distribution of cycles among the different kernel functions. In particular, netfront in the Xen PV driver model now has lower cost than the physical device driver in direct I/O, as expected. Also, since netfront does not access a physical device, it does not need to use the software I/O TLB implementation of the DMA interface used by direct I/O. On the other hand, PV drivers have the cost of issuing and revoking grants, which are not used with direct I/O. Most of this grant cost is due to the use of expensive atomic compare-and-swap instructions to revoke grant privileges in the guest grant table. This is necessary because Xen uses different bits of the same grant table field to store the status of the grant (in use or not by the driver domain), updated by Xen, and the grant access permission (enable or revoke grant access), updated by the issuing domain. The guest must revoke the grant and check that it is no longer in use by the driver domain using an atomic operation, to ensure that a driver domain does not race with grant revoking and keep an undesired reference to the page after the grant is revoked. The use of atomic compare-and-swap instructions to revoke a grant adds a significant number of CPU cycles to the guest kernel cost. If, however, these two different grant bits are stored in different words, we can ensure atomicity using less expensive operations such as memory barriers. The results with implementation optimizations presented later in Section 4 include this and other grant optimizations discussed in the following section.

3.5.3 Hypervisor overhead

I/O processing consumes CPU cycles executing hypervisor functions in addition to guest kernel code. Figure 7 shows the CPU cost in Xen hypervisor functions for receiving network packets with the PV driver. The graph shows the CPU cost when executing Xen functions in the context of the guest and in the context of the driver domain (i.e., domain 0), and compares them with the CPU cost for direct I/O. The graph shows that most of the CPU cost for the hypervisor is due to code executing in the context of the driver domain, and that the larger cost components are due to grant operations and schedule functions. The hypervisor CPU cost in guest context for the PV driver is similar to the hypervisor cost for direct I/O, except for a higher cost in schedule functions. The higher cost in scheduling functions for the PV driver is due to increased cache misses when accessing Xen data structures for domain scheduling purposes.
With the PV driver, the guest and the driver domain run on different CPUs, causing domain data structures used by scheduling functions to bounce between the two CPUs. We identified the most expensive operations in the grant code by selectively removing code from the grant functions. They are the following: 1) acquiring and releasing spin locks, 2) pinning pages, and 3) the use of expensive atomic swap operations to update grant status, as previously described. All of these operations can be optimized. The number of spinlock operations can be significantly reduced by combining the operations of multiple grants into a single critical section. The use of atomic swap operations can be avoided by separating the grant fields into different words, as described in Section 3.5.2. Note that this optimization reduces overhead in Xen and in the guest, since both of them have to access the same grant field atomically. The cost of pinning pages can also be optimized when using grants for data copy. During a grant copy operation, the hypervisor creates temporary mappings into the hypervisor address space for both the source and destination of the copy. The hypervisor also pins (i.e., increments a reference counter on) both pages to prevent the pages from being freed while the grant is active. However, usually one of the pages is already pinned and mapped in the address space of the current domain, which issued the grant operation hypercall. Thus we can avoid mapping and pinning the domain-local page and just pin the foreign page referred to by the grant. It turns out that pinning a page for writing in Xen is significantly more expensive than pinning it for reading, as it requires incrementing an additional reference counter using an expensive atomic instruction. Therefore, this optimization has a higher performance impact when the grant is used to copy data from a granted page to a local page (as we propose below in Section 4.1) instead of the other way around.
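As an illustration of the word-splitting optimization just described, the sketch below contrasts a revoke on a single shared flags word, which needs an atomic compare-and-swap to avoid racing with Xen setting the in-use bit, with a hypothetical layout in which the guest-written permission and the Xen-written status live in separate words and a memory barrier suffices. The structure layouts and flag names are illustrative stand-ins, not the actual Xen grant table definitions.

    /* Illustrative only: layouts and flag names are hypothetical stand-ins
     * for the single-word vs. split-word grant entries discussed above.   */
    #include <stdint.h>

    #define GTF_PERMIT_ACCESS 0x1   /* written by the granting guest          */
    #define GTF_IN_USE        0x2   /* written by Xen while the grant is used */

    /* Current scheme: permission and status share one word, so the guest
     * must clear its bit with a compare-and-swap that fails if Xen has
     * started using the grant in the meantime.                            */
    struct grant_entry_shared {
        volatile uint16_t flags;
        uint16_t domid;
        uint32_t frame;
    };

    static int revoke_shared(struct grant_entry_shared *g)
    {
        uint16_t old = g->flags;
        if (old & GTF_IN_USE)
            return 0;                     /* still in use: cannot revoke   */
        return __sync_bool_compare_and_swap(&g->flags, old,
                                            old & ~GTF_PERMIT_ACCESS);
    }

    /* Proposed scheme: the two writers get separate words, so an ordinary
     * store ordered by a memory barrier replaces the atomic swap.         */
    struct grant_entry_split {
        volatile uint16_t access;         /* guest-written permission bit  */
        volatile uint16_t status;         /* Xen-written in-use bit        */
        uint16_t domid;
        uint32_t frame;
    };

    static int revoke_split(struct grant_entry_split *g)
    {
        g->access = 0;                    /* withdraw permission           */
        __sync_synchronize();             /* order the store before check  */
        return !(g->status & GTF_IN_USE); /* revoked iff no longer in use  */
    }

Both writers benefit from the split, since neither the guest revoking access nor Xen updating the in-use status has to contend on a single atomically updated word.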

3.5.4 Driver domain overhead

Figure 8 shows the CPU cycles consumed by the driver domain kernel (second bar; domain 0) and compares them with the kernel cost in native Linux (first bar). The results show that the kernel cost in the driver domain is almost twice the cost of the Linux kernel. This is somewhat surprising, since in the driver domain the packet is only processed by the lower level of the network stack (Ethernet), although it is handled by two device drivers: the native device driver and netback. We observe that a large number of the CPU cycles in the driver domain kernel are due to bridge and netfilter functions. Note that although the kernel has netfilter support enabled, no netfilter rule or filter is used in our experiments. The cost shown in the graph is the cost of the netfilter hooks in the bridge code, which are always executed to test whether a filter needs to be applied. The third bar in the figure shows the performance of a driver domain whose kernel is compiled with the bridge netfilter disabled. The results show that most of the bridge cost and all of the netfilter cost can be eliminated if the kernel is configured appropriately when netfilter rules are not needed.

4 Xen Network Design Changes

In the previous section we identified several sources of overhead for Xen PV network drivers. In this section we propose architectural changes to the Xen PV driver model that significantly improve performance. These changes modify the behavior of netfront and netback, and the I/O channel protocol. Some inefficiencies identified in Section 3 can be reduced through implementation optimizations that do not constitute architectural or protocol changes. Although some of the implementation optimizations are easier to implement in the new architecture, we evaluated their performance impact in the current architecture in order to separate their performance benefits from those of the architectural changes. The implementation optimizations include: disabling the bridge netfilter, using aligned data copies, avoiding socket buffer fragments, and the various grant optimizations discussed in Sections 3.5.2 and 3.5.3. The second bar in Figure 9 shows the cumulative performance impact of all these implementation optimizations and is used as a reference point for the performance improvements of the architectural changes presented in this section. In summary, the implementation optimizations reduce the CPU cost of the Xen PV driver by approximately 4,950 CPU cycles per packet. Of these, 1,550 cycles are due to disabling the bridge netfilter, 1,850 cycles due to using cache-aligned data copies, 900 cycles due to avoiding socket buffer fragments, and 650 cycles due to grant optimizations.

4.1 Move Data Copy to Guest

As described in Section 3, a primary source of overhead for the Xen PV driver is the additional data copy between the driver domain and the guest. In native Linux there is only one data copy, as the received packet is placed directly into a kernel socket buffer by the NIC and later copied from there to the application buffer. In the Xen PV driver model there are two data copies, as the received packet is first placed in kernel memory of the driver domain by the NIC, and then copied to kernel memory of the guest before it can be delivered to the application. This extra cost could be avoided if we could transfer ownership of the page containing the received packet from the driver domain to the guest, instead of copying the data. In fact this was the original approach used in previous versions of Xen [13].
The first versions of the Xen PV network driver used a page flipping mechanism which swapped the page containing the received packet with a free guest page, avoiding the data copy. The original page flip mechanism was replaced by the data copy mechanism in later versions of Xen for performance reasons. The cost of mapping and unmapping pages in both the guest and the driver domain was equivalent to the copy cost for large 1500-byte packets, which means that page flipping was less efficient than copying for small packet sizes [25]. In addition, the page flip mechanism increases memory fragmentation and prevents the use of super-page mappings with Xen. Menon et al. [17] have shown that super-pages provide superior performance in Xen, making page flipping unattractive. One problem with the current data copy mechanism is that the two data copies per packet are usually performed by different CPUs, leading to poor data cache behavior. On an SMP machine, it is expected that an I/O intensive guest and the driver domain will be executing on different CPUs, especially if there is high I/O demand. The overhead introduced by the extra data copy can be reduced if both copies are performed by the same CPU and benefit from cache locality. The CPU will bring the packet into its cache during the first data copy. If the data is still present in the CPU cache during the second copy, this copy will be significantly faster, using fewer CPU cycles. Of course there is no guarantee that the data will not be evicted from the cache before the second copy, which can be delayed arbitrarily depending on the application and overall system workload behavior. At high I/O rates, however, it is expected that the data will be delivered to the application as soon as it is received and will benefit from L2 cache locality. We modified the architecture of the Xen PV driver and moved the grant copy operation from the driver domain to the guest domain, improving cache locality for the second data copy. In the new design, when a packet is

received, netback issues a grant for the packet page to the guest and notifies the guest of the packet arrival through the I/O channel. When netfront receives the I/O channel notification, it issues a grant copy operation to copy the packet from a driver domain page to a local socket buffer, and then delivers the packet to the kernel network stack. Moving the grant copy to the guest has benefits beyond speeding up the second data copy. It avoids polluting the cache of the driver domain CPU with data that will not be used there, and thus should improve cache behavior in the driver domain as well. It also provides better CPU usage accounting. The CPU cycles used to copy the packet are now accounted to the guest instead of to the driver domain, increasing fairness when accounting for CPU usage in Xen scheduling. Another benefit is improved scalability for multiple guests. For high speed networks (e.g., 10GigE), the driver domain can become the bottleneck and reduce I/O throughput. Offloading some of the CPU cycles to the guests' CPUs prevents the driver domain from becoming a bottleneck, improving I/O scalability. Finally, some implementation optimizations described earlier are easier to implement when the copy is done by the guest. For example, it is easier to avoid the extra socket buffer fragment discussed in Section 3.5.2. Moving the copy to the guest allows the guest to allocate the buffer after the packet notification is received from netback at netfront. Having knowledge of the received packet allows netfront to allocate the appropriate buffers with the right sizes and alignments. Netfront can allocate one socket buffer for the first page of each packet and additional fragment pages only if the packet spans multiple pages. For the same reason, it is also easier to make aligned data copies, as the right socket buffer alignment can be selected at buffer allocation time.

Figure 9: Xen PV driver optimizations

The third bar in Figure 9 shows the performance benefit of moving the grant copy from the driver domain to the guest. The cost of the data copy is reduced because of better use of the L2 cache. In addition, the number of cycles consumed by Xen in guest context (xen) increases while it decreases for Xen in driver domain context (xen0). This is because the cycles used by the grant operation are shifted from the driver domain to the guest. We observe that the cost decrease in xen0 is higher than the cost increase in xen, leading to an overall reduction in the CPU cost of Xen. This is because the grant optimizations described in Section 3.5.3 are more effective when the grant operations are performed by the guest, as previously discussed. In summary, moving the copy from the driver domain to the guest reduces the CPU cost of the Xen PV driver by approximately 2,250 cycles per packet. Of these, 1,400 cycles are due to better cache locality for guest copies (usercopy, grantcopy, and also kernel, due to faster accesses to packet headers in protocol processing), 600 cycles are due to the grant optimizations being more effective (xen + xen0), and 250 cycles are due to less cache pollution in the driver domain (kernel0).
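A compact sketch of this reversed copy direction follows, reusing the grant-copy structure from the earlier netback sketch: after the backend's notification tells netfront the packet length and its offset in the granted page, netfront allocates a local socket buffer, shifts its data pointer so the destination shares the source's cache-line alignment, and issues the copy itself. The device pointer netfront_dev, the 128-byte alignment constant, and the helper name are assumptions for illustration.

    /* Guest-side copy sketch: netfront copies the received packet out of
     * the page granted by the driver domain into a local skb whose data
     * pointer is offset to match the source's cache-line alignment.       */
    #define RX_COPY_ALIGN 128                    /* assumed cache line size */

    extern struct net_device *netfront_dev;      /* assumed per-device state */

    static struct sk_buff *rx_copy_from_backend(domid_t backend,
                                                grant_ref_t gref,
                                                uint16_t src_offset,
                                                uint16_t len)
    {
        struct sk_buff *skb;
        struct gnttab_copy op;

        skb = netdev_alloc_skb(netfront_dev, len + RX_COPY_ALIGN);
        if (!skb)
            return NULL;
        /* Make skb->data congruent to src_offset modulo the cache line. */
        skb_reserve(skb, (src_offset - (unsigned long)skb->data)
                             & (RX_COPY_ALIGN - 1));

        memset(&op, 0, sizeof(op));
        op.source.u.ref  = gref;            /* backend page with the packet   */
        op.source.domid  = backend;
        op.source.offset = src_offset;
        op.dest.u.gmfn   = virt_to_mfn(skb->data);
        op.dest.domid    = DOMID_SELF;
        op.dest.offset   = offset_in_page(skb->data);
        op.len           = len;             /* must not cross a page boundary */
        op.flags         = GNTCOPY_source_gref;

        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);
        if (op.status != GNTST_okay) {
            dev_kfree_skb(skb);
            return NULL;
        }
        skb_put(skb, len);
        return skb;                         /* handed to the network stack    */
    }

Compared with the earlier backend-side sketch, only the copy direction flag and the choice of destination offset change; the hypercall and structures are the same.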
4.2 Extending Grant Mechanism

Moving the grant copy to the guest requires a couple of extensions to the Xen grant mechanism. We have not yet implemented these extensions, but we discuss them here. To support the copy in the receiving guest, the driver domain has to issue a grant to the memory page containing the received packet. This works fine for packets received on the physical device, since they are placed in driver domain memory. In contrast, the memory buffers with packets received from other guests are not owned by the driver domain and cannot be granted to the receiving guest using the current Xen grant mechanism. Thus, this mechanism needs to be extended to provide grant transitivity, allowing a domain that was granted access to another domain's page to transfer this right to a third domain. We must also ensure memory isolation and prevent guests from accessing packets destined to other guests. The main problem is that the same I/O buffer can be reused to receive new packets destined to different guests. When a small packet is received into a buffer previously used to receive a large packet for another domain, the buffer may still contain data from the old packet. Scrubbing the I/O pages after or before every I/O to remove old data would have a high overhead, wiping out the benefits of moving the copy to the guest. Instead, we can extend the Xen grant copy mechanism with offset and size fields to constrain the receiving guest to access only the region of the granted page containing the received packet data.

4.3 Support for Multi-queue Devices

Although moving the data copy to the guest reduces the CPU cost, eliminating the extra copy altogether should provide even better performance.

Figure 10: Multi-queue device support

The extra copy can be avoided if the NIC can place received packets directly into the guest kernel buffer. This is only possible if the NIC can identify the destination guest for each packet and select a buffer from the respective guest's memory in which to place the packet. As discussed in Section 1, NICs are now becoming available that have multiple receive (RX) queues which can be used to directly place received packets in guest memory [1]. These NICs can demultiplex incoming traffic into the multiple RX queues based on the packet destination MAC address and VLAN IDs. Each individual RX queue can be dedicated to a particular guest and programmed with the guest's MAC address. If the buffer descriptors posted at each RX queue point to kernel buffers of the respective guest, the device can place incoming packets directly into guest buffers, avoiding the extra data copy.

Figure 10 illustrates how the Xen PV network driver model can be modified to support multi-queue devices. Netfront posts grants to I/O buffers for use by the multi-queue device driver using the I/O channel. For multi-queue devices the driver domain must validate that the page belongs to the (untrusted) guest and must pin the page for the duration of the I/O to prevent the page from being reassigned to the hypervisor or other guests. The grant map and unmap operations accomplish these tasks in addition to mapping the page in the driver domain. Mapping the page is needed for guest-to-guest traffic, which traverses the driver domain network stack (bridge). Experimental results not presented here due to space limitations show that the additional cost of mapping the page is small compared to the overall cost of the grant operation. Netfront allocates two different types of I/O buffers which are posted to the I/O channel: regular socket buffers with the right alignments required by the network stack, and full pages for use as fragments in non-linear socket buffers. Posting fragment pages is optional and only needed if the device can receive large packets spanning multiple pages, either because it is configured with jumbo frames or because it supports Large Receive Offload (LRO). Netback uses these grants to map the guest pages into driver domain address space buffers and keeps them in two pools for each guest, one for each type of page. These buffers are provided to the physical device driver on demand when the driver needs to post RX descriptors to the RX queue associated with the guest. This requires that the device driver use new kernel functions to request I/O buffers from guest memory, which can be accomplished by providing two new I/O buffer allocation functions in the driver domain kernel, vmq_alloc_skb() and vmq_alloc_page(). These are equivalent to the traditional Linux functions netdev_alloc_skb() and alloc_page(), except that they take an additional parameter specifying the RX queue for which the buffer is being allocated. These functions return a buffer from the respective pool of guest I/O buffers; a sketch of this interface is shown below. When a packet is received, the NIC consumes one (or more, in the case of LRO or jumbo frames) of the posted RX buffers and places the data directly into the corresponding guest memory.
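The sketch below gives one plausible shape for this interface; only the three function names come from the text, while the parameter lists, return conventions, and the refill loop are assumptions.

    /* Hypothetical prototypes for the proposed multi-queue buffer interface.
     * Each call names the RX queue (and therefore the guest) the buffer
     * belongs to, so netback can hand out guest-owned memory to the driver. */
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    #define RX_BUF_LEN 1536                 /* illustrative buffer size */

    /* Like netdev_alloc_skb(), but backed by the guest bound to 'queue'. */
    struct sk_buff *vmq_alloc_skb(struct net_device *dev, int queue,
                                  unsigned int length);

    /* Like alloc_page(), for fragment pages (jumbo frames or LRO). */
    struct page *vmq_alloc_page(struct net_device *dev, int queue);

    /* Like netif_rx(), but tells netback which RX queue (guest) the packet
     * arrived on, so the bridge can be bypassed (described next). */
    int vmq_netif_rx(struct sk_buff *skb, int queue);

    /* Hypothetical NIC-specific helper that writes an RX descriptor. */
    void post_rx_descriptor(struct net_device *dev, int queue,
                            struct sk_buff *skb);

    /* Example refill path in a multi-queue driver: keep a guest's RX queue
     * stocked with buffers drawn from that guest's posted grants. */
    static void refill_rx_queue(struct net_device *dev, int queue, int budget)
    {
        while (budget-- > 0) {
            struct sk_buff *skb = vmq_alloc_skb(dev, queue, RX_BUF_LEN);
            if (!skb)
                break;          /* guest has not posted enough buffers yet */
            post_rx_descriptor(dev, queue, skb);
        }
    }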
The device driver is notified by the NIC that a packet has arrived and then forwards the packet directly to netback. Since the NIC already demultiplexes packets, there is no need to use the bridge in the driver domain. The device driver can instead send the packet directly to netback using a new kernel function, vmq_netif_rx(). This function is equivalent to the traditional Linux netif_rx() typically used by network drivers to deliver received packets, except that the new function takes an additional parameter specifying which RX queue received the packet. Note that these new kernel functions are not specific to Xen but enable the use of multi-queue devices with any virtualization technology based on the Linux kernel, such as KVM [23]. These new functions only need to be used in new device drivers for modern devices that have multi-queue support. They extend the current interface between Linux and network devices, enabling the use of multi-queue devices for virtualization. This means that the same device driver for a multi-queue device can be used with Xen, KVM, or any other Linux-based virtualization technology. We observe that the optimization that moves the copy to the guest is still useful even when using multi-queue devices. One reason is that the number of guests may exceed the number of RX queues, causing some guests to share the same RX queue. In this case, guests that share a queue should still use the grant copy mechanism to copy packets from driver domain memory to the guest. Also, grant copy is still needed to deliver local guest-to-guest traffic. When a guest sends a packet to another local guest, the data needs to be copied from

one guest's memory to the other's, instead of being sent to the physical device. In these cases moving the copy to the guest still provides performance benefits. To support receiving packets from both the multi-queue device and other guests, netfront receives both types of packets on the RX I/O channel shown in Figure 10. Packets from other guests arrive with copy grants that are used by netfront, whereas packets from the device use pre-posted buffers. We have implemented a PV driver prototype with multi-queue support to evaluate its performance impact. However, we did not have a NIC with multi-queue support available for our prototype. We instead used a traditional single-queue NIC to estimate the benefits of a multi-queue device. We modified the device driver to dedicate the single RX queue of the NIC to the guest, and we modified netback to forward guest buffers posted by netfront to the physical device driver, so that the single-queue device could accurately emulate the behavior of a multi-queue device.

The fourth bar in Figure 9 shows the performance impact of using multi-queue devices. As expected, the cost of the grant copy is eliminated, as the packet is now placed directly in guest kernel memory. The number of CPU cycles in xen for guest context is also reduced, since there is no grant copy operation being performed. On the other hand, the driver domain has to use two grant operations per packet to map and unmap the guest page, as opposed to one operation for the grant copy, increasing the number of cycles in xen0 and reducing the benefits of removing the copy cost. However, the number of CPU cycles consumed in the driver domain kernel is also reduced when using multi-queue devices, for two reasons. First, packets are forwarded directly from the device driver to the guest, avoiding the forwarding costs in the bridge. Second, since netback is now involved in both allocating and deallocating socket buffer structures for the driver domain, it can avoid the costs of their allocation and deallocation. Instead, netback recycles the same set of socket buffer structures across multiple I/O operations. It only has to change their memory mappings for every new I/O, using the grant mechanism to make the socket buffers point to the physical pages containing the guest I/O buffers. Surprisingly, the simplifications in the driver domain have a higher impact on the CPU cost than the elimination of the extra data copy. In summary, direct placement of data in guest memory reduces the PV driver cost by 700 CPU cycles per packet, and simplified socket buffer allocation and simpler packet routing in the driver domain reduce the cost by an additional 1,150 cycles per packet. On the other hand, the higher cost of grant mapping compared to grant copying increases the cost by 650 cycles per packet, providing a net cost reduction of 1,200 CPU cycles per packet. However, the benefit of multi-queue devices can actually be larger when we avoid the costs associated with grants, as discussed in the next section.

Figure 11: Caching and reusing grants

4.4 Caching and Reusing Grants

As shown in Section 3.5.3, a large fraction of the Xen CPU cycles consumed during I/O operations is due to grant operation functions.
In this section we describe a grant reuse mechanism that can eliminate most of this cost. The number of grant operations performed in the driver domain can be reduced if we relax the memory isolation property slightly and allow the driver domain to keep guest I/O buffers mapped in its address space even after the I/O is completed. If the guest recycles I/O memory and reuses previously used I/O pages for new I/O operations, the cost of mapping the guest pages using the grant mechanism is amortized over multiple I/O operations. Fortunately, most operating systems tend to recycle I/O buffers. For example, the Linux slab allocator used to allocate socket buffers keeps previously used buffers in a cache, which is then used to allocate new I/O buffers. In practice, keeping I/O buffer mappings for longer times does not compromise the fault isolation properties of driver domains, as the driver domain still can only access the same set of I/O pages and no pages containing any other guest data or code. In order to reuse grants, the Xen PV driver needs to be modified as illustrated in Figure 11. Netback keeps a cache of currently mapped grants for every guest. On every RX buffer posted by the guest (when using a multi-queue device) and on every TX request, netback checks whether the granted page is already mapped in its address space, mapping it only if necessary. When the I/O completes, the mapping is not removed, allowing it to be reused in future I/O operations. It is important, though, to enable the
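A minimal sketch of the per-guest grant cache described above follows, assuming a hypothetical fixed-size table keyed by grant reference and simple wrappers (map_grant, unmap_grant) around the grant map/unmap hypercalls; the eviction policy and the guest-side invalidation path are left out.

    /* Sketch of netback's grant cache: map a granted guest page only the
     * first time it is seen and reuse the mapping on later I/Os.  Types
     * and helpers are illustrative; only the caching idea is from the text. */
    #include <linux/hash.h>
    #include <xen/grant_table.h>

    #define GRANT_CACHE_BITS 8
    #define GRANT_CACHE_SIZE (1 << GRANT_CACHE_BITS)

    struct grant_cache_entry {
        grant_ref_t    gref;        /* guest grant reference               */
        void          *vaddr;       /* driver-domain mapping, NULL if none */
        grant_handle_t handle;      /* needed to unmap later               */
    };

    struct guest_grant_cache {
        domid_t guest;
        struct grant_cache_entry slot[GRANT_CACHE_SIZE];
    };

    /* Hypothetical wrappers around the grant map/unmap hypercalls. */
    void *map_grant(domid_t guest, grant_ref_t gref, grant_handle_t *handle);
    void  unmap_grant(grant_handle_t handle, void *vaddr);

    /* Return a driver-domain address for the page behind 'gref', issuing a
     * grant map only on a cache miss (or when evicting a stale entry).    */
    static void *grant_cache_lookup(struct guest_grant_cache *cache,
                                    grant_ref_t gref)
    {
        struct grant_cache_entry *e =
            &cache->slot[hash_32(gref, GRANT_CACHE_BITS)];

        if (e->vaddr && e->gref == gref)
            return e->vaddr;              /* hit: reuse mapping, no hypercall */

        if (e->vaddr)
            unmap_grant(e->handle, e->vaddr);  /* evict the old mapping       */

        e->gref  = gref;
        e->vaddr = map_grant(cache->guest, gref, &e->handle);
        return e->vaddr;
    }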


More information

Utilizing the IOMMU scalably

Utilizing the IOMMU scalably Utilizing the IOMMU scalably Omer Peleg, Adam Morrison, Benjamin Serebrin, and Dan Tsafrir USENIX ATC 15 2017711456 Shin Seok Ha 1 Introduction What is an IOMMU? Provides the translation between IO addresses

More information

Virtualization with XEN. Trusted Computing CS599 Spring 2007 Arun Viswanathan University of Southern California

Virtualization with XEN. Trusted Computing CS599 Spring 2007 Arun Viswanathan University of Southern California Virtualization with XEN Trusted Computing CS599 Spring 2007 Arun Viswanathan University of Southern California A g e n d a Introduction Virtualization approaches Basic XEN Architecture Setting up XEN Bootstrapping

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Tolerating Malicious Drivers in Linux. Silas Boyd-Wickizer and Nickolai Zeldovich

Tolerating Malicious Drivers in Linux. Silas Boyd-Wickizer and Nickolai Zeldovich XXX Tolerating Malicious Drivers in Linux Silas Boyd-Wickizer and Nickolai Zeldovich How could a device driver be malicious? Today's device drivers are highly privileged Write kernel memory, allocate memory,...

More information

Keeping up with the hardware

Keeping up with the hardware Keeping up with the hardware Challenges in scaling I/O performance Jonathan Davies XenServer System Performance Lead XenServer Engineering, Citrix Cambridge, UK 18 Aug 2015 Jonathan Davies (Citrix) Keeping

More information

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018 Today s Papers Disco: Running Commodity Operating Systems on Scalable Multiprocessors, Edouard

More information

Part 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS

Part 1: Introduction to device drivers Part 2: Overview of research on device driver reliability Part 3: Device drivers research at ERTOS Some statistics 70% of OS code is in device s 3,448,000 out of 4,997,000 loc in Linux 2.6.27 A typical Linux laptop runs ~240,000 lines of kernel code, including ~72,000 loc in 36 different device s s

More information

CSC 5930/9010 Cloud S & P: Virtualization

CSC 5930/9010 Cloud S & P: Virtualization CSC 5930/9010 Cloud S & P: Virtualization Professor Henry Carter Fall 2016 Recap Network traffic can be encrypted at different layers depending on application needs TLS: transport layer IPsec: network

More information

The Price of Safety: Evaluating IOMMU Performance

The Price of Safety: Evaluating IOMMU Performance The Price of Safety: Evaluating IOMMU Performance Muli Ben-Yehuda 1 Jimi Xenidis 2 Michal Ostrowski 2 Karl Rister 3 Alexis Bruemmer 3 Leendert Van Doorn 4 1 muli@il.ibm.com 2 {jimix,mostrows}@watson.ibm.com

More information

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems Lecture 7 Xen and the Art of Virtualization Paul Braham, Boris Dragovic, Keir Fraser et al. Advanced Operating Systems 16 November, 2011 SOA/OS Lecture 7, Xen 1/38 Contents Virtualization Xen Memory CPU

More information

Software Routers: NetMap

Software Routers: NetMap Software Routers: NetMap Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and Networking October 8, 2014 Slides from the NetMap: A Novel Framework for

More information

A Novel Approach to Gain High Throughput and Low Latency through SR- IOV

A Novel Approach to Gain High Throughput and Low Latency through SR- IOV A Novel Approach to Gain High Throughput and Low Latency through SR- IOV Usha Devi G #1, Kasthuri Theja Peduru #2, Mallikarjuna Reddy B #3 School of Information Technology, VIT University, Vellore 632014,

More information

Virtual Machine Monitors (VMMs) are a hot topic in

Virtual Machine Monitors (VMMs) are a hot topic in CSE 120 Principles of Operating Systems Winter 2007 Lecture 16: Virtual Machine Monitors Keith Marzullo and Geoffrey M. Voelker Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot topic

More information

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System Abstract In this paper, we address the latency issue in RT- XEN virtual machines that are available in Xen 4.5. Despite the advantages of applying virtualization to systems, the default credit scheduler

More information

Linux and Xen. Andrea Sarro. andrea.sarro(at)quadrics.it. Linux Kernel Hacking Free Course IV Edition

Linux and Xen. Andrea Sarro. andrea.sarro(at)quadrics.it. Linux Kernel Hacking Free Course IV Edition Linux and Xen Andrea Sarro andrea.sarro(at)quadrics.it Linux Kernel Hacking Free Course IV Edition Andrea Sarro (andrea.sarro(at)quadrics.it) Linux and Xen 07/05/2008 1 / 37 Introduction Xen and Virtualization

More information

OS Virtualization. Why Virtualize? Introduction. Virtualization Basics 12/10/2012. Motivation. Types of Virtualization.

OS Virtualization. Why Virtualize? Introduction. Virtualization Basics 12/10/2012. Motivation. Types of Virtualization. Virtualization Basics Motivation OS Virtualization CSC 456 Final Presentation Brandon D. Shroyer Types of Virtualization Process virtualization (Java) System virtualization (classic, hosted) Emulation

More information

Enabling Efficient and Scalable Zero-Trust Security

Enabling Efficient and Scalable Zero-Trust Security WHITE PAPER Enabling Efficient and Scalable Zero-Trust Security FOR CLOUD DATA CENTERS WITH AGILIO SMARTNICS THE NEED FOR ZERO-TRUST SECURITY The rapid evolution of cloud-based data centers to support

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

The Convergence of Storage and Server Virtualization Solarflare Communications, Inc.

The Convergence of Storage and Server Virtualization Solarflare Communications, Inc. The Convergence of Storage and Server Virtualization 2007 Solarflare Communications, Inc. About Solarflare Communications Privately-held, fabless semiconductor company. Founded 2001 Top tier investors:

More information

references Virtualization services Topics Virtualization

references Virtualization services Topics Virtualization references Virtualization services Virtual machines Intel Virtualization technology IEEE xplorer, May 2005 Comparison of software and hardware techniques for x86 virtualization ASPLOS 2006 Memory resource

More information

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate

PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate NIC-PCIE-1SFP+-PLU PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate Flexibility and Scalability in Virtual

More information

Xen and the Art of Virtualization

Xen and the Art of Virtualization Xen and the Art of Virtualization Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield Presented by Thomas DuBuisson Outline Motivation

More information

QuickSpecs. HP Z 10GbE Dual Port Module. Models

QuickSpecs. HP Z 10GbE Dual Port Module. Models Overview Models Part Number: 1Ql49AA Introduction The is a 10GBASE-T adapter utilizing the Intel X722 MAC and X557-AT2 PHY pairing to deliver full line-rate performance, utilizing CAT 6A UTP cabling (or

More information

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD

Virtual Leverage: Server Consolidation in Open Source Environments. Margaret Lewis Commercial Software Strategist AMD Virtual Leverage: Server Consolidation in Open Source Environments Margaret Lewis Commercial Software Strategist AMD What Is Virtualization? Abstraction of Hardware Components Virtual Memory Virtual Volume

More information

Large Receive Offload implementation in Neterion 10GbE Ethernet driver

Large Receive Offload implementation in Neterion 10GbE Ethernet driver Large Receive Offload implementation in Neterion 10GbE Ethernet driver Leonid Grossman Neterion, Inc. leonid@neterion.com Abstract 1 Introduction The benefits of TSO (Transmit Side Offload) implementation

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班

I/O virtualization. Jiang, Yunhong Yang, Xiaowei Software and Service Group 2009 虚拟化技术全国高校师资研讨班 I/O virtualization Jiang, Yunhong Yang, Xiaowei 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008

vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008 vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008 Virtual Datacenter OS from VMware Infrastructure vservices and Cloud vservices Existing New - roadmap Virtual Datacenter OS from VMware Agenda

More information

Intel Virtualization Technology Roadmap and VT-d Support in Xen

Intel Virtualization Technology Roadmap and VT-d Support in Xen Intel Virtualization Technology Roadmap and VT-d Support in Xen Jun Nakajima Intel Open Source Technology Center Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.

More information

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization X86 operating systems are designed to run directly on the bare-metal hardware,

More information

Protection Strategies for Direct Access to Virtualized I/O Devices

Protection Strategies for Direct Access to Virtualized I/O Devices Protection Strategies for Direct Access to Virtualized I/O Devices Paul Willmann, Scott Rixner, and Alan L. Cox Rice University {willmann, rixner, alc}@rice.edu Abstract Commodity virtual machine monitors

More information

Introduction to Oracle VM (Xen) Networking

Introduction to Oracle VM (Xen) Networking Introduction to Oracle VM (Xen) Networking Dongli Zhang Oracle Asia Research and Development Centers (Beijing) dongli.zhang@oracle.com May 30, 2017 Dongli Zhang (Oracle) Introduction to Oracle VM (Xen)

More information

Much Faster Networking

Much Faster Networking Much Faster Networking David Riddoch driddoch@solarflare.com Copyright 2016 Solarflare Communications, Inc. All rights reserved. What is kernel bypass? The standard receive path The standard receive path

More information

Xen and the Art of Virtualization. Nikola Gvozdiev Georgian Mihaila

Xen and the Art of Virtualization. Nikola Gvozdiev Georgian Mihaila Xen and the Art of Virtualization Nikola Gvozdiev Georgian Mihaila Outline Xen and the Art of Virtualization Ian Pratt et al. I. The Art of Virtualization II. Xen, goals and design III. Xen evaluation

More information

Disco: Running Commodity Operating Systems on Scalable Multiprocessors

Disco: Running Commodity Operating Systems on Scalable Multiprocessors Disco: Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine, Kinskuk Govil and Mendel Rosenblum Stanford University Presented by : Long Zhang Overiew Background

More information

Virtualization. Dr. Yingwu Zhu

Virtualization. Dr. Yingwu Zhu Virtualization Dr. Yingwu Zhu Virtualization Definition Framework or methodology of dividing the resources of a computer into multiple execution environments. Types Platform Virtualization: Simulate a

More information

Virtual Machines. Part 2: starting 19 years ago. Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

Virtual Machines. Part 2: starting 19 years ago. Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved. Virtual Machines Part 2: starting 19 years ago Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved. Operating Systems In Depth IX 2 Copyright 2018 Thomas W. Doeppner.

More information

Virtual Virtual Memory

Virtual Virtual Memory Virtual Virtual Memory Jason Power 3/20/2015 With contributions from Jayneel Gandhi and Lena Olson 4/17/2015 UNIVERSITY OF WISCONSIN 1 Virtual Machine History 1970 s: VMMs 1997: Disco 1999: VMWare (binary

More information

Dynamic Translator-Based Virtualization

Dynamic Translator-Based Virtualization Dynamic Translator-Based Virtualization Yuki Kinebuchi 1,HidenariKoshimae 1,ShuichiOikawa 2, and Tatsuo Nakajima 1 1 Department of Computer Science, Waseda University {yukikine, hide, tatsuo}@dcl.info.waseda.ac.jp

More information

Lecture 5: February 3

Lecture 5: February 3 CMPSCI 677 Operating Systems Spring 2014 Lecture 5: February 3 Lecturer: Prashant Shenoy Scribe: Aditya Sundarrajan 5.1 Virtualization Virtualization is a technique that extends or replaces an existing

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Enabling Fast, Dynamic Network Processing with ClickOS

Enabling Fast, Dynamic Network Processing with ClickOS Enabling Fast, Dynamic Network Processing with ClickOS Joao Martins*, Mohamed Ahmed*, Costin Raiciu, Roberto Bifulco*, Vladimir Olteanu, Michio Honda*, Felipe Huici* * NEC Labs Europe, Heidelberg, Germany

More information

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1 What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................

More information

System Virtual Machines

System Virtual Machines System Virtual Machines Outline Need and genesis of system Virtual Machines Basic concepts User Interface and Appearance State Management Resource Control Bare Metal and Hosted Virtual Machines Co-designed

More information

Got Loss? Get zovn! Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat. ACM SIGCOMM 2013, August, Hong Kong, China

Got Loss? Get zovn! Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat. ACM SIGCOMM 2013, August, Hong Kong, China Got Loss? Get zovn! Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat ACM SIGCOMM 2013, 12-16 August, Hong Kong, China Virtualized Server 1 Application Performance in Virtualized

More information

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand Introduction to Virtual Machines Nima Honarmand Virtual Machines & Hypervisors Virtual Machine: an abstraction of a complete compute environment through the combined virtualization of the processor, memory,

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

FAQ. Release rc2

FAQ. Release rc2 FAQ Release 19.02.0-rc2 January 15, 2019 CONTENTS 1 What does EAL: map_all_hugepages(): open failed: Permission denied Cannot init memory mean? 2 2 If I want to change the number of hugepages allocated,

More information

Virtualization. Starting Point: A Physical Machine. What is a Virtual Machine? Virtualization Properties. Types of Virtualization

Virtualization. Starting Point: A Physical Machine. What is a Virtual Machine? Virtualization Properties. Types of Virtualization Starting Point: A Physical Machine Virtualization Based on materials from: Introduction to Virtual Machines by Carl Waldspurger Understanding Intel Virtualization Technology (VT) by N. B. Sahgal and D.

More information

Virtualization. ! Physical Hardware Processors, memory, chipset, I/O devices, etc. Resources often grossly underutilized

Virtualization. ! Physical Hardware Processors, memory, chipset, I/O devices, etc. Resources often grossly underutilized Starting Point: A Physical Machine Virtualization Based on materials from: Introduction to Virtual Machines by Carl Waldspurger Understanding Intel Virtualization Technology (VT) by N. B. Sahgal and D.

More information

High Performance Packet Processing with FlexNIC

High Performance Packet Processing with FlexNIC High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet

More information

Server Virtualization Approaches

Server Virtualization Approaches Server Virtualization Approaches Virtual Machine Applications Emulation Replication Composition Emulation: Mix-and-match cross-platform portability Replication: Multiple VMs on single platform Composition:

More information

Virtualization. Pradipta De

Virtualization. Pradipta De Virtualization Pradipta De pradipta.de@sunykorea.ac.kr Today s Topic Virtualization Basics System Virtualization Techniques CSE506: Ext Filesystem 2 Virtualization? A virtual machine (VM) is an emulation

More information

W H I T E P A P E R. What s New in VMware vsphere 4: Performance Enhancements

W H I T E P A P E R. What s New in VMware vsphere 4: Performance Enhancements W H I T E P A P E R What s New in VMware vsphere 4: Performance Enhancements Scalability Enhancements...................................................... 3 CPU Enhancements............................................................

More information

CS-580K/480K Advanced Topics in Cloud Computing. VM Virtualization II

CS-580K/480K Advanced Topics in Cloud Computing. VM Virtualization II CS-580K/480K Advanced Topics in Cloud Computing VM Virtualization II 1 How to Build a Virtual Machine? 2 How to Run a Program Compiling Source Program Loading Instruction Instruction Instruction Instruction

More information

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks LINUX-KVM The need for KVM x86 originally virtualization unfriendly No hardware provisions Instructions behave differently depending on privilege context(popf) Performance suffered on trap-and-emulate

More information

RiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner

RiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner RiceNIC Prototyping Network Interfaces Jeffrey Shafer Scott Rixner RiceNIC Overview Gigabit Ethernet Network Interface Card RiceNIC - Prototyping Network Interfaces 2 RiceNIC Overview Reconfigurable and

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Knut Omang Ifi/Oracle 20 Oct, Introduction to virtualization (Virtual machines) Aspects of network virtualization:

Knut Omang Ifi/Oracle 20 Oct, Introduction to virtualization (Virtual machines) Aspects of network virtualization: Software and hardware support for Network Virtualization part 2 Knut Omang Ifi/Oracle 20 Oct, 2015 32 Overview Introduction to virtualization (Virtual machines) Aspects of network virtualization: Virtual

More information

Xen and the Art of Virtualization

Xen and the Art of Virtualization Xen and the Art of Virtualization Paul Barham,, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer,, Ian Pratt, Andrew Warfield University of Cambridge Computer Laboratory Presented

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

440GX Application Note

440GX Application Note Overview of TCP/IP Acceleration Hardware January 22, 2008 Introduction Modern interconnect technology offers Gigabit/second (Gb/s) speed that has shifted the bottleneck in communication from the physical

More information

RDMA-like VirtIO Network Device for Palacios Virtual Machines

RDMA-like VirtIO Network Device for Palacios Virtual Machines RDMA-like VirtIO Network Device for Palacios Virtual Machines Kevin Pedretti UNM ID: 101511969 CS-591 Special Topics in Virtualization May 10, 2012 Abstract This project developed an RDMA-like VirtIO network

More information

Virtualization and memory hierarchy

Virtualization and memory hierarchy Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

Operating Systems 4/27/2015

Operating Systems 4/27/2015 Virtualization inside the OS Operating Systems 24. Virtualization Memory virtualization Process feels like it has its own address space Created by MMU, configured by OS Storage virtualization Logical view

More information

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances

Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances Technology Brief Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances Intel PRO/1000 PT and PF Quad Port Bypass Server Adapters for In-line Server Appliances The world

More information

by Brian Hausauer, Chief Architect, NetEffect, Inc

by Brian Hausauer, Chief Architect, NetEffect, Inc iwarp Ethernet: Eliminating Overhead In Data Center Designs Latest extensions to Ethernet virtually eliminate the overhead associated with transport processing, intermediate buffer copies, and application

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Architecture and Performance Implications

Architecture and Performance Implications VMWARE WHITE PAPER VMware ESX Server 2 Architecture and Performance Implications ESX Server 2 is scalable, high-performance virtualization software that allows consolidation of multiple applications in

More information

Knut Omang Ifi/Oracle 6 Nov, 2017

Knut Omang Ifi/Oracle 6 Nov, 2017 Software and hardware support for Network Virtualization part 1 Knut Omang Ifi/Oracle 6 Nov, 2017 1 Motivation Goal: Introduction to challenges in providing fast networking to virtual machines Prerequisites:

More information

Data Path acceleration techniques in a NFV world

Data Path acceleration techniques in a NFV world Data Path acceleration techniques in a NFV world Mohanraj Venkatachalam, Purnendu Ghosh Abstract NFV is a revolutionary approach offering greater flexibility and scalability in the deployment of virtual

More information

Cloud Computing Virtualization

Cloud Computing Virtualization Cloud Computing Virtualization Anil Madhavapeddy anil@recoil.org Contents Virtualization. Layering and virtualization. Virtual machine monitor. Virtual machine. x86 support for virtualization. Full and

More information

A Userspace Packet Switch for Virtual Machines

A Userspace Packet Switch for Virtual Machines SHRINKING THE HYPERVISOR ONE SUBSYSTEM AT A TIME A Userspace Packet Switch for Virtual Machines Julian Stecklina OS Group, TU Dresden jsteckli@os.inf.tu-dresden.de VEE 2014, Salt Lake City 1 Motivation

More information

Video capture using GigE Vision with MIL. What is GigE Vision

Video capture using GigE Vision with MIL. What is GigE Vision What is GigE Vision GigE Vision is fundamentally a standard for transmitting video from a camera (see Figure 1) or similar device over Ethernet and is primarily intended for industrial imaging applications.

More information

Optimizing Performance: Intel Network Adapters User Guide

Optimizing Performance: Intel Network Adapters User Guide Optimizing Performance: Intel Network Adapters User Guide Network Optimization Types When optimizing network adapter parameters (NIC), the user typically considers one of the following three conditions

More information

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

Chapter 5 C. Virtual machines

Chapter 5 C. Virtual machines Chapter 5 C Virtual machines Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple guests Avoids security and reliability problems Aids sharing

More information

An Intelligent NIC Design Xin Song

An Intelligent NIC Design Xin Song 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 What is an Operating System? What is

More information

Distributed Systems Operation System Support

Distributed Systems Operation System Support Hajussüsteemid MTAT.08.009 Distributed Systems Operation System Support slides are adopted from: lecture: Operating System(OS) support (years 2016, 2017) book: Distributed Systems: Concepts and Design,

More information