RDMA-like VirtIO Network Device for Palacios Virtual Machines


Kevin Pedretti
UNM ID:
CS-591 Special Topics in Virtualization
May 10, 2012

Abstract

This project developed an RDMA-like VirtIO network device for the Palacios virtual machine monitor that enables virtual machines running on the same host system to communicate with one another directly, without any intermediate memory copies. The project required gaining a deep understanding of the VirtIO interface, creating a Portals4-light VirtIO device in Palacios, adding VirtIO support to the Kitten operating system (which was used as the guest), and performing a simple benchmark analysis. The foundation created by this project will be used in future work to create a complete and high-performance virtual Portals4 network device for Palacios, as part of a funded DOE/ASCR X-stack project.

1 Introduction

Virtualization is potentially useful for high-performance computing workloads [1], but only if virtual machines have access to near-native network performance. Direct device pass-through can provide that, but it has two downsides: only one virtual machine (VM) can use a pass-through device at a time, and pass-through greatly complicates VM checkpointing and migration. Virtualized network interfaces and networks avoid these problems, but typically use interfaces that are inappropriate for HPC (e.g., TCP/IP over Ethernet). This project takes a step toward implementing an HPC-appropriate virtual network interface.

The original project plan was to implement a virtual Portals network device, but time constraints required the project scope to be scaled back. Instead, a simple RDMA-like VirtIO network device was developed for the Palacios [2] virtual machine monitor (VMM) that allows VMs running on the same host to communicate with one another directly, without any intermediate memory copies.

The remainder of this report is organized as follows. Section 2 describes the overall software architecture that was developed.
Section 3 discusses the implementation and several interesting observations made along the way. Section 4 presents some initial performance results. Finally, Section 5 concludes the report.

2 Overall Architecture

Figure 1 provides an overview of the overall system. Two virtual machines are shown running on a Linux host. The virtual machine on the left is sending a buffer (labeled "TX Buffer") to a buffer in the virtual machine on the right (labeled "RX Buffer"). To do this, the sending VM must generate a Portals Put command describing the send buffer, the target buffer, and the target Portals process ID (i.e., the Portals ID of the right VM). Once the command is constructed, the Portals driver (labeled "P4 Driver") writes the command to the virtual Portals network interface (labeled "VirtIO P4 NIC"). This is accomplished via a VirtIO ring [3].

Figure 1: Block diagram of system. (Labels: Kitten Guest (x2), P4 Put Cmd Buf, P4 Driver, TX Buffer, RX Buffer, Desc, Avail, Used, VirtIO P4 NIC (x2), NID_MAP[ ], Palacios, Host Linux Kernel.)

A VirtIO ring consists of three parts: 1) a descriptor array (labeled "Desc"), 2) an available descriptor ring (labeled "Avail"), and 3) a used descriptor ring (labeled "Used"). The Portals driver allocates a descriptor from the descriptor array, sets its address and length fields to map the Portals command buffer, adds the index of the descriptor to the available descriptor ring, kicks the virtual Portals network interface by issuing an I/O port write, then polls the used descriptor ring until the index of the command descriptor appears.

When the virtual Portals network interface receives the Put command, it looks up the target node via the global node ID map (labeled "NID_MAP"). Assuming the node is valid, it then uses the target VM's state structure to convert the guest physical address of the receive buffer, passed in by the sending VM, to the appropriate host kernel virtual address. Similarly, the guest physical address of the transmit buffer must be converted to a host kernel virtual address using the sending VM's state structure. Once both addresses are known, the virtual Portals NIC uses memcpy() to copy the TX buffer to the RX buffer.

The receiving VM can poll on the RX buffer to notice when the transfer is complete. The virtual Portals NIC takes care to ensure that the last byte of the transfer is truly written last, so polling on the last byte of the buffer is sufficient. A full Portals implementation would include event queues, which the receiver could wait on to detect completion.

3 Implementation

All source code developed for this project is available on the class Wiki, at ssl/doku.php/bridges:classes:cs591:portals4. The majority of the development effort went into understanding the VirtIO queue and ring interfaces, and adding VirtIO driver support to the Kitten [4] lightweight kernel. Since the Portals VirtIO device is exposed as a PCI device, code had to be added to Kitten to enumerate the PCI bus and read each device's base address registers.
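The ring handshake described in Section 2 (allocate a descriptor, publish its index on the available ring, kick, then poll the used ring) can be sketched in C. This is a minimal single-threaded model, not the project's driver code: the field layout follows the common VirtIO split-ring layout, while RING_SIZE, the helper names (vring_init, driver_post, device_service, driver_poll) and the trivial descriptor allocator are assumptions made for illustration. A real driver also needs memory barriers and an I/O port write for the kick, which are only marked by comments here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8  /* illustrative queue size (assumption) */

/* One entry of the descriptor array: maps one guest buffer. */
struct desc {
    uint64_t addr;   /* guest physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;
    uint16_t next;
};

/* Indices of descriptors the driver has handed to the device. */
struct avail_ring {
    uint16_t flags;
    uint16_t idx;                 /* free-running producer index */
    uint16_t ring[RING_SIZE];
};

struct used_elem { uint32_t id; uint32_t len; };

/* Indices of descriptors the device has finished with. */
struct used_ring {
    uint16_t flags;
    uint16_t idx;                 /* free-running producer index */
    struct used_elem ring[RING_SIZE];
};

struct vring {
    struct desc       desc[RING_SIZE];
    struct avail_ring avail;
    struct used_ring  used;
    uint16_t next_free;   /* hypothetical trivial descriptor allocator */
    uint16_t last_used;   /* driver's consumer index into the used ring */
};

void vring_init(struct vring *vr) { memset(vr, 0, sizeof(*vr)); }

/* Driver side: map a command buffer and publish it to the device. */
uint16_t driver_post(struct vring *vr, uint64_t cmd_gpa, uint32_t cmd_len)
{
    uint16_t d = vr->next_free;

    assert(cmd_len > 0);
    vr->next_free = (uint16_t)((vr->next_free + 1) % RING_SIZE);
    vr->desc[d].addr  = cmd_gpa;
    vr->desc[d].len   = cmd_len;
    vr->desc[d].flags = 0;
    vr->desc[d].next  = 0;
    vr->avail.ring[vr->avail.idx % RING_SIZE] = d;
    /* A real driver needs a write barrier before bumping idx. */
    vr->avail.idx++;
    /* A real driver would now kick the device with an I/O port write. */
    return d;
}

/* Device side: consume every newly available descriptor and mark it used. */
int device_service(struct vring *vr, uint16_t *last_avail)
{
    int handled = 0;

    while (*last_avail != vr->avail.idx) {
        uint16_t d = vr->avail.ring[*last_avail % RING_SIZE];
        struct used_elem *u = &vr->used.ring[vr->used.idx % RING_SIZE];

        u->id  = d;
        u->len = vr->desc[d].len;
        vr->used.idx++;
        (*last_avail)++;
        handled++;
    }
    return handled;
}

/* Driver side: one polling pass; returns 1 once descriptor `want` is used. */
int driver_poll(struct vring *vr, uint16_t want)
{
    while (vr->last_used != vr->used.idx) {
        uint32_t id = vr->used.ring[vr->last_used % RING_SIZE].id;

        vr->last_used++;
        if (id == want)
            return 1;
    }
    return 0;
}
```

In this model the guest would call driver_post() and then call driver_poll() repeatedly until it returns 1, while the VMM side runs device_service() in response to the kick.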
For Palacios, the most important file to look at is lnx_virtio_portals4.c in 0001-VirtIO-Portals4.patch. This file implements the VMM side of the VirtIO Portals device. Note that "Portals4" is a misnomer here: this is not really a Portals4 device, but rather a simple shmem_putmem()-like mechanism for copying data between VMs on the same node. The APIs implemented are:

    int PtlNIInit(unsigned int pid);
    int PtlGetId(unsigned int *nid, unsigned int *pid);
    int PtlPut(void *local_addr, void *target_addr, size_t length,
               unsigned int target_nid, unsigned int target_pid);

The PtlNIInit() call initializes the virtual Portals NIC. This adds the calling VM's state structure to the Portals NID_MAP so other VMs can find it. The PtlGetId() call returns the caller's Portals Node ID and Portals Process ID. The PtlPut() call sends a buffer in the caller's address space to a buffer in a remote (target) address space (e.g., a process in another VM running on the same node).

For Kitten, the most important file to look at is virtio_portals4.c in 003-p4drv.patch. This file implements a guest-side driver for the VirtIO Portals device. The driver initializes the device, then runs a simple Ping-Pong test between two VMs running on the same node. Results are presented in Section 4.

4 Results

Performance of the VirtIO Portals network interface was evaluated using a Ping-Pong benchmark. The Ping-Pong is performed between two hard-coded Portals NID/PID (Node ID / Process ID) pairs. Each guest has its own config file, which defines a LNX_VIRTIO_PORTALS4 device with a hard-coded Portals NID. Node 4 is the pinger and Node 5 is the ponger. When each guest starts, its Portals driver calls PtlNIInit() with the Portals Process ID that it wants to be associated with; this is hard-coded to 16 for both nodes. Then, Portals NID/PID 4.16 sends a ping to 5.16, which is polling on memory. When it notices that the message has arrived, it sends a pong message of the same size back to 4.16, which also polls on memory to notice completion. The round-trip time is measured by [...]. The benchmark tests message sizes from 1 byte to 4 MB in power-of-two increments.

Bandwidth results are shown in Figure 2. The "VirtIO P4" line is the performance of the virtual Portals NIC when moving data between two virtual machines running on the same node.
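The VMM-side handling of a Put, as described in Section 2 and in the PtlNIInit() description above, can be sketched as follows. This is a hedged model rather than the code in the patch: struct vm_state, nid_map, ni_init(), gpa_to_hva() and handle_put() are hypothetical names, guest memory is modeled as a flat host array, and the Put takes guest physical addresses directly. It shows the three steps the text names: node lookup, guest-physical to host-virtual translation for both buffers, and a copy that stores the final byte last so a receiver polling on that byte sees a complete message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NID 16  /* illustrative size of the node ID map (assumption) */

/* Hypothetical stand-in for a VM's state structure. */
struct vm_state {
    uint8_t *mem;       /* host mapping of guest physical memory */
    size_t   mem_size;  /* size of that mapping in bytes */
    unsigned pid;       /* Portals process ID registered by the guest */
};

static struct vm_state *nid_map[MAX_NID];

/* Models PtlNIInit(): register the calling VM so other VMs can find it. */
int ni_init(unsigned nid, unsigned pid, struct vm_state *vm)
{
    assert(vm != NULL);
    if (nid >= MAX_NID || nid_map[nid] != NULL)
        return -1;
    vm->pid = pid;
    nid_map[nid] = vm;
    return 0;
}

/* Translate a guest physical range to a host pointer, or NULL if invalid. */
static uint8_t *gpa_to_hva(struct vm_state *vm, uint64_t gpa, size_t len)
{
    if (gpa > vm->mem_size || len > vm->mem_size - gpa)
        return NULL;
    return vm->mem + gpa;
}

/* Models the device's handling of a Put command from VM `src`. */
int handle_put(struct vm_state *src, uint64_t local_gpa, uint64_t target_gpa,
               size_t len, unsigned target_nid, unsigned target_pid)
{
    struct vm_state *dst;
    uint8_t *tx, *rx;

    if (len == 0 || target_nid >= MAX_NID)
        return -1;
    dst = nid_map[target_nid];
    if (dst == NULL || dst->pid != target_pid)
        return -1;                      /* unknown target node or process */

    tx = gpa_to_hva(src, local_gpa, len);
    rx = gpa_to_hva(dst, target_gpa, len);
    if (tx == NULL || rx == NULL)
        return -1;                      /* buffer outside guest memory */

    memcpy(rx, tx, len - 1);            /* everything except the last byte */
    atomic_thread_fence(memory_order_release);
    ((volatile uint8_t *)rx)[len - 1] = tx[len - 1];  /* last byte last */
    return 0;
}
```

With nodes 4 and 5 both registered under process ID 16, as in the benchmark setup, a ping corresponds to a call such as handle_put(&vm4, tx_gpa, rx_gpa, len, 5, 16), where vm4, tx_gpa and rx_gpa are placeholders.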
The v3_migrate_core command was used to ensure that each VM was running on its own CPU, and the two CPUs used were selected to be on the same processor socket. Palacios's default behavior is to run both VMs on the same CPU, which was found to provide very poor performance due to the Ping-Pong benchmark's use of busy polling.

The "MPI" line in Figure 2 is the performance of OpenMPI when sending MPI messages between two native MPI processes running on the same node. The MPI processes were bound to the same two CPUs that were used for the VirtIO Portals tests. The Intel MPI Benchmarks (IMB) Ping-Pong benchmark was used for MPI testing.

The VirtIO Portals bandwidth curve ramps up more slowly than MPI, but reaches a much higher peak bandwidth. This is likely due to MPI having an extra memory copy in the data path: the sender first copies the message into a shared memory region, and then the receiver copies the message out of the shared memory region. In contrast, the VirtIO Portals NIC copies directly from the send buffer to the receive buffer. MPI's double copy requires double the memory bandwidth of the VirtIO Portals NIC's single-copy data path. For both MPI and VirtIO Portals, bandwidth drops significantly after the 1 MB message size. This is due to cache effects: the Xeon processor that was used has a shared 4 MB level 3 cache.

Figure 3 plots the latency for both the VirtIO Portals NIC and MPI. The higher latency of the VirtIO Portals NIC is due to the extra overhead of entering and exiting Palacios. Each VMExit was benchmarked to require around 5 microseconds of overhead. Assuming this is split evenly between entering and exiting Palacios, this likely accounts for the roughly 2.5 microsecond higher latency of the VirtIO Portals NIC.

Error bars are shown for the VirtIO Portals NIC results in both figures. The reason for the outliers is not known at this time. The outliers are not repeatable, and move around from run to run. This suggests that host OS or VMM noise may be an issue.
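The report does not spell out how the measured round-trip times are turned into the bandwidth and latency values that are plotted. A conventional Ping-Pong accounting, assumed here, takes half the round-trip time as the one-way latency and divides the message size by that time to get bandwidth. The sketch below shows that arithmetic together with the power-of-two message size sweep; the function names are hypothetical, and "MB" is taken as 10^6 bytes, which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* One-way latency under the usual Ping-Pong convention: half the round trip. */
double one_way_usec(double rtt_usec)
{
    return rtt_usec / 2.0;
}

/* Bytes per microsecond equals 10^6 bytes per second, i.e. decimal MB/s. */
double bandwidth_mb_per_s(size_t bytes, double rtt_usec)
{
    assert(rtt_usec > 0.0);
    return (double)bytes / one_way_usec(rtt_usec);
}

/* Fill `out` with sizes min, 2*min, 4*min, ... up to max; returns the count. */
size_t sweep_sizes(size_t min, size_t max, size_t *out, size_t cap)
{
    size_t n = 0;

    assert(min > 0);
    for (size_t s = min; s <= max && n < cap; s *= 2)
        out[n++] = s;
    return n;
}
```

For example, a 1,000,000-byte message with a 2000 microsecond round trip gives a one-way time of 1000 microseconds and a bandwidth of 1000 MB/s under these assumptions.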

Figure 2: Bandwidth between two VMs/Processes on the same node. (Series: VirtIO P4, MPI. Axes: Bandwidth (MB/s) versus Message Size (Bytes).)

Figure 3: Latency between two VMs/Processes on the same node. (Series: VirtIO P4, MPI. Axes: Latency (microseconds) versus Message Size (Bytes).)

5 Conclusion

This project developed an RDMA-like VirtIO network device for the Palacios virtual machine monitor that enables virtual machines running on the same host system to communicate with one another directly, without any intermediate memory copies. The project required gaining a deep understanding of the VirtIO interface, creating a VirtIO Portals network interface device in Palacios, adding VirtIO support to the Kitten operating system (which was used as the guest), and performing a benchmark analysis using a Ping-Pong benchmark. The foundation created by this project will be used in future work to create a complete and high-performance virtual Portals4 network device for Palacios, as part of a funded DOE/ASCR X-stack project.

References

[1] W. Huang, J. Liu, B. Abali, and D. Panda, "A case for high performance computing with virtual machines," in Proceedings of the 20th International Conference on Supercomputing (ICS), June.

[2] J. Lange, K. Pedretti, P. Dinda, P. G. Bridges, C. Bae, P. Soltero, and A. Merritt, "Minimal-overhead virtualization of a large scale supercomputer," in Proceedings of the 2011 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '11), (Newport Beach, CA), March.

[3] R. Russell, "virtio: Towards a De-Facto Standard For Virtual I/O Devices," ACM SIGOPS Operating Systems Review, vol. 42, July.

[4] Sandia National Laboratories, "The Kitten Lightweight Kernel." kitten.
