A Modular High Performance Implementation of the Virtual Interface Architecture


Patrick Bozeman
Bill Saphir
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory

1. Overview

The Virtual Interface Architecture (VIA) is an industry standard for low-latency, high-bandwidth interprocess communication over system area networks (SANs). The VIA specification describes a software interface for fully protected user-level communication that can be accelerated by relatively inexpensive VIA-aware hardware. We describe M-VIA, a modular, high-performance, and freely available implementation of VIA for Linux.

M-VIA makes two significant contributions to the state of the art. First, M-VIA's modularity allows it to support many types of network interfaces (NICs), including legacy NICs and newer smart NICs that have special support for VIA. This high degree of portability has not been achieved or attempted by other user-space communication projects. M-VIA's modularity introduces little overhead, so M-VIA still achieves high performance. Second, M-VIA provides applications with a portable and robust interface that verifiably conforms to the VIA standard, including connection management, error detection, error recovery, and precisely defined semantics. These features make it suitable as a reference implementation and as a base for commercial software development. Previous proof-of-concept research projects have demonstrated high performance but have not emphasized robustness in either the interface or the implementation.

M-VIA is implemented as a set of loadable kernel modules for Linux and a user-level library. It supports so-called VIA doorbells where they are provided by VIA-aware hardware, and implements software doorbells with a fast trap (a trap to privileged mode that does not incur the overhead of a system call) for legacy hardware.
Transfer of data occurs directly from an application's address space, with no copy other than what is required by the network interface, and no operating system overhead in the critical path. M-VIA coexists with traditional networking, allowing a single network to be used for both VIA and IP traffic.

2. Overview of Virtual Interface Architecture (VIA)

Academic researchers have developed a variety of techniques for performing very low overhead communication on almost any network. Their research has shown that one can avoid the copying and processing overhead associated with TCP, as well as the overhead of a system call, while still providing full protection. Well-known examples are Active Messages [Eicken92], U-Net [Basu96], and Fast Messages [Pakin95]. These projects have demonstrated a proof of concept, but they have not been widely adopted, even within the high performance computing community. IP remains the only protocol that is widely available.

Virtual Interface Architecture (VIA) is a production-oriented, high-performance communication mechanism for system area networks (SANs). Its design was strongly influenced by the academic research on low-overhead communication as well as experience with MPPs [Pierce94]. Like these projects, VIA provides fully protected user-level access to a network interface. Because of its widespread industry support (Intel, Compaq and Microsoft are the three primary promoters of VIA), it is likely that VIA will become widely adopted. Moreover, VIA can be accelerated by relatively inexpensive VIA-aware hardware, and such hardware will more naturally support VIA than competing communication mechanisms. Examples include Giganet [Giganet98], Synfinity [Larson98] and ServerNet-II [Tandem95].

The VIA 1.0 specification [VIA97] was finished in December 1997, after feedback from over a hundred industrial and academic contributors, including the authors of this paper. It provides send and receive operations for message passing, as well as remote memory access operations, which allow read/write access to the memory of a remote process without the explicit cooperation of that process. VIA communication is categorized as unreliable, reliable delivery, or reliable reception. Implementations may provide one or more of these modes, usually depending on characteristics of the hardware (though software may implement reliable VIA on unreliable hardware). VIA provides protected zero-copy data transfer (where supported by network hardware) without requiring operating system kernel assistance. VIA requires that memory used in communication be registered by the application prior to communication, to avoid page faults on transmission or reception of data.

Higher-level communication APIs such as the Message Passing Interface (MPI) can be efficiently layered on VIA [Dimitrov99]. While we are primarily interested in scientific computation, VIA has a number of commercial applications in the area of high performance servers. Other important commercial drivers for VIA are the forthcoming NGIO and Future IO standards for high performance peripherals. NGIO is expected to rely on VIA as its transport mechanism.

The VI Architecture consists of three components. The user-visible component is a library known as VIPL (VI Provider Library) that contains routines for data transfer, connection management, queue management, memory registration, and error handling. The second component, the VI Kernel Agent, provides necessary kernel services, including connection management and memory registration. The third component, the VI Network Interface (VI NIC), performs the actual data transfer.
It is conceptually a piece of hardware, but may be implemented as a combination of hardware and software. The NIC can directly access user memory and provides a doorbell (usually a memory-mapped register) that VIPL uses to notify the NIC that new entries have been placed in VI work queues. To send and receive messages, a user application writes a VIA descriptor in an area of registered memory and calls a VIPL routine that presses the doorbell to let the NIC know that the descriptor is available for processing.

Of the three major components, only VIPL is specified in detail by the VIA standard, and even this specification is only a recommendation. To enable truly portable applications, Intel wrote the VI Architecture Developer's Guide [Intel98], which specifies the VIA API in much more detail. Intel also released an extensive conformance test suite to determine whether VIA implementations are in compliance. VIA applications that use VIPL (as clarified by the Developer's Guide) should be portable between different conforming VIA implementations. The majority of the VIA community supports the adoption of the standard interface specified in the Developer's Guide.

3. The M-VIA Implementation of VIA

We have developed a high-performance modular implementation of VIA for the Linux operating system called Modular VIA (M-VIA). M-VIA is implemented as a user-level library (libvipl.a) and at least two loadable kernel modules for Linux. The core module is device-independent and provides the majority of functionality needed by VIA. One or more device-specific modules, called device modules, implement device-specific functionality. A device module is essentially a device driver, and includes the standard device driver code plus M-VIA-specific modifications.

M-VIA Modular Design

A primary design goal of M-VIA is to enable the rapid implementation of VIA for new network interfaces, including legacy dumb NICs as well as newer smart NICs with either special VIA support (e.g.
support for VIA doorbells and VIA descriptor processing) or programmable processors. M-VIA achieves this goal through a modular implementation. It provides a complete VIA framework, but allows a device module to replace a subset of VIA functionality in a device-specific way. With no hardware support, we describe a VIA implementation as software-only, and otherwise call it hardware-accelerated. This modular division between core management and device-specific operations facilitates the rapid development of support for new devices.

In a hardware-accelerated implementation, the device module can register hardware functionality to allow the hardware to take over core functions, such as memory and doorbell management. In particular, VIA doorbells for VIA-aware hardware are usually implemented as memory-mapped registers read and written by user-level code to tell the network interface that new descriptors have been posted. With hardware acceleration, M-VIA requires no memory-to-memory copies to transfer data.

In a software-only implementation, it is critical that the doorbell operation have as little overhead as possible. M-VIA uses a fast trap to execute privileged code with minimum overhead. A fast trap incurs significantly less overhead than a system call, which performs additional operations related to scheduling and signal processing. The 38 instructions written in assembly code to implement the fast trap constitute the only processor-specific code in M-VIA (currently the x86 architecture is supported; Alpha and PowerPC support are planned). In a software-only implementation, data transmission requires a single memory-to-memory copy inside the interrupt handler at the receiver. This copy is unavoidable for protected communication without special hardware support.

M-VIA provides wire level interoperability among software-only NICs. This is facilitated by an additional abstraction called a Device Class, which is a framework within the device module for handling devices with similar characteristics. For instance, an EtherRing class can be used for Ethernet devices with a circular queue of buffer descriptors.
The majority of Ethernet devices use this as their internal architecture. Of course, wire level interoperability is not restricted to such a class, only facilitated by it.

Modularity does not adversely affect performance. Time-sensitive operations, such as the actual transmission of data, are fast-pathed. Specifically, communication between Devices and Device Classes is through macros rather than function calls, and VIA doorbell operations for software NICs are implemented with fast traps.

M-VIA achieves high bandwidth for software-only NICs by incorporating virtual memory management into the core module, enabling the transfer of data directly from an application's address space, with no additional memory copies other than those required by the network interface. A side benefit of this approach is that communication within an SMP requires only a single memory copy, whereas arbitrary message passing between separate address spaces requires two copies for any mechanism that is implemented purely in user space. Thus, the bandwidth of non-pipelined VIA communication between two processes on an SMP is approximately two times higher than is achievable through other mechanisms.

M-VIA Core Module

The M-VIA core module is divided into device-independent, reusable, functional components:

Connection Manager: Establishes logical point-to-point connections between VIs.
Protection Tag Manager: Allocates, deallocates, and validates memory protection tags.
Registered Memory Manager: Handles the registration of user communication buffers and descriptor buffers.
Completion Queue Manager: Manages the optional completion queues associated with VI work queues, as well as user requests to block on completion.
Error Queue Manager: Provides a mechanism for VIA devices to post asynchronous errors and for the asynchronous error handling threads of VI applications to block on errors.
Linux Kernel Extensions: Provides functionality required for efficient implementation, including condition variables, user to kernel memory remapping, and user address to physical address translation.
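
The relationship between these core components and a device module, in which the core supplies default behavior and a device module replaces only what it needs, can be illustrated with a table of function pointers. The following is a minimal user-space sketch in C, not M-VIA source code: `struct via_ops`, `via_register_device`, and every other name here are hypothetical, and the integer return values exist only to show which implementation ran.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers showing which implementation handled a call. */
enum via_impl { VIA_IMPL_CORE = 1, VIA_IMPL_DEVICE = 2 };

/* Hypothetical operation table; a NULL entry means "use the core default". */
struct via_ops {
    int (*register_mem)(void *addr, size_t len);
    int (*ring_doorbell)(int vi_id);
};

/* Core-module defaults (stand-ins for the software-only code paths). */
int core_register_mem(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    return VIA_IMPL_CORE;
}

int core_ring_doorbell(int vi_id)
{
    (void)vi_id;
    return VIA_IMPL_CORE;
}

/* Example override a hardware-accelerated device module might supply. */
int hw_ring_doorbell(int vi_id)
{
    (void)vi_id;
    return VIA_IMPL_DEVICE;
}

/* Build the effective table: start from the core defaults, then let the
 * device module replace any operation it provides. */
struct via_ops via_register_device(const struct via_ops *dev)
{
    struct via_ops ops = { core_register_mem, core_ring_doorbell };

    assert(dev != NULL);
    if (dev->register_mem != NULL)
        ops.register_mem = dev->register_mem;
    if (dev->ring_doorbell != NULL)
        ops.ring_doorbell = dev->ring_doorbell;
    return ops;
}
```

In this sketch a device that registers `{ NULL, hw_ring_doorbell }` keeps the core memory registration but takes over the doorbell, which mirrors the software-only versus hardware-accelerated split described above.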

The core module provides the default functionality for all VIA operations. To perform device-specific functions, the framework components listed above call routines registered by specific device modules. For example, the Connection Manager handles the common support issues: queuing requests, blocking for connection completion, verifying connection attributes, assigning a unique connection id, and so on. However, the Connection Manager calls functions registered by the device module to actually transmit the request, acceptance, or rejection of a connection to a remote device.

Operations that are entirely device specific, such as the creation and destruction of VIs and the transmission of data to and from the wire, are passed directly to the appropriate device. However, the core framework provides some functions that make such operations easier to implement. For example, generic descriptor processing routines are provided for use by software-only devices.

Device Modules

A device module provides the abstraction of a VI NIC. When a device module registers itself with the core module, the device module informs the core module of its capabilities, such as whether it supports VIA directly in hardware, its native MTU size, and the maximum number of VIA descriptors that can be queued for transmission. The device module also registers device-specific functions to be used by the modular managers from the core module. The developer of the device module has the option of overriding any and all of the default functionality provided by the core module. For example, if a device that provides native VIA hardware support uses its own mechanism for registering memory, it may completely replace the Registered Memory Manager with an implementation of its own.

Device Classes

Many commodity network interfaces can be logically grouped into common categories such as Ethernet, ATM, FDDI, etc.
In order to promote wire level interoperability and rapid development through code reuse, device modules can be written using an internal abstraction called a Device Class. M-VIA device classes are slightly finer-grained than network types; the EtherRing category mentioned above is one example. Device Classes enable common routines for a class of network interfaces to be shared by device modules. Such routines include the construction and interpretation of media-specific VIA headers and mechanisms for enabling VIA to coexist with traditional networking protocols such as TCP/IP. While Device Classes are not explicitly supported by the device module, the device module interface is designed to facilitate their use. Macros are used for communication between device-specific code and a device class, and these are integrated into a device module.

The VI Provider Library (VIPL)

M-VIA contains a single VI Provider Library, VIPL, which is interoperable with software-only and hardware-accelerated VIA devices developed within the M-VIA framework. Device modules specify to VIPL whether the VI Provider Library should use ioctl system calls or fast traps to call time-sensitive VI Kernel Agent services. Device modules also specify whether the VIA doorbell mechanism is supported directly in hardware as a true memory-mapped doorbell or should be emulated with a fast trap.

M-VIA 2

Based on experience gained with M-VIA 1, we have begun the design of a modified internal organization in M-VIA 2. The modifications affect both the VIPL and the Core Module. M-VIA 2 design documents are available at

The original design of M-VIA was based upon early drafts of the Virtual Interface Architecture Specification. Unfortunately, when the VI Architecture 1.0 specification was released, it relaxed the specification in areas relating to hardware interaction, becoming a specification of the user-level VIA component only. This change requires devices to be capable of providing custom user-level functionality in order to operate efficiently. Currently, NIC-specific functionality can be substituted in the Kernel Agent only.

Two specific examples of this are doorbells and completion queues. In pre-1.0 versions of the VI Architecture specification, doorbell operations used a standardized Doorbell Token format. The Doorbell Token format is no longer specified in VIA 1.0 (including in the Developer's Guide). A similar problem occurred with the introduction of Completion Queues in the VI Architecture Specification 1.0. To be implemented efficiently, Completion Queues require direct communication between the VI NIC and VIPL. However, the mechanism and data structures used to accomplish this are not defined. The modularized VIPL implementation in M-VIA 2 will enable the substitution of device-specific functionality at the user level as well as inside the kernel.

Functionality and Conformance

The Intel Virtual Interface Architecture Developer's Guide describes three levels of conformance to the VIPL API: Early Adopter, Functional, and Full conformance. The Intel VI Architecture Conformance Suite [Intel98a] tests an implementation's conformance to the VIPL API. The conformance suite, consisting of over lines of code, performs thousands of individual tests grouped into functional categories: 34 for Early Adopter, 134 for Functional Conformance, and 156 for Full Conformance. The suite covers basic VIPL semantic compliance, resource management, proper handling of error conditions and invalid inputs, and network stress tests.
M-VIA passes all of the Early Adopter conformance tests on unreliable networks and includes RDMA Write capability. Reliable Delivery and Reliable Reception will be supported for networks that support these. At the Functional Conformance level, M-VIA implements all functionality except peer-to-peer connection management and resizing of completion queues; the implemented functionality includes synchronous error handling, remote disconnect notification, and Protection Tag support. M-VIA passes 109 of the 134 Functional Conformance tests included in the test suite. The tests that M-VIA does not pass either contain bugs or call the peer-to-peer connection management routines. The only additional functions missing from the Full Conformance level are the notify routines, which are essentially syntactic sugar.

M-VIA uses POSIX threads (pthreads) internally for asynchronous error notification and is pthreads-compatible, enabling the development of multi-threaded user applications. Operations performed within a multi-threaded application on different VIs are inherently thread safe, but an application must currently provide its own explicit locks if multiple threads access a single VI. A fully thread-safe VIPL is planned for a later M-VIA release.

Implementation Status

M-VIA 1.0 supports four NIC types: loopback, Fast Ethernet cards based on the DEC Tulip chip, the Packet Engines GNIC-1 Gigabit Ethernet card, and the Packet Engines GNIC-II Gigabit Ethernet card. We have focused on only a small number of interfaces for two reasons. First, we anticipated fine-tuning the internal interfaces, and did not want to redo the work of implementing all the drivers. Second, our primary goal was a complete and robust implementation of VIA.
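
The requirement noted under Functionality and Conformance, that an application supply its own locks when several threads share one VI, might be met with a small wrapper like the one below. This is only a sketch under stated assumptions: `struct locked_vi`, `locked_vi_post`, the `post_fn` signature, and `stub_post` are all hypothetical and do not come from VIPL; a real program would call its actual VIPL post routine where the stub is used.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wrapper pairing one VI handle with an application-level lock. */
struct locked_vi {
    pthread_mutex_t lock;   /* serializes all access to this VI */
    void *vi;               /* opaque stand-in for a VI handle */
    unsigned posted;        /* descriptors successfully posted via the wrapper */
};

/* Assumed shape of a post routine: returns 0 on success. */
typedef int (*post_fn)(void *vi, void *descriptor);

int locked_vi_init(struct locked_vi *lv, void *vi)
{
    assert(lv != NULL);
    lv->vi = vi;
    lv->posted = 0;
    return pthread_mutex_init(&lv->lock, NULL);
}

/* Post a descriptor while holding the per-VI lock, so that two threads
 * never manipulate the same VI's work queue concurrently. */
int locked_vi_post(struct locked_vi *lv, post_fn post, void *descriptor)
{
    int rc;

    pthread_mutex_lock(&lv->lock);
    rc = post(lv->vi, descriptor);
    if (rc == 0)
        lv->posted++;
    pthread_mutex_unlock(&lv->lock);
    return rc;
}

/* Stub post routine so the sketch can be exercised without VIA hardware. */
int stub_post(void *vi, void *descriptor)
{
    (void)vi;
    return descriptor != NULL ? 0 : -1;
}
```

One lock per VI keeps threads that use different VIs from contending with each other, which matches the statement that operations on different VIs are already thread safe.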

As described in section 3.1.5, we are currently redesigning internal interfaces for VIPL and the VI Kernel Agent, based on experience with the original design, in order to improve support for smart NICs. This redesign will form the basis of M-VIA 2. With M-VIA 2, we expect a substantial increase in third-party driver development. There are third-party plans to implement drivers for Giganet, Myrinet, ServerNet, Alteon Gigabit Ethernet, and several Intel NICs. M-VIA is freely available for download over the Internet at

Performance

While the primary focus of M-VIA development so far has been functionality, robustness, and modularity, it achieves excellent performance as well. We present here some basic performance comparisons to demonstrate this, leaving a more detailed analysis for another report.

Latency and bandwidth reported below are measured using a simple ping-pong benchmark, in which two processes send a buffer of data back and forth. Latency is one-half the round-trip time for 4-byte messages, and bandwidth is message size divided by one-half the round-trip time for byte messages. (This number is an artifact of our benchmark program, which uses exponentially increasing message sizes up to 32K. Bandwidth is not sensitive to this value.) Although this is a crude measure, the same conclusions hold up under more detailed analysis.

The following tables show M-VIA performance (under Linux) and TCP performance under several operating systems, using identical processors and each of three NIC types: Loopback (a virtual loopback device, not involving a PCI device), Tulip-based Fast Ethernet (Kingston), and the Packet Engines GNIC-II Gigabit Ethernet NIC. We used PCs with 400 MHz Pentium II processors and Corsair CAS-2 PC-100 memory on ASUS-P2X motherboards. The Tulip and GNIC-II measurements were made with uniprocessor systems connected back-to-back.
The loopback measurements were made on a 2-processor system with the same processors and memory, and an ASUS motherboard from the same family. Linux measurements are based on the SMP kernel; Solaris measurements are based on Solaris 7 for x86; Windows NT measurements are for NT 4.

We observe that M-VIA performance is significantly better than TCP performance in all cases, and that the relative performance of VIA is better for faster networks. While a comparison to TCP performance is not a definitive assessment, it does demonstrate that M-VIA performance is respectable.

                            M-VIA/Linux   TCP/Linux   TCP/Solaris   TCP/NT
Loopback
GNIC-II Gigabit Ethernet                              NA*           82.5
Tulip Fast Ethernet

Table 1: Latency (in microseconds). Lower is better.

                            M-VIA/Linux   TCP/Linux   TCP/Solaris   TCP/NT
Loopback
GNIC-II Gigabit Ethernet                              NA            14.8
Tulip Fast Ethernet

Table 2: Bandwidth (in Megabytes/s). Higher is better.

* A GNIC-II driver is not available for Solaris 7/x86.
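
The two ping-pong metrics defined in this section reduce to simple arithmetic on the measured round-trip time. The following C sketch shows one way to compute them; the function names are hypothetical, and it assumes that round-trip time is measured in microseconds and that one megabyte means 10^6 bytes (the text does not say which convention the tables use).

```c
#include <assert.h>
#include <stddef.h>

/* One-way latency in microseconds: half of the measured round-trip time. */
double one_way_latency_us(double round_trip_us)
{
    assert(round_trip_us > 0.0);
    return round_trip_us / 2.0;
}

/* Bandwidth: message size divided by half the round-trip time.
 * Bytes per microsecond equals 10^6 bytes per second, so the result
 * is in MB/s under the decimal-megabyte assumption stated above. */
double bandwidth_mb_per_s(size_t msg_bytes, double round_trip_us)
{
    return (double)msg_bytes / one_way_latency_us(round_trip_us);
}
```

For example, a hypothetical 4096-byte message with a 128 µs round trip has a one-way time of 64 µs and therefore a bandwidth of 64 MB/s under these assumptions.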

Comparisons to other VIA implementations are difficult because we do not have an apples-to-apples comparison on the same hardware. We mention here some other results to provide perspective, though a direct comparison is not appropriate. An early proof-of-concept VIA implementation from Intel on Fast Ethernet hardware had a latency of about 60 µs [Berry97], more than twice the latency we report here for Tulip-based Ethernet. Berkeley VIA, a partial implementation of VIA oriented towards research, has a latency of 35 µs and bandwidth of 51 MB/s on Myrinet. U-Net on Tulip Fast Ethernet [Welsh96] with 200 MHz Pentium Pro processors has a latency of approximately 25 µs. The M-VIA fast-trap mechanism is essentially the same as that used by U-Net, so we expect performance to be nearly identical. Giganet reports a VIA latency of 8.5 µs for their NT implementation of VIA with specialized VIA-aware hardware. In all cases, the biggest bottleneck is ultimately the PCI interface. When NGIO and/or Future IO devices become available, we expect latency to fall considerably.

4. Conclusions and Plans

M-VIA is the first non-proprietary implementation of VIA. Its modular design facilitates rapid implementation on new network adapters and interoperability, without compromising high performance. An important goal of our work is to provide a reference implementation of VIA that will promote and facilitate the development of high-performance portable VIA applications, and facilitate the development of VIA on other systems. As we have described, M-VIA enables the rapid development of drivers for new NICs, providing portability among NICs. We also have plans, or know of plans, to port M-VIA to new processors (the only processor-specific code is related to the fast trap mechanism for software-only drivers), providing portability among processors.
Furthermore, although M-VIA obviously has operating system dependencies, we do not believe there are any fundamental difficulties in porting it to new operating systems. M-VIA development started on FreeBSD before moving to Linux, and a preliminary assessment of the feasibility of an NT port [Buonadonna99] is encouraging.

5. References

[Basu96] A. Basu, V. Buch, W. Vogels, T. von Eicken. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December
[Berry97] F. Berry, E. Deleganes, A. M. Merritt, Intel Corporation. The Virtual Interface Architecture Proof of Concept Performance Results. Available at
[Boden95] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, W. Su. Myrinet -- A Gigabit-per-Second Local Area Network. IEEE Micro, Vol. 15, February 1995, pp
[Buonadonna98] P. Buonadonna, A. Geweke, D. Culler. An Implementation and Analysis of the Virtual Interface Architecture. Proceedings of SC98, Orlando, Florida, November
[Buonadonna99] P. Buonadonna, private communication. April
[Clark89] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen. An Analysis of TCP Processing Overhead. IEEE Communications Magazine, Jun
[Dimitrov99] R. Dimitrov, A. Skjellum. An Efficient MPI Implementation for Virtual Interface (VI) Architecture-Enabled Cluster Computing. Proceedings of the MPI Developers Conference,
[Eicken92] T. von Eicken, D. Culler, S. C. Goldstein, K. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. Proceedings of the 19th Int'l Symposium on Computer Architecture, Gold Coast, Australia, May
[Giganet98] GigaNet Corporation. High Performance cLAN Host Adapters. Available at
[Intel98] Intel Corporation. The Intel VI Architecture Developer's Guide V1.0. September. Available at ftp://download.intel.com/design/servers/vi/intel.pdf
[Intel98a] Intel Corporation. The Intel VI Architecture Conformance Suite User's Guide v0.5. December. Available at ftp://download.intel.com/design/servers/vi/userguide_v0.5.pdf
[Larson98] J. Larson. The HAL Interconnect PCI Card.
[Pakin95] S. Pakin, M. Lauria, A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. Proceedings of Supercomputing '95, San Diego, California
[Pierce94] P. Pierce, G. Regnier. The Paragon Message Passing Interface Paper. SHPCC94, 1994
[Tandem95] Tandem Corporation. ServerNet Interconnect Technology.
[VIA97] Compaq Computer Corp., Intel Corporation, Microsoft Corporation. Virtual Interface Architecture Specification. Available at
[Welsh96] M. Welsh, A. Basu, T. von Eicken. Low-Latency Communication over Fast Ethernet. Proceedings of Euro-Par '96, Lyon, France, August 27-29, 1996.
