LUSTRE NETWORKING
High-Performance Features and Flexible Support for a Wide Array of Networks
White Paper, November 2008

Abstract

This paper provides information about Lustre networking that can be used to plan cluster file system deployments for optimal performance and scalability. The paper includes information on Lustre message passing, Lustre Network Drivers, and routing in Lustre networks, and describes how these features can be used to improve cluster storage management. The final section of this paper describes new Lustre networking features that are currently under consideration or planned for future release.

Table of Contents

Challenges in Cluster Networking
Lustre Networking Architecture and Current Features
    LNET architecture
    Network types supported in Lustre networks
    Routers and multiple interfaces in Lustre networks
Applications of LNET
    Remote direct memory access (RDMA) and LNET
    Using LNET to implement a site-wide or global file system
    Using Lustre over wide area networks
    Using Lustre routers for load balancing
Anticipated Features in Future Releases
    New features for multiple interfaces
    Server-driven QoS
    A router control plane
    Asynchronous I/O
Conclusion

Chapter 1 Challenges in Cluster Networking

Networking in today's datacenters presents many challenges. For performance, file system clients must access servers using native protocols over a variety of networks, preferably leveraging capabilities such as remote direct memory access. In large installations, multiple networks may be encountered, and all storage must be simultaneously accessible over those networks through routers and by using multiple network interfaces on the servers. While storage management nightmares such as staging multiple copies of data on file systems local to a cluster are common practice, they are also highly undesirable.

Lustre networking (LNET) provides features that address many of these challenges. Chapter 2 provides an overview of some of the key features of the LNET architecture. Chapter 3 discusses how these features can be used in specific high-performance computing (HPC) networking applications. Chapter 4 looks at how LNET is expected to evolve to enhance load balancing, quality of service (QoS), and high availability in networks on a local and global scale. Chapter 5 provides a short synopsis and recap.

Chapter 2 Lustre Networking Architecture and Current Features

The LNET architecture comprises a number of key features that can be used to simplify and enhance HPC networking.

LNET architecture

The LNET architecture has evolved through extensive research into a set of protocols and application programming interfaces (APIs) that support high-performance, high-availability file systems. In a cluster with a Lustre file system, the system network is the network connecting the servers and the clients. LNET is used only over the system network, where it provides all of the communication infrastructure required by the Lustre file system.

The disk storage in a Lustre file system is connected to metadata servers (MDSs) and object storage servers (OSSs) using traditional storage area network (SAN) technologies. However, this SAN does not extend to the Lustre client systems and typically does not require SAN switches.

Key features of LNET include:

- Remote direct memory access (RDMA), when supported by underlying networks such as Elan, Myrinet, and InfiniBand
- Support for many commonly used network types, such as InfiniBand and IP
- High-availability and recovery features that enable transparent recovery in conjunction with failover servers
- Simultaneous availability of multiple network types, with routing between them

Figure 1 shows how these network features are implemented in a cluster deployed with LNET.

[Figure 1. Lustre architecture for clusters: a clustered MDS pool (1 to 100 nodes, active and standby MDS sharing metadata target (MDT) storage), 1 to 1000s of object storage servers (OSSs) with object storage targets (OSTs) on commodity or enterprise-class storage arrays and SAN fabric, and 1 to 100,000 Lustre clients connected simultaneously over multiple network types (Elan, Myrinet, InfiniBand, GigE), with shared storage enabling failover and a router linking networks.]

LNET is implemented using layered software modules. The file system uses a remote procedure call (RPC) API with interfaces for recovery and bulk transport. This API, in turn, uses the LNET message passing API, which has its roots in the Sandia Portals message passing API, a well-known API in the HPC community.

The LNET architecture supports pluggable drivers that provide support for multiple network types, individually or simultaneously, similar in concept to the Sandia Portals network abstraction layer (NAL). The drivers, called Lustre Network Drivers (LNDs), are loaded into the driver stack, with one LND for each network type in use. Routing is possible between different networks. Routing was implemented early in the Lustre product cycle to provide a key customer, Lawrence Livermore National Laboratory (LLNL), with a site-wide file system (discussed in more detail in Chapter 3, Applications of LNET).
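As a small illustration of the pluggable LND model, the sketch below shows how an administrator might confirm which LNDs a node has loaded. The module names given in the comments (ksocklnd for TCP, ko2iblnd for OpenFabrics InfiniBand) are typical examples and may differ between Lustre releases.

    # List the LNET core module and any Lustre Network Driver (LND)
    # modules currently loaded on this node.
    lsmod | egrep 'lnet|lnd'
    #   lnet       core LNET message passing layer
    #   ksocklnd   LND for TCP networks (tcp0, tcp1, ...)
    #   ko2iblnd   LND for OpenFabrics InfiniBand networks (o2ib0, ...)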

Figure 2 shows how the software modules and APIs are layered.

[Figure 2. Modular LNET implemented with layered APIs: vendor network device libraries at the bottom (not supplied, not portable), the Lustre Network Drivers (LNDs) supporting multiple network types, the LNET library with its network I/O (NIO) API (similar to Sandia Portals, with some new and different features: it moves small and large buffers, uses RDMA, and generates events), and Lustre request processing on top (zero-copy marshalling libraries, a service framework and request dispatch, connection and address naming, and generic recovery infrastructure).]

A Lustre network is a set of configured interfaces on nodes that can send traffic directly from one interface on the network to another. In a Lustre network, configured interfaces are named using network identifiers (NIDs). A NID is a string of the form <address>@<type><network id>. Examples of NIDs are 192.168.1.1@tcp0, designating an address on the 0th Lustre TCP network, and 4@elan8, designating address 4 on the 8th Lustre Elan network.

Network types supported in Lustre networks

The LNET architecture includes LNDs to support many network types, including:

- InfiniBand (IB): OpenFabrics IB versions 1.0, 1.2, 1.2.5, and 1.3
- TCP: Any network carrying TCP traffic, including GigE, 10GigE, and IPoIB
- Quadrics: Elan3 and Elan4
- Myricom: GM and MX
- Cray: SeaStar and RapidArray

The LNDs that support these networks are pluggable modules for the LNET software stack.
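To make the NID naming concrete, the following sketch shows how Lustre networks are commonly declared through the lnet kernel module options. The interface names and addresses are hypothetical, and the exact option syntax should be checked against the Lustre release in use.

    # /etc/modprobe.conf (or a modprobe.d file) on a node with one Ethernet
    # and one InfiniBand interface. The node joins the tcp0 network on eth0
    # and the o2ib0 network on ib0; its NIDs then take the forms
    # <eth0 address>@tcp0 and <ib0 address>@o2ib0.
    options lnet networks="tcp0(eth0),o2ib0(ib0)"

    # After the lnet module loads, the node's NIDs can be listed with:
    #   lctl list_nids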

Routers and multiple interfaces in Lustre networks

A Lustre network consists of one or more interfaces on nodes, configured with NIDs, that can communicate without the use of intermediate router nodes with their own NIDs. LNET can conveniently define a Lustre network by enumerating the IP addresses of the interfaces forming the network. A Lustre network is not required to be physically separated from another Lustre network, although that is possible.

When more than one Lustre network is present, LNET can route traffic between networks using routing nodes in the network. An example of this is shown in Figure 3, where one of the routers is also an OSS. If multiple routers are present between a pair of networks, they offer both load balancing and high availability through redundancy.

[Figure 3. Lustre networks connected through routers: an elan0 Lustre network (Elan switch) and a tcp0 Lustre network (Ethernet switch) joined by a router holding an address on each network (132.6.1.10 on the Elan side and 192.168.0.10 on the TCP side); the TCP clients access the MDS through the router, and one of the routers is also an OSS.]

When multiple interfaces of the same type are available, load balancing traffic across all links becomes important. If the underlying network software for the network type supports interface bonding, resulting in one address, then LNET can rely on that mechanism. Such interface bonding is available for IP networks and Elan4, but not presently for InfiniBand.
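The following is a rough sketch of how the routed topology in Figure 3 might be expressed to LNET. The addresses mirror the figure, the Elan network is shown without an interface name, and the module parameters should be verified against the documentation for the Lustre release in use.

    # On the router node, which has an interface on each Lustre network,
    # bring up both networks and enable LNET forwarding:
    options lnet networks="elan0,tcp0(eth0)" forwarding="enabled"

    # On the TCP clients, declare a route to the elan0 network through the
    # router's NID on the tcp0 network:
    options lnet networks="tcp0(eth0)" routes="elan0 192.168.0.10@tcp0"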

If the network does not provide channel bonding, Lustre networks can help. Each interface is placed on a separate Lustre network, and the clients on these Lustre networks together can utilize all of the server interfaces. This configuration also provides static load balancing. Additional features that may be developed in future releases to allow LNET to manage multiple network interfaces even better are discussed in Chapter 4, Anticipated Features in Future Releases.

Figure 4 shows how a Lustre server with several interfaces can be configured to provide load balancing for clients placed on more than one Lustre network. At the top, two Lustre networks are configured as one physical network using a single switch. At the bottom, they are configured as two physical networks using two switches.

[Figure 4. A Lustre server with multiple network interfaces offering load balancing to the cluster: clients on the vib0 and vib1 Lustre networks reach a server with two interfaces (10.0.0.1 on the vib0 network rail and 10.0.0.2 on the vib1 network rail), either through a single shared switch or through one switch per network rail.]
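A hedged sketch of this configuration follows. Figure 4 uses vib (Voltaire InfiniBand) networks; the sketch below uses the OpenFabrics o2ib LND naming instead, and the interface names and address ranges are hypothetical.

    # On the server, place each InfiniBand interface on its own Lustre network:
    options lnet networks="o2ib0(ib0),o2ib1(ib1)"

    # On the clients, ip2nets can assign clients to one network or the other
    # by IP address, so that the two halves of the client population statically
    # balance load across the server's two interfaces:
    options lnet ip2nets="o2ib0 10.0.0.[1-128]; o2ib1 10.0.0.[129-254]"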

Chapter 3 Applications of LNET

LNET provides versatility for deployments. A few opportunities are described in this section.

Remote direct memory access (RDMA) and LNET

With the exception of TCP, LNET supports RDMA on all network types. When RDMA is used, nodes can achieve almost full bandwidth with extremely low CPU utilization. This is advantageous, particularly for nodes that are busy running other software, such as Lustre server software. The LND automatically uses RDMA for large message sizes. However, provisioning nodes with sufficient CPU power and high-performance motherboards can make TCP networking an acceptable trade-off to using RDMA. On 64-bit processors, LNET can saturate several GigE interfaces with relatively low CPU utilization, and with the Dual-Core Intel Xeon processor 5100 series, bandwidth on a 10 GigE network can approach a gigabyte per second. LNET provides extraordinary bandwidth utilization of TCP networks; for example, end-to-end I/O over a single GigE link routinely exceeds 110 MB/sec with LNET.

The Internet Wide Area RDMA Protocol (iWARP), developed by the RDMA Consortium, is an extension to TCP/IP that supports RDMA over TCP/IP networks. Linux supports the iWARP protocol using the OpenFabrics Alliance (OFA) code and interfaces, and the LNET OFA LND supports iWARP as well as InfiniBand.

Using LNET to implement a site-wide or global file system

Site-wide file systems and global file systems are implemented to provide transparent access from multiple clusters to one or more file systems. Site-wide file systems are typically associated with one site, while global file systems may span multiple locations and therefore utilize wide area networking. Site-wide file systems are typically desirable in HPC centers where many clusters exist on different high-speed networks. Typically, it is not easy to extend such networks or to connect them to other networks. LNET makes this possible.
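Regardless of which LND is in use, connectivity to a peer can be checked at the LNET level; a minimal example with hypothetical NIDs is shown below.

    # Verify LNET connectivity from this node to a peer NID. The check works
    # the same way whether the underlying LND uses RDMA (for example, o2ib0)
    # or TCP (tcp0).
    lctl ping 10.10.0.2@o2ib0
    lctl ping 192.168.1.1@tcp0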

An increasingly popular approach is to build a storage island at the center of such an installation. The storage island contains storage arrays and servers and utilizes an InfiniBand or TCP network. Multiple clusters can connect to this island through Lustre routing nodes. The routing nodes are simple Lustre systems with at least two network interfaces: one to the internal cluster network and one to the network used in the storage island. Figure 5 shows an example of a global file system.

[Figure 5. A global file system implemented using Lustre networks: multiple client clusters on their own high-speed networks (Elan4, InfiniBand) connect through dedicated Lustre router nodes and an IP network to a central storage island containing the OSS and MDS server farm and its storage network.]

The benefits of site-wide and global file systems are not to be underestimated. Traditional data management for multiple clusters frequently involves staging data from the file system of one cluster onto another. By deploying a site-wide Lustre file system, multiple copies of the data are no longer needed, and substantial savings can be achieved through improved storage management and reduced capacity requirements.

Using Lustre over wide area networks

The Lustre file system has been successfully deployed over wide area networks (WANs). Typically, even over a WAN, 80 percent of raw bandwidth can be achieved, which is significantly more than many other file systems achieve over local area networks (LANs). For example, within the United States, Lustre file system deployments have achieved a bandwidth of 970 MB/sec over a WAN using a single 10 GigE interface (from a single client). Between Europe and the United States, 97 MB/sec has been achieved over a single GigE connection. On LANs, observed I/O bandwidths are only slightly higher: 1100 MB/sec on a 10 GigE network and 118 MB/sec on a GigE network.
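From the client side, the storage island is reached simply by naming the NID of the Lustre management service when mounting; LNET routing makes that NID reachable from each cluster's own network. The sketch below assumes a Lustre 1.6-style mount command, and the NID and file system name are hypothetical.

    # Mount the site-wide file system from a client in any connected cluster.
    # The management NID (192.168.0.2@tcp0) and file system name (sitefs)
    # are examples only.
    mount -t lustre 192.168.0.2@tcp0:/sitefs /mnt/sitefs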

Routers can also be used to advantage to connect servers distributed over a WAN. For example, a single Lustre cluster may consist of two widely separated groups of Lustre servers and clients, with each group interconnected by an InfiniBand network. As shown in Figure 6, Lustre routing nodes can be used to connect the two groups via an IP-based WAN. Alternatively, the servers could each have an InfiniBand and an Ethernet interface; however, this configuration may require more switch ports, so the routing solution may be more cost effective.

[Figure 6. A Lustre cluster distributed over a WAN: clients and servers at location A and at location B are each interconnected by InfiniBand, and a router at each location connects its InfiniBand network to the IP-based WAN linking the two sites.]

Using Lustre routers for load balancing

Commodity servers can be used as Lustre routers to provide a cost-effective, load-balanced, redundant router configuration. For example, consider an installation with servers on a network with 10 GigE interfaces and many clients attached to a GigE network. It is possible, but typically costly, to purchase IP switching equipment that can connect to both the servers and the clients.

With a Lustre network, the purchase of such costly switches can be avoided. For a more cost-effective solution, two separate networks can be created. A smaller, faster network contains the servers and a set of router nodes with sufficient aggregate throughput. A second client network with slower interfaces contains all the client nodes and is also attached to the router nodes. If this second network already exists and has sufficient free ports to add the Lustre router nodes, no changes to the client network are required. Figure 7 shows an installation with this configuration.

[Figure 7. An installation combining slow and fast networks using Lustre routers: GigE clients on a GigE switch reach 10GigE servers on a 10GigE switch through a load-balancing, redundant farm of router nodes attached to both switches.]

The routers provide a redundant, load-balanced path between the clients and the servers. This network configuration allows many clients together to use the full bandwidth of a server, even if individual clients have insufficient network bandwidth to do so. Because multiple routers stream data to the server network simultaneously, the server network can see data throughput in excess of what a single router can deliver.
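As a rough sketch (network names and addresses are hypothetical), the router farm in Figure 7 might be expressed to LNET as follows. When a client lists several routers for the same remote network, LNET spreads traffic across them and can continue through the remaining routers if one fails.

    # On each router node: one interface on the client (GigE) network tcp0,
    # one on the server (10GigE) network tcp1, with forwarding enabled.
    options lnet networks="tcp0(eth0),tcp1(eth1)" forwarding="enabled"

    # On the GigE clients: reach the server network tcp1 through any of the
    # routers 10.0.0.1 through 10.0.0.4 on the client network tcp0.
    options lnet networks="tcp0(eth0)" routes="tcp1 10.0.0.[1-4]@tcp0"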

Chapter 4 Anticipated Features in Future Releases

LNET offers many features today, and, as with most products, enhancements and new features are intended for future releases. Some possible new features include support for multiple network interfaces, server-driven quality-of-service (QoS) guarantees, asynchronous I/O, and a control interface for routers.

New features for multiple interfaces

As previously mentioned, LNET can currently exploit multiple interfaces by placing them on different Lustre networks. This configuration provides reasonable load balancing for a server with many clients. However, it is a static configuration that does not handle link-level failover or dynamic load balancing. It is Sun's intention to address these shortcomings with the following design. First, LNET will virtualize multiple interfaces and offer the aggregate as one NID to users of the LNET API. In concept, this is quite similar to the aggregation (also referred to as bonding or trunking) of Ethernet interfaces using protocols such as 802.3ad Dynamic Link Aggregation.

The key features that a future LNET release may offer are:

- Load balancing: All links are used, based on availability of throughput capacity.
- Link-level high availability: If one link fails, the other channels transparently continue to be used for communication.

These features are shown in Figure 8.

[Figure 8. Link-level load balancing and failover: with all links up, a client's traffic through the switch is spread evenly across the server's links; if one link fails, traffic continues over the remaining links, so the failure is accommodated without server failover.]

From a design perspective, these load-balancing and high-availability features are similar to those offered by LNET routing, described in Chapter 3 in the section Using Lustre routers for load balancing. A challenge in developing these features is providing a simple way to configure the network. Assigning and publishing NIDs for the bonded interfaces should be simple and flexible and should work even if not all links are available at startup. We expect to use the management server protocol to resolve this issue.

Server-driven QoS

QoS is often a critical issue, for example, when multiple clusters are competing for bandwidth from the same storage servers. A primary QoS goal is to avoid overwhelming server systems with conflicting demands from multiple clusters or systems, which results in performance degradation for all clusters. Setting and enforcing policies is one way to avoid this. For example, a policy can be established that guarantees a certain minimal bandwidth to resources that must respond in real time, such as visualization. Or a policy can be defined that gives systems or clusters doing mission-critical work priority for bandwidth over less important clusters or systems. The role of the Lustre QoS system is not to determine an appropriate set of policies but to provide capabilities that allow policies to be defined and enforced.

Two components proposed for the Lustre QoS scheduler are a global Epoch Handler (EH) and a Local Request Scheduler (LRS). The EH provides a shared time slice among all servers. This time slice can be relatively large (one second, for example) to avoid overhead due to excessive server-to-server networking and latency. The LRS is responsible for receiving and queuing requests according to a local policy. Together, the EH and LRS allow all servers in a cluster to execute the same policy during the same time slice. Note that the policy may subdivide the time slices and use the subdivision advantageously. The LRS also provides summary data to the EH to support global knowledge and adaptation.

Figure 9 shows how these features can be used to schedule rendering and visualization of streaming data. In this implementation, the LRS policy allocates 30 percent of each Epoch time slice to visualization and 70 percent to rendering.

[Figure 9. Using server-driven QoS to schedule video rendering and visualization: Epoch messaging coordinates the OSSs so that, within every Epoch time slice, 70 percent of the slice serves the rendering cluster and 30 percent serves the visualization cluster.]

A router control plane

Lustre technology is expected to be used in vast worldwide file systems that traverse multiple Lustre networks with many routers. To achieve wide-area QoS guarantees that cannot be achieved with static configurations, the configurations of these networks must change dynamically. A control interface is required between the routers and external administrative systems to handle these situations. Requirements are currently being developed for a Lustre Router Control Plane to address these issues.

For example, features are being considered for the Lustre Router Control Plane that could be used when data packets are being routed from A to B and also from C to D and, for operational reasons, preference needs to be given to the traffic from C to D. The control plane would apply a policy to the routers so that packets from C to D are sent before packets from A to B. The Lustre Router Control Plane may also include the capability to provide input to a server-driven QoS subsystem, linking router policies with server policies. It would be particularly interesting to have an interface between the server-driven QoS subsystem and the router control plane to allow coordinated adjustment of QoS in a cluster and in a wide area network.

Asynchronous I/O

In large compute clusters, the potential exists for significant I/O optimization. When a client writes large amounts of data, a truly asynchronous I/O mechanism would allow the client to register the memory pages to be written for RDMA and allow the server to transfer the data to storage without causing interrupts on the client. This makes the client CPU fully available to the application again, which is a significant benefit in some situations.

[Figure 10. Network-level DMA with handshake interrupts (left) and without handshake interrupts (right). In both cases the source node's LNET and LND send a message description to the sink node and the data moves by RDMA, generating an event on completion. With the handshake, the sink registers its buffer and returns a DMA address before the source registers its buffer and the transfer starts; without the handshake, the source registers its buffer and sends the description together with the source RDMA address, so no separate address exchange is needed.]

LNET supports RDMA; however, a handshake at the operating system level is currently required to initiate the RDMA, as shown on the left in Figure 10. The handshake exchanges the network-level DMA addresses to be used. The proposed change to LNET would eliminate the handshake and include the network-level DMA addresses in the initial request to transfer data, as shown on the right in Figure 10.

Chapter 5 Conclusion

LNET provides an exceptionally flexible and innovative infrastructure. Among the many features and benefits that have been discussed, the most significant are:

- Native support for all commonly used HPC networks
- Extremely fast data rates through RDMA and unparalleled TCP throughput
- Support for site-wide file systems through routing, eliminating the staging and copying of data between clusters
- Load-balancing router support to eliminate low-speed network bottlenecks

Lustre networking will continue to evolve, with planned features to handle link aggregation, server-driven QoS, a rich control interface for large routed networks, and asynchronous I/O without interrupts.

Lustre Networking
On the Web: sun.com

Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, CA 95054 USA
Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com

© 2008 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Lustre, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Information subject to change without notice. SunWIN #524780 Lit. #SYWP13913-1 11/08