Integration Path for Intel Omni-Path Fabric attached Intel Enterprise Edition for Lustre (IEEL) LNET


Table of Contents

Introduction
Architecture for LNET
Integration
    Proof of Concept routing for multiple fabrics
    Ko2iblnd settings
    Client Mounting
    Production LNET routing for CSD3
Performance Tuning
Performance Tests and Results
Summary
Glossary
Appendix A
    LNET Networks and corresponding network types
    LNET Routing Configuration
    Lustre Server Configuration
    Lustre Client Configuration (EDR)
    Lustre Client Configuration (OPA)

Introduction

As High Performance Computing centres grow, data centre infrastructure becomes more complex as new and older services are integrated. This can increase the number of server, storage and network technologies that are connected together, making it critical for the successful operation of services that they work together seamlessly. A centre's growth is also reflected in the extension of its service portfolio, which puts pressure on the provision of flexible and scalable platforms, especially storage. Storage requirements increase, often doubling the existing capacity with each new service adopted by an HPC centre. It is often desirable to have a common storage infrastructure that can be accessed from each of the services provided to users. Allowing users to migrate data effectively across different systems can be challenging, and creates a risk of duplicating data and wasting storage space, as well as placing undue stress on network resources.

In the case of the University of Cambridge Research Computing Service (RCS), a new set of supercomputing resources has recently been procured and installed for the growing needs of science, both at the University and nationally within the UK. The Cambridge Service for Data Driven Discovery (CSD3) provides three new supercomputing resources alongside the existing Intel CPU and dedicated GPU supercomputers. The RCS has made use of Lustre parallel file systems for most of its main resources, and they have been the backbone for providing high performance, scalable storage across all research computing platforms. Lustre filesystems support high performance networks such as Ethernet, InfiniBand and Intel Omni-Path Fabric. The older HPC service has five Lustre filesystems providing 5PB of storage, and CSD3 introduces an additional five Lustre filesystems, providing the service with another 5PB of storage space. The new storage platform has been designed and deployed with the intention of allowing both old and new systems to mount the new filesystems, so that users can migrate and consume data as they switch between CSD3 and the existing resources. In order to take advantage of platform-specific features at the time of acquisition, the CSD3 GPU system (Wilkes-2 in Figure 1 below) uses Mellanox EDR InfiniBand, while the new Intel Xeon Phi and Intel Xeon Gold 6142 CPU resources use the Intel Omni-Path Fabric.

The goal of building a common Lustre storage system that can be accessed over HPC fabrics of different generations and technologies is achieved through the use of LNET routing. LNET routing allows the RCS to expand beyond the confines of the existing FDR InfiniBand fabric by translating between fabrics. Services on Intel Omni-Path Fabric, EDR/FDR InfiniBand and Ethernet can now consume existing and new Lustre storage. For example, a user on CSD3 can now write files to Lustre and launch a visualisation instance in the RCS OpenStack cloud, accessing Lustre storage concurrently without being aware of the underlying infrastructure and placement. LNET routing is not only useful for joining dispersed supercomputing resources; LNET routers can also be deployed in the same way as conventional Ethernet routers, so that Lustre traffic can traverse multiple hops of a complicated networking infrastructure, allowing for fine-grained routing as scientific computing progresses beyond petascale systems.

Architecture for LNET

Figure 1: High-level diagram of the University of Cambridge Research Computing Services estate integrating LNET routers (the diagram shows the Skylake and Intel Xeon Phi x200 systems on Intel Omni-Path, the Wilkes-2 EDR fabric, Darwin, Wilkes, the Research Data Store, the CSD3 storage, the general Lustre1-5 filesystems, and the Ethernet-attached OpenStack and hosted clusters over a mix of interconnects)

CSD3 incorporates two distinct processor technologies, Intel Xeon Gold 6142 (internally referred to as Skylake) and Intel Xeon Phi x200 for the Intel architecture systems, together with an NVIDIA P100 GPGPU system known as Wilkes-2, all underpinned by multiple Lustre filesystems attached to Intel Omni-Path. While the Intel systems use Intel Omni-Path directly, Wilkes-2 uses EDR InfiniBand as its fabric, and this presents the integration challenge that the LNET routers address.

Figure 1 shows a high-level view of the current RCS estate. The LNET routers shown in the centre of the diagram provide a translation layer between the different types of network fabric, allowing Lustre access across all systems within the RCS where convergence on one type of interconnect is not possible. Figure 2 shows the LNET routers that connect the Intel Omni-Path Fabric-attached storage and servers to Wilkes-2, providing it with access to the common storage. An LNET router does not mount the Lustre file system; it merely forwards traffic from one network to the other. In this example two LNET routers load balance traffic and act as an active-active failover pair for Lustre traffic. Additional nodes can be added throughout the network topology to balance network routes across systems. Details on load balancing can be found in the Intel user guide for LNET [1]. Current production services, such as the Darwin CPU cluster and the existing Wilkes GPU cluster, connect to the LNET routers over FDR. This flexibility allows users to migrate their data over a high-speed interconnect as they transition to the new service.

Figure 2: LNET detail showing a pair of routers between Peta4 and Wilkes-2

Integration

Before progressing with the deployment of a production LNET service, an initial experimental routing set-up was completed. This concept demonstrator was then used to aid the construction of the production LNET routing within CSD3.

Proof of Concept routing for multiple fabrics

When integrating LNET, it is best to map out the LNET for each of the fabric or TCP networks that will connect to the router. Each fabric must have its own o2ib (or tcp) network tag in order to distinguish between the fabrics. Table 1 shows an example from the concept demonstrator system:

Fabric Type              LNET Network tag   Router IP        IP Subnet
Intel Omni-Path Fabric   o2ib0              192.168.0.254    192.168.0.0/24
InfiniBand               o2ib1              10.144.60.230    10.144.0.0/16
Ethernet                 tcp0               172.10.2.254     172.10.2.0/24

Table 1: Example LNET layout

All Lustre clients must define the list of LNET network tags and the address of the router on their respective fabric. A compute node on o2ib0 would have the following router definition within its /etc/modprobe.d/lustre.conf:

options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.0.254@o2ib0" \
    live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60

Figure 3: Example compute node router configuration

Lustre servers (MDS/MGS and OSS nodes) should define similar configurations in reverse: they must know about all of the LNET fabrics from which clients will wish to mount Lustre. The test system used a set of Lustre storage servers built within an OpenStack tenant to allow for quick development. Again the server /etc/modprobe.d/lustre.conf is shown:

options lnet networks="tcp0(em2)" routes="o2ib0 172.10.2.254@tcp0; o2ib1 172.10.2.254@tcp0" \
    live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60

Figure 4: Example Lustre server route configuration
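After editing lustre.conf on a client or server in the concept demonstrator, the LNET modules need to be reloaded for the new options to take effect, and the local NIDs can then be checked. The commands below are a minimal sketch of that step, assuming no Lustre filesystem is currently mounted on the node; they use standard Lustre utilities and are not taken from the CSD3 job scripts.

    # Unload any previously loaded Lustre/LNET modules, then bring LNET up with the new options
    lustre_rmmod
    modprobe lnet
    lctl network up

    # Confirm that the node now reports a NID on the expected LNET network (e.g. o2ib0 or tcp0)
    lctl list_nids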

Each node can define multiple routes using a bracketed list of IP addresses within the module configuration:

routes="o2ib0 172.10.2.254@tcp0; o2ib1 172.10.2.[251,252,253,254]@tcp0"

Figure 5: LNET route expressing multiple routers

This tells the LNET server or client that, in order to reach the fabric o2ib1, any of the gateways 172.10.2.251@tcp0 to 172.10.2.254@tcp0 can be used. Further settings tell the nodes how to treat a router in the event that Lustre RPCs cannot be successfully routed. When implementing LNET routing it is important to think in terms of Lustre traffic rather than standard ICMP packets: while the network port might be up in the traditional sense, if an lctl ping fails, or if there is no endpoint, each LNET router will mark the route as down. The status of the available routing paths can be viewed using lctl route_list, as shown below:

[root@lnet-mds-0 ~]# lctl route_list
net o2ib0 hops 4294967295 gw 172.10.2.254@tcp up pri 0
net o2ib1 hops 4294967295 gw 172.10.2.254@tcp up pri 0

Figure 6: Output of lctl route_list showing the status of available routing paths for LNET traffic

Router nodes receive the following configuration to set the node's LNET to route traffic between fabrics:

options lnet networks="o2ib0(ib0),o2ib1(ib1),tcp0(em2)" forwarding="enabled"

Figure 7: LNET router node configuration

The routing options shown after each network definition are presented as sensible defaults. They ensure that, should a router go down, clients and servers can mitigate the issue while the system administrator remediates the situation.

Ko2iblnd settings

The ko2iblnd module should have the same settings on all participating LNET nodes. Due to compatibility issues between the Mellanox (mlx) and Intel Omni-Path Fabric drivers, users may need to increase the value of the map_on_demand option to 256, depending on the version of Lustre used. From Lustre 2.10 this can be varied with dynamic LNET configuration.

options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1

Figure 8: ko2iblnd settings for LNET estates that mix Intel Omni-Path Fabric and Mellanox InfiniBand

Client Mounting

Clients on the respective fabrics mount the filesystem from the address of the MGS (Lustre Management Server) as normal. Any errors can be found in the kernel log messages.
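To make the Client Mounting step concrete, a routed client simply mounts the filesystem by the NID of the MGS on the server-side network; the LNET routers forward the traffic transparently. The NID, filesystem name and mount point below are illustrative placeholders rather than the CSD3 values.

    # Mount from an o2ib0 client; 172.10.2.10@tcp0 stands in for the MGS NID on the server network
    mount -t lustre 172.10.2.10@tcp0:/lustre /mnt/lustre

    # If the mount hangs or fails, the kernel log will show the LNET/Lustre errors
    dmesg | tail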

Production LNET routing for CSD3

Figure 9: CSD3 storage networking for LNET

Figure 9 shows the topology of the LNET storage network as it is currently deployed for the CSD3 early access release. On the left-hand side, the systems are connected to an Intel Omni-Path network; on the right side of the LNET routers, the GPU cluster Wilkes-2 and the existing HPC infrastructure are connected. At present only two of the routers contain all three types of fabric card, so that the existing FDR network can reach the new CSD3 Lustre storage.

The current implementation for CSD3 deploys Lustre 2.7 as provided by Intel Enterprise Edition for Lustre 3.1, with backports from 2.10 applied. For future deployments it is recommended that users deploy Lustre 2.10, as it contains the most current patches and features required.

Migrating from an existing FDR infrastructure to Intel Omni-Path Fabric and EDR requires some careful configuration. During testing of different configurations, converging the FDR and EDR networks into a single fabric resulted in a loss of both fabrics. The exact cause of this loss of network is not currently understood, and converging the fabrics is therefore not advised. This requires, in turn, that two of the router servers are able to use the Intel Omni-Path Fabric driver and the EDR/FDR mlx4 and mlx5 drivers concurrently. Each LNET router uses the latest Red Hat Enterprise Linux release (7.3 at the time of writing) and installs the Lustre client kernel upgrades and the Intel Omni-Path Fabric installation package. It is not advised to run the Mellanox OFED package when combining cards: each package supplies libraries that can be replaced by the other, preventing optimal operation. The router must therefore only make use of the Intel Omni-Path Fabric-provided software and the RDMA packages supplied by a supported Linux distribution. The standard ibtools package can be installed to verify the correct operation of the EDR/FDR cards. EDR and FDR endpoints on the compute infrastructure use the latest packages provided by Mellanox.

By applying a patch from Lustre 2.10 ("don't page-align a remote address with FastReg") to IEEL 3.1 (Lustre 2.7), the routers can make use of the new lnetctl dynamic network configuration. Support for this is improved in Lustre 2.10 and makes integrating mlx5 EDR cards with Intel Omni-Path smoother. This method of configuring LNET replaces the ko2iblnd.conf and lnet.conf files found in the modprobe.d directory. For CSD3 it is only used on the LNET routers; Lustre clients and servers continue to make use of the existing modprobe.d configuration files. An example of the full dynamic LNET configuration for an LNET router using all three fabric types is provided in Appendix A. Two of the eight servers require the configuration of all three cards (in the Cambridge configuration). Due to the length of these files, only the important tuneable settings are shown below.
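For readers unfamiliar with the dynamic configuration mentioned above, the sketch below shows the general shape of configuring an LNET router with lnetctl instead of modprobe options. The interface names and credit values are illustrative assumptions, not the CSD3 values; the full router configuration is given in Appendix A, and on CSD3 the exported YAML is loaded at boot by a systemd unit.

    # Initialise LNET and add each fabric as its own network (illustrative interfaces and tunables)
    lnetctl lnet configure
    lnetctl net add --net o2ib0 --if ib0 --peer-credits 128 --credits 1024
    lnetctl net add --net o2ib1 --if ib1 --peer-credits 128 --credits 1024
    lnetctl net add --net o2ib2 --if ib2 --peer-credits 128 --credits 1024

    # Turn this node into an LNET router by enabling forwarding between its networks
    lnetctl set routing 1

    # Save the running configuration as YAML so it can be re-imported at boot
    lnetctl export > /etc/lnet.conf
    lnetctl import /etc/lnet.conf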

Performance Tuning

Performance tuning can be achieved firstly through the dynamic LNET configuration found in Lustre 2.10 and above, and secondly through Accelerated RDMA within Intel Omni-Path Fabric, which should be enabled to get the best performance from Intel Omni-Path network links. Further Intel Omni-Path performance enhancements, such as adaptive routing, may be considered but were not tested during the writing of this paper. In this deployment the existing ko2iblnd settings provide the defaults for LNET, and the dynamic LNET configuration YAML file is loaded by a systemd unit file, which then applies the per-interface parameters.

Performance Tests and Results

The IOR test results are shown for two API configurations, POSIX and MPI-IO. A baseline reference test was performed on the Intel Xeon Phi x200 system to provide a result for applications running on the same fabric as the storage, and then on Wilkes-2 to measure any reduction in performance introduced by the LNET routers. Each test was run at node counts of 1, 6, 12, 24, 48 and 72, with two sets of tests: one MPI rank per node, and then twelve MPI ranks per node (one per x86 core of a Wilkes-2 node). Tests were performed both for a single shared file striped across all OSTs and for file-per-process I/O to an unstriped directory. These tests were performed using the IOR benchmark programme (an illustrative invocation is sketched after the figures below), with each node writing 64GB, or 5GB per rank in the case of 12 MPI ranks. 64GB closely matches the amount of data all four GPUs in a node could hold, so this test is a reasonable approximation of a user writing out the contents of all the GPUs.

The results shown in Figures 10 to 13 are for each of the APIs over the same number of nodes for the Intel Xeon Phi x200 and GPU systems. Dashed lines show read performance, and solid lines indicate write performance. Both APIs show similar performance for the same number of MPI ranks at similar core counts. At higher core and node counts, MPI-IO performs better than the standard POSIX API. For both APIs, the GPU and Intel Xeon Phi x200 systems each achieved an approximate read performance of 12GiB/s and a write performance of 8GiB/s. As each OSS provides around 3GiB/s and a single Lustre filesystem contains 24 OSTs, the maximum bandwidth of each Lustre filesystem is limited by its storage back end.

A second test was performed from the GPU system to see whether the LNET routers limited the underlying SAS performance of the Lustre storage nodes. IOR is capable of running a test over multiple Lustre filesystems, so two of the Lustre filesystems attached to CSD3 were used; directories were set to unstriped and initially the same file-per-process tests as before were run. An initial test showed improved performance over a single filesystem; however, at larger node counts the results were unrealistic, as the amount of data per MPI process became too small. The test was therefore re-run with double the amount of data written, and the results shown in Figures 14 and 15 present improved performance over a single Lustre filesystem. The Intel Xeon Phi x200 system appears slower, but this is due to its lower clock speed compared with the processors of Wilkes-2.

Figure 10: Intel Xeon Phi x200 Shared File Performance in MiB/s

Figure 11: Intel Xeon Phi x200 File Per Process Performance in MiB/s

Figure 12: GPU Shared File Performance in MiB/s

Figure 13: GPU File Per Process Performance in MiB/s

Figure 14: GPU Multiple Lustre Performance in MiB/s

Figure 15: KNL Multiple Lustre Performance in MiB/s
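As referenced above, tests of this kind can be reproduced with IOR invocations of roughly the following form; the directory paths, rank count and launcher details are illustrative assumptions rather than the exact CSD3 job scripts.

    # Shared-file test: stripe the target directory across all OSTs, then run MPI-IO with 12 ranks per node
    lfs setstripe -c -1 /lustre1/ior-shared
    mpirun -np 288 ior -a MPIIO -w -r -e -b 5g -t 1m -o /lustre1/ior-shared/testfile

    # File-per-process test to an unstriped (single-stripe) directory using the POSIX API
    lfs setstripe -c 1 /lustre1/ior-fpp
    mpirun -np 288 ior -a POSIX -F -w -r -e -b 5g -t 1m -o /lustre1/ior-fpp/testfile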

Summary

The introduction of LNET routing to Research Computing Services has performed within the expectations of the current design. LNET routing has not been shown to degrade I/O performance to a degree that would affect programmes run on the GPU service (routed via InfiniBand to Intel Omni-Path networked Lustre servers) when compared with the Intel Xeon Phi x200 I/O performance. Research Computing Services can now offer greater flexibility as more services consume Lustre, irrespective of fabric type. The multiple-filesystem performance tests show there was no degradation in application I/O from the LNET routers; Lustre storage administrators are limited only by the chosen disk technologies, disk host adapters (e.g. SAS) and the size of each Lustre filesystem. To extract more performance from Lustre storage, SSDs are being investigated as a way of improving Lustre capacity and performance through converged or tiered solutions. Further development is planned to test the overall performance of older InfiniBand and Ethernet fabrics for application I/O, which should help to provide training materials for making the best use of Lustre when considering improvements to application I/O.

Glossary

Term      Description
EDR       Enhanced Data Rate InfiniBand
FDR       Fourteen Data Rate InfiniBand
KNL       Intel Xeon Phi x200 processor
lnetctl   LNET command and configuration program
lctl      Lustre control utility
MDS       Metadata Server for Lustre
MGS       Management Server for Lustre. Usually run with a Metadata Server
o2ib      Named identifier of the LNET subnet

References

[1] Intel Omni-Path IP and Storage Router. Intel, 2017.

Appendix A

LNET Networks and corresponding network types:

o2ib0   FDR network for existing HPC services
o2ib1   OPA network for Peta4 and new Lustre storage
o2ib2   EDR network for Wilkes-2

LNET Routing Configuration:

o2ib0: peer_credits=128 peer_credits_hiw=127 credits=1024 concurrent_sends=64 map_on_demand=0 fmr_pool_size=512 fmr_flush_trigger=384 fmr_cache=1
o2ib1: peer_credits=128 peer_credits_hiw=127 credits=1024 concurrent_sends=256 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
o2ib2: peer_credits=128 peer_credits_hiw=127 credits=1024 concurrent_sends=256 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1

Lustre Server Configuration:

options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
options lnet networks="o2ib1(ib0),tcp2(em1.43)" routes="o2ib2 1 10.47.240.[161-168]@o2ib1; o2ib0 1 10.47.240.[167-168]@o2ib1" auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=60

Lustre Client Configuration (EDR):

options ko2iblnd-mlx5 peer_credits=128 peer_credits_hiw=127 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
options lnet networks="o2ib2(ib0)" routes="o2ib1 1 10.44.240.[161-168]@o2ib2; o2ib0 1 10.44.240.[167-168]@o2ib2" auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=60

Lustre Client Configuration (OPA):

options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
options lnet networks="o2ib1(ib0)" routes="o2ib2 1 10.47.240.[161-168]@o2ib1; o2ib0 1 10.47.240.[167-168]@o2ib1" auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=60

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, the Intel logo, Intel Xeon, Intel SSD DC S3610, Intel SSD DC S3710, and Intel SSD DC P3600 are trademarks of Intel Corporation in the U.S. and/or other countries.

© 2017 Dell EMC, All rights reserved. Dell EMC, the Dell EMC logo and products as identified in this document are registered trademarks of Dell, Inc. in the U.S.A. and/or other countries. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording for any purpose without the written permission of Dell, Inc. ("Dell"). Dell EMC disclaims proprietary interest in the marks and names of others. Dell EMC service offerings do not affect customer's statutory rights. Availability and terms of Dell EMC Services vary by region. Terms and Conditions of Sales, Service and Finance apply and are available on request or at Dell.co.uk/terms. THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. Dell Corporation Limited. Registered in England. Reg. No. 02081369 Dell House, The Boulevard, Cain Road, Bracknell, Berkshire, RG12 1LF, UK.