Implementing Storage in Intel Omni-Path Architecture Fabrics


White Paper
Implementing Storage in Intel Omni-Path Architecture Fabrics, Rev 2
A rich ecosystem of storage solutions supports Intel Omni-Path

Executive Overview

The Intel Omni-Path Architecture (Intel OPA) is the next-generation fabric architected to deliver the performance and scaling needed for tomorrow's high performance computing (HPC) workloads. A rich ecosystem of storage offerings and solutions is key to enabling the building of high performance Intel OPA-based systems. This white paper describes Intel OPA storage solutions and discusses the considerations involved in selecting the best storage solution. It is aimed at solution architects and those interested in understanding native Intel OPA high performance storage or in connecting legacy storage solutions.

Overview

A system with an Intel OPA-based network fabric often requires connectivity to a parallel file system, enabling the Intel OPA connected compute nodes to access the file system storage in the most optimal way. These storage solutions involve storage servers, routers, block storage devices, storage networks, hierarchical storage management and parallel file systems. This overview discusses the components that make up the storage solutions and describes some typical configurations.

Table of Contents

Executive Overview
Overview
Components
Configurations
Intel OPA Storage Software and Considerations
Intel OPA HFI Coexistence with Mellanox* InfiniBand HCA
Intel OPA Storage Solutions
Interoperability with Existing Storage
Dual-homed
LNet Router
Scoping LNet Routers
IP Router

Components

This section describes the terminology that will be used in this paper to discuss the components of a storage solution.

Figure 1: Storage Components (client side connection: IB, Intel OPA, Ethernet, etc.; storage servers running Lustre*, GPFS / IBM Spectrum Scale*, NFS, etc.; storage connection: IB, FC, SAS, etc.)

The client side connections are the connections to the compute cluster fabric.

Storage Servers

The storage servers run the file system server software, such as Lustre* Object Storage Server (OSS) software or IBM Spectrum Scale* / General Parallel File System (GPFS) Network Shared Disk (NSD) software. Storage servers can take different forms. They are often implemented as standalone Linux* servers with adapter cards for connectivity to both the client side connections and the storage connections. These are sometimes productized by storage vendors, and in some cases the storage servers are integrated into an appliance offering.

Figure 2: Storage Server Configurations (standalone servers or appliance)

Storage Devices

The storage connection and storage devices generally take the form of a block storage device offered by the storage vendors. It is expected that there will be block storage devices with Intel OPA storage connections; however, this is not critical to enabling Intel OPA storage solutions. The client side connection is the focus for Intel OPA enablement.

Configurations

The connections from the storage servers to the Intel OPA fabric can take different forms, depending on the requirements of the system installation.

- Direct attached: the storage servers are directly attached to the Intel OPA fabric with Intel OPA adapter cards in the storage servers.
- Dual-homed: the storage servers are directly attached to the Intel OPA fabric and to another fabric, typically InfiniBand* (IB) or Ethernet. Adapter cards for both fabrics are installed in the storage servers.
- Routed: the storage servers are connected to the Intel OPA fabric through routers that carry traffic between the Intel OPA fabric and the client side connection of the storage servers, typically InfiniBand or Ethernet.

The direct attached solution is usually found with new system acquisitions, where the best option is to provide a native fabric interface between compute and storage. The dual-homed and routed configurations are typically used to provide connectivity to legacy storage or to share storage across multiple clusters with different fabrics.

Figure 3: Storage Configurations Overview (direct-attached, routed via a legacy cluster, and dual-homed file system servers connected to compute nodes CN 0 through CN n on the Intel OPA fabric)

Intel OPA Storage Software and Considerations

Intel OPA Host Software

Intel's host software strategy is to utilize the existing OpenFabrics Alliance interfaces, thus ensuring that today's application software written to those interfaces runs with Intel OPA with no code changes required. This immediately enables an ecosystem of applications to just work. ISVs may over time choose to implement changes to take advantage of the unique capabilities present in Intel OPA to further optimize their offerings. All of the Intel Omni-Path host software is open source. Intel is working with major operating system vendors to incorporate Intel OPA support into future releases. Prior to being in-box with these distributions, Intel will release a delta package to support Intel OPA. The Intel software will be available on the Intel Download Center: https://downloadcenter.intel.com

Table 1: Intel OPA Linux* Support (Linux* distribution / versions supported)

RedHat - RHEL 6.7, RHEL 7.2 or newer
SuSE - SLES 12 SP1 or newer
CentOS - CentOS 6.7, CentOS 7.2 or newer
Scientific Linux - Scientific Linux 7.2 or newer

Note: Check with your OS vendor to ensure CPU support.

Table 2: Intel OPA Lustre* Support (Lustre* distribution / versions supporting Intel OPA)

Community - 2.8 or newer
Intel Foundation Edition - 2.7.1 or newer
Intel Enterprise Edition - 2.4 (client support only); 3.0 or newer (client and server)

Table 3: Intel OPA & IBM Spectrum Scale (formerly GPFS) Support (software / version supporting Intel OPA)

IBM Spectrum Scale (formerly GPFS) over IP - Supported
IBM Spectrum Scale (GPFS) over RDMA with Intel OPA - Version 4.2 and beyond

Intel OPA HFI Coexistence with Mellanox* InfiniBand HCA

In a dual-homed file system server, or in a Lustre Networking (LNet) or IP router, a single OpenFabrics Alliance (OFA) software environment supporting both an Intel OPA HFI and a Mellanox* InfiniBand HCA is required. The OFA software stack is architected to support multiple targeted network types. Currently, the OFA stack simultaneously supports iWARP for Ethernet, RDMA over Converged Ethernet (RoCE), and InfiniBand networks, and the Intel OPA network has been added to that list. As the OS distributions implement their OFA stacks, each will be validated to simultaneously support both Intel OPA Host Fabric Adapters and Mellanox Host Channel Adapters. Intel is working closely with the major Linux distributors, including Red Hat* and SUSE*, to ensure that Intel OPA support is integrated into their OFA implementations. At the time of writing, Red Hat 7.3 and SLES 12 SP2 have integrated OFA with Intel OPA support, simultaneously supporting Mellanox InfiniBand and Intel OPA. The Mellanox OFED drivers do not coexist with other OFA-supported hardware; as a result, the OS distribution OFA or the Intel OFA delta release should be used. Even when support is present, it may still be advantageous to update the OFA software to resolve critical issues. Linux distribution support is provided by the operating system vendor. Operating system vendors are expected to provide the updates necessary to address issues with the OFA stack that must be resolved prior to the next official Linux distribution release. This is the way that software drivers for other interconnects, such as Ethernet, work as well.

With Lustre versions prior to 2.9.0, the software doesn't yet have the ability to manage more than one set of Lustre network settings in a single node. There is a patch to address this capability that is being tracked by https://jira.hpdd.intel.com/browse/LU-7101. With QDR and FDR InfiniBand, there are settings that work well for both IB and Intel OPA. With Enhanced Data Rate (EDR) InfiniBand, there isn't a set of settings that works well for both the InfiniBand and the Intel OPA devices. Therefore, coexistence of Intel OPA and EDR InfiniBand isn't recommended until that Lustre patch is available. The patch has been incorporated into Lustre source builds, and is also included in Intel Enterprise Edition for Lustre version 3.0.1 and all later releases, as well as community releases.

Intel OPA Storage Solutions

Intel OPA Direct Attached Storage

High performance file systems with connectivity to an Intel OPA compute fabric, including Lustre*, BeeGFS* and IBM Spectrum Scale* (formerly GPFS), are a core part of end-to-end Intel OPA solutions. When an Intel OPA based system requires new storage, there are several options available. The Omni-Path Fabric Builders catalog (https://fabricbuilders.intel.com) contains details on the ecosystem of partner offerings as well as partner contact names, including storage and storage solution providers.

OEM/Customer built

OEMs or end customers put together the file system themselves, procuring block storage from storage vendors, selecting an appropriate server, and obtaining the file system software either from the open source community or from vendors that offer supported versions. This option is straightforward with Intel OPA.
See the Intel OPA Host Software section above for information about OS and file system software versions compatible with Intel OPA.

Storage vendor offering

In some cases, the complete file system solution is provided by a system or storage vendor. These complete solutions can take the form of block storage with external servers running the file system software, or fully integrated appliance-type solutions.

Interoperability with Existing Storage

In some cases, when a new Intel OPA-based system is deployed, there are requirements for access to existing storage. There are three main options to be considered:

- Upgrade the existing file system to Intel OPA
- Dual-home the existing file system to Intel OPA
- LNet and IP router solutions

Site-specific factors, such as throughput requirements, how long the existing storage will remain in production, and even the distance between solutions, need to be reviewed.

If the existing file system can be upgraded to support Intel OPA connections, this is often the best solution. For cases where the existing storage will only be accessed by the Intel OPA based system, this upgrade can take the form of replacing existing fabric adapter cards with Intel OPA adapter cards, if supported by the server and storage vendor. Software and operating system upgrades may also be required. Contact your storage vendor to help with this upgrade and to ensure the existing storage can be upgraded to support Intel OPA.

In other cases, the existing file system will need to continue to be accessed from an existing (non-Intel OPA) cluster and will also require access from the new Intel OPA-based system. In these cases, the file system can be upgraded to be dual-homed by adding Intel OPA host adapters to the file system servers, again keeping in mind the hardware, OS and software requirements for all components.

In some cases, dual-homing the storage is not possible. This can happen when the file servers are older and cannot support the newer hardware, or when the effort, risk and downtime required to update the file system software outweigh the benefits provided by dual-homing the solution. In these cases, a router-based solution can solve the interoperability challenge. For Lustre file systems, the LNet router component of Lustre can be used for this purpose. For other file systems, such as IBM Spectrum Scale* (GPFS) or NFS, the Linux IP router can be used.

Table 4: Storage Options Considerations

Direct-attached
Pros: Excellent bandwidth; predictable performance; no additional system complexity
Cons: Legacy cluster may need a router to access the storage

Dual-homed
Pros: Excellent bandwidth; predictable performance
Cons: Downtime to update the OS and driver stack and to install hardware; legacy hardware may not support OPA; updates may be viewed as too risky by storage administrators

Routed solution
Pros: Easy to add to existing storage; minimal to no downtime of existing storage
Cons: Bandwidth requirements may mean that multiple routers are required; complexity of managing extra pieces in the system

Dual-homed

In the dual-homed approach, an Intel OPA connection is provided directly from the file system server, providing the best possible bandwidth and latency solution. This is a good option when:

- The file system servers have a PCIe slot available to add the Intel OPA adapters and meet Intel OPA hardware requirements
- The file system servers utilize OS and file system software versions compatible with Intel OPA, or can be upgraded to do so

For OEMs and customers who have built the file system themselves, this solution will be supported through the OS and file system software arrangements that are already in place. When the file system solution was provided by a storage vendor, that vendor can be engaged to perform and support the upgrade to dual-homed operation.

LNet Router

The LNet router is a standard component of the Lustre stack. It is specifically designed to route traffic natively from one network type to another, with the ability to perform load balancing and failover. To facilitate the implementation of Lustre routers in Intel OPA deployments, a validated reference design recipe is provided in the Router Design Guide. This recipe provides instructions on how to implement and configure LNet routers to connect Intel OPA and InfiniBand fabrics. Some vendors have also developed their own LNet routers, while others may redistribute an Intel LNet router.

The Lustre software supports dynamic load sharing between multiple targets. This is handled in the client, which has a router table and does periodic pings to all of its end points to check status. This capability is leveraged to provide load balancing across multiple LNet routers: round-robin load sharing is performed transparently. This capability also provides for failover, because in the event of an LNet router failure, the load is automatically redistributed to the other available routers.
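As a concrete illustration of the routed approach described above, the following is a minimal LNet configuration sketch using the classic lnet module options. It is not the validated recipe from the Router Design Guide; the o2ib network numbers, IPoIB interface names (ib0, ib1) and gateway NIDs are placeholder assumptions.

    # LNet router: one IB HCA on LNet network o2ib0, one Intel OPA HFI on o2ib1
    # (for example, in /etc/modprobe.d/lustre.conf)
    options lnet networks="o2ib0(ib0),o2ib1(ib1)" forwarding="enabled"

    # Intel OPA compute clients: reach the IB-side Lustre servers (o2ib0) via the router's OPA NID
    options lnet networks="o2ib1(ib0)" routes="o2ib0 192.168.20.1@o2ib1"

    # Legacy IB file system servers: reach the OPA clients (o2ib1) via the router's IB NID
    options lnet networks="o2ib0(ib0)" routes="o2ib1 192.168.10.1@o2ib0"

When more than one router is deployed, the routes entries on the clients and servers can list the NIDs of all routers serving the remote network (with equal hop counts), which is what enables the transparent round-robin load sharing and failover behavior described above.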

Table 5: LNet Router Hardware Recipe (hardware / recommendation)

CPU - Intel Xeon E5-2640 v4 (2.40 GHz), 10 core, Hyper-Threading disabled
Memory - 16 to 32 GB RAM per node
Server platform - 1U rack server or equivalent form factor with two x16 PCIe* slots
Intel OPA connection - One x16 HFI
IB/Ethernet connection - Mellanox FDR, Mellanox EDR, or Ethernet

Table 6: LNet Router Software Recipe (software / recommendation)

Base OS - RHEL 7.2 + Intel OPA delta distribution, OR SLES 12 SP1 + Intel OPA delta distribution
Lustre* - Community version 2.8, OR IEEL 2.4 or newer, OR FE 2.7.1 or newer

Note: The LNet Router Hardware Recipe (Table 5) and the LNet Router Software Recipe (Table 6) information is preliminary and based upon configurations that have been tested by Intel to date. Further optimization of CPU and memory requirements is planned.

Figure 7: LNet Router (compute nodes CN 0 through CN n on the Intel OPA compute fabric connect through OPA to LNet routers, and over IB to the existing storage infrastructure and servers)

Scoping LNet Routers

Some scenarios will require more than a single LNet router to provide appropriate throughput to the existing storage system. Whether developing your own LNet router configuration or making use of packaged solutions available on the market today, there are a few considerations and rules of thumb that can be used for architecting an appropriate solution:

- An LNet router is capable of approximately 80% of the throughput of the slowest network type. In practice, LNet routers tend toward linear scaling when the routers are designated as equal priority and equal hops in the client fabric. Note that this relies upon the total capability of the backend file system, bulk data transfers and large messages.
- Attempt to have the network cards use the same PCIe bus sub-system.

Intel benchmarked numerous LNet router configurations, including changing the CPU frequency from 2.6 GHz to 1.2 GHz to measure the effect it had on performance, while also providing a baseline for individual LNet router performance. The setup used to collect these results had FDR and EDR compute nodes connecting through an LNet router to an OPA-based Lustre solution. The router remained the point of network contention, and the results should hold true if the storage were InfiniBand based and the OPA compute nodes were attempting to read/write to the storage. Some of the IOR results can be found in Table 7 below for a server with an Intel Xeon E5-2697A v4, 2.6 GHz, 16 cores, with the CPU clock speed changed to 2.0 GHz. For the OPA/FDR results the PCIe adapters were connected to the same socket; for the OPA/EDR results the PCIe adapters were connected to

different sockets. What we found was that the CPU speed had only minor effects on performance and did not play a direct role in the throughput the LNet router was capable of providing.

Table 7: LNet Router IOR Performance [1] (LNet router networks / write performance (IOR) / read performance (IOR))

OPA and Mellanox FDR [2] - Write: 6.1 GB/s with 1 client node; 9.3 GB/s saturated bandwidth with 4 clients. Read: 5.3 GB/s with 1 client node; 10.1 GB/s saturated bandwidth with 4 clients.
OPA and Mellanox EDR [3] - Write: 6.7 GB/s with 1 client node. Read: 7.2 GB/s with 1 client node.

Note: This table should be used as guidance; each individual solution will vary slightly.

When a solution requires higher read and write I/O than is available from one LNet router, the additional bandwidth can be achieved by instantiating multiple LNet routers. To scope the number of LNet routers required, a solution architect should understand the capability of the existing storage and the customer's throughput requirements between the legacy storage and the new fabric systems, typically both in GB/s. Using this information and knowledge of rough LNet router performance, the appropriate number of routers can be determined. Solution architects may wish to err on the side of additional LNet routers, as an additional router will provide fault tolerance and compensate for non-bulk traffic found in some application I/O communication patterns.

As an example, suppose a customer would like 20 GB/s read and write with a legacy FDR Lustre* storage solution that is capable of this level of performance. Based on the findings in Table 7 above, a single FDR LNet router is capable of approximately 6 GB/s write and 5 GB/s read. Proposing 4 LNet routers would cover the requirements in best-case scenarios. It may be beneficial to propose a fifth LNet router, as it offers an additional level of cushion for redundancy and performance.

Support for the LNet router solution is provided through the customer's Lustre support path, as it is a standard part of the Lustre software stack.

IP Router

The IP router is a standard component in Linux. When configured to support routing between an Intel OPA fabric and a legacy fabric, it provides IP-based routing that can be used for IP traffic from GPFS, NFS, IP-based LANs and other file systems that use IP-based traffic. To facilitate the implementation of IP routers in Intel OPA deployments, a validated reference design recipe is provided. The recipe, available in the host software documentation at the Intel Download Center and titled Intel Omni-Path Router Design Guide, provides instructions on how to implement and configure IP routers to connect Intel OPA and InfiniBand or Ethernet networks. The router design guide focuses on storage TCP/IP routing, but the fundamentals hold true for general TCP/IP routing.

The Virtual Router Redundancy Protocol (VRRP) v3 software in Linux is used to enable failover and load balancing with the IP routers. VRRP is a computer networking protocol that provides for automatic assignment of available Internet Protocol (IP) routers to participating hosts. This increases the availability and reliability of routing paths via automatic default gateway selection on an IP subnetwork. IP routers can be configured for high availability using VRRP. This can be done with an active and a passive server.
In a system configured with multiple routers, the routers can be configured to be master on some subnets and slave on the others, thus allowing the routers to be more fully utilized while still providing resiliency. The load-balancing capability is provided by VRRP using IP Virtual Server (IPVS). IPVS implements transport-layer load balancing, usually called Layer 4 LAN switching, as part of the Linux kernel. IPVS is incorporated into the Linux Virtual Server (LVS), where it runs on a host and acts as a load balancer in front of a cluster of real servers. IPVS can direct requests for TCP- and UDP-based services to the real servers, and make the services of the real servers appear as virtual services on a single IP address.
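To make the VRRP and IPVS pieces concrete, the sketch below shows the general shape of a keepalived configuration for such a router pair. It is illustrative only and not taken from the Intel Omni-Path Router Design Guide; the interface name, virtual router ID, addresses, service port and weights are hypothetical placeholders.

    # /etc/keepalived/keepalived.conf on the active IP router
    # (the passive router uses state BACKUP and a lower priority)
    vrrp_instance storage_gw {
        state MASTER
        interface ib0                  # IPoIB interface on the legacy IB or Ethernet side
        virtual_router_id 51
        priority 100
        advert_int 1
        virtual_ipaddress {
            192.168.30.254/24          # gateway address used by hosts on this subnet
        }
    }

    # IPVS virtual service in front of two file system servers, weighted round-robin
    virtual_server 192.168.30.254 2049 {
        delay_loop 6
        lb_algo wrr
        lb_kind DR
        protocol TCP
        real_server 192.168.30.11 2049 {
            weight 2                   # higher-performance server receives more of the load
            TCP_CHECK {
                connect_timeout 3
            }
        }
        real_server 192.168.30.12 2049 {
            weight 1
            TCP_CHECK {
                connect_timeout 3
            }
        }
    }

In addition, IP forwarding (net.ipv4.ip_forward=1) must be enabled on the router so that traffic can pass between the Intel OPA and legacy subnets.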

A weighted round-robin algorithm is used, and different weights can be added to distribute load across file system servers that have different performance capabilities.

Figure 8: IP Router (compute nodes CN 0 through CN n on the Intel OPA compute fabric connect through OPA to IP routers, and over Ethernet or IB to the existing infrastructure and GPFS or NFS servers)

Table 8: IP Router Hardware Recipe (hardware / recommendation)

CPU - Dual-socket current or future generation Intel Xeon processors. Example: Intel Xeon E5-2643 v4 (3.40 GHz), 6 core
Memory - 16 or 32 GB RAM per node
Server platform - 1U rack server or equivalent form factor with two x16 PCIe slots
Intel OPA connection - One x16 HFI
IB/Ethernet connection - Mellanox FDR and EDR (other generations are expected to work; performance will vary), or Ethernet

Table 9: IP Router Software Recipe (software / recommendation)

Base OS - RHEL 7.2 + Intel OPA delta distribution, OR SLES 12 SP1 + Intel OPA delta distribution

Note: The IP Router Hardware Recipe (Table 8) and the IP Router Software Recipe (Table 9) information is preliminary and based upon configurations that have been tested by Intel to date. Further optimization of CPU and memory is planned.

The target peak aggregate forwarding rate is 35 Gbps per IP router server with either EDR or FDR IB when following the above recipe. Performance of the IP router functionality is dependent on the TCP/IP stack in Linux, which has proven to be sensitive to CPU frequency. Higher CPU frequencies should be considered for optimal performance; lower-frequency solutions can be used, but as with all solutions they should be benchmarked to fully understand the IPoIB throughput capabilities. Other generations of InfiniBand are expected to work with this recipe; however, performance will vary. Ethernet connectivity and routing to storage or other networks is also supported by the recipe. Additional bandwidth can be achieved by instantiating multiple IP routers.

Support for the IP router solution is provided through the customer's Linux support path, as it is a standard part of the Linux software stack.

For more information about Intel Omni-Path Architecture and next-generation fabric technology, visit:
www.intel.com/hpcfabrics
www.intel.com/omnipath

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. THE INFORMATION PROVIDED IN THIS PAPER IS INTENDED TO BE GENERAL IN NATURE AND IS NOT SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS WILL OBTAIN SIMILAR RESULTS. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS AND SERVICES. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS AND SERVICES, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS AND SERVICES INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Copyright 2017 Intel Corporation. All rights reserved. Intel, Intel Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others.

1 - Lustre: Intel Enterprise Edition for Lustre 46 TB file system (IEEL version 2.7.16.4). Metadata subsystem: 2x MDS, dual socket E5-2699 v3 BDW-EP, 256 GB/node 2133 MHz DDR4, RHEL 7.2, 2x 480 GB Intel Haleyville SSD per node, Intel R2224WTTYSR Wildcat Pass. Object subsystem: 4x OSS, dual socket E5-2699 v3 BDW-EP, 256 GB/node 2133 MHz DDR4, RHEL 7.2, 24x 480 GB Intel Haleyville SSD per node, Intel R2224WTTYSR Wildcat Pass, SAS controller. LNet router: Intel Xeon processor E5-2697A v4, 2.60 GHz, 16 cores, Intel Enterprise Edition for Lustre version 2.7.16.11, Intel Turbo Boost Technology disabled, Intel Hyper-Threading Technology disabled. BIOS settings: Snoop hold-off timer = 9, Early snoop disabled, Cluster on die disabled, IOU non-posted prefetch disabled. OS: Red Hat Enterprise Linux* Server release 7.2 (Maipo), kernel 3.10.0-327.36.3.el7.x86_64.

2 - Tests performed on Intel Xeon processor E5-2697A v4, 2.60 GHz, 16 cores. Intel Turbo Boost Technology enabled, Intel Hyper-Threading Technology enabled. RHEL 7.2. BIOS settings: IOU non-posted prefetch disabled.
Snoop timer for posted prefetch=9. Early snoop disabled. Cluster on Die disabled. Intel Fabric Suite 10.2.0.0.158. Intel Corporation Device 24f0 Series 100 HFI ASIC (B0 silicon). OPA switches: Series 100 Edge Switch 48 port (B0 silicon). Mellanox EDR based on internal measurements: MLNX_OFED_LINUX-3.2-2.0.0.0 (OFED-3.2-2.0.0). Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700-36 Port EDR InfiniBand switch. IOR benchmark version 2.10.3. Transfer size=1 MB, file size=256 GB, 16 ppn, unique file created per process. EDR parameters: -genv I_MPI_FABRICS=shm:dapl

3 - Throughput achieved with 4 aggregate EDR clients reading/writing to OPA-based Lustre. Tests performed on Intel Xeon processor E5-2697A v4, 2.60 GHz, 16 cores. Intel Turbo Boost Technology enabled, Intel Hyper-Threading Technology enabled. RHEL 7.2. BIOS settings: IOU non-posted prefetch disabled. Snoop timer for posted prefetch=9. Early snoop disabled. Cluster on Die disabled. Intel Corporation Device 24f0 Series 100 HFI ASIC (B0 silicon). OPA switches: Series 100 Edge Switch 48 port (B0 silicon). Mellanox EDR based on internal measurements: MLNX_OFED_LINUX-3.2-2.0.0.0 (OFED-3.2-2.0.0). Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700-36 Port EDR InfiniBand switch. IOR benchmark version 2.10.3. Transfer size=1 MB, file size=256 GB, 16 ppn, unique file created per process. EDR parameters: -genv I_MPI_FABRICS=shm:dapl

Printed in USA 1115/EON/HBD/PDF Please Recycle 333509-001US