Multi-Rail LNet for Lustre


1 Multi-Rail LNet for Lustre
Rob Mollard
September 2016

The SGI logos and SGI product names used or referenced herein are either registered trademarks or trademarks of Silicon Graphics International Corp. or one of its subsidiaries. All other trademarks, trade names, service marks and logos referenced herein belong to their respective holders. Any and all copyright or other proprietary notices that appear herein, together with this Legal Notice, must be retained on this presentation. The information contained herein is subject to change without notice.
Some names and brands may be claimed as the property of others.

2 Multi-Rail LNet
Multi-Rail is a long-standing wish list item known under a variety of names:
- Multi-Rail
- Interface Bonding
- Channel Bonding
The various names do imply some technical differences.
This implementation is a collaboration between SGI and Intel.

3 What Is Multi-Rail?
Multi-Rail allows nodes to communicate across multiple interfaces:
- Using multiple interfaces connected to one network
- Using multiple interfaces connected to several networks
- These interfaces are used simultaneously
Multi-Rail increases the per-client Lustre performance.

4 Why Multi-Rail: Increasing Server Bandwidth
In big clusters, bandwidth to the server nodes becomes a bottleneck.
- Adding faster interfaces implies replacing much or all of the network.
  - Adding faster interfaces only to the servers does not work.
- Adding more interfaces to the servers increases the bandwidth.
  - Using those interfaces requires a redesign of the LNet networks.
  - Without Multi-Rail, each interface connects to a separate LNet network.
  - Clients must be distributed across these networks.


5 Why Multi-Rail: Big Clients
We want to support big Lustre nodes.
- SGI UV 300: 32-socket NUMA system
- SGI UV 3000: 256-socket NUMA system
A system with multiple TB of memory needs a lot of bandwidth.
NUMA systems benefit when memory buffers and interfaces are close in the system's topology.
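To illustrate why interface placement matters on a NUMA system, here is a minimal Python sketch that picks the interface closest to the memory buffer being sent. It is not taken from the Multi-Rail code: the function name `nearest_interface`, the distance matrix, and the interface placement are all hypothetical values chosen for the example.

```python
def nearest_interface(buffer_node, interface_nodes, distance):
    """Return the interface whose NUMA node is closest to `buffer_node`.

    interface_nodes: dict mapping interface name -> NUMA node id
    distance: square matrix, distance[a][b] is the cost from node a to node b
    """
    if not interface_nodes:
        raise ValueError("no interfaces to choose from")
    return min(
        interface_nodes,
        key=lambda ifname: distance[buffer_node][interface_nodes[ifname]],
    )


# Hypothetical 4-node topology: 10 means local, larger means farther away.
distance = [
    [10, 16, 32, 32],
    [16, 10, 32, 32],
    [32, 32, 10, 16],
    [32, 32, 16, 10],
]
interfaces = {"ib0": 0, "ib1": 2}  # assumed placement of two interfaces

choice_for_node1 = nearest_interface(1, interfaces, distance)
choice_for_node3 = nearest_interface(3, interfaces, distance)
```

With this assumed topology, a buffer on node 1 is sent through `ib0` and a buffer on node 3 through `ib1`, so neither transfer crosses the more expensive half of the machine.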

6 The Multi-Rail Project
- Add basic multi-rail capability
  - Multiplexing across interfaces, as opposed to striping across them
  - Multiple data streams are needed
  - Hardware agnostic: Ethernet, InfiniBand, Omni-Path
- Extend peer discovery to simplify configuration
  - Discover peer interfaces
  - Discover peer multi-rail capability
- Configuration can be changed at runtime
  - Including adding or removing interfaces
  - lnetctl is used for configuration
- Fully compatible with non-multi-rail nodes
- Added resiliency via alternate paths

7 Two Types of Configuration Methods
Multi-Rail can be configured statically with lnetctl.
The following must be configured statically:
- Local network interfaces: the network interfaces by which a node sends messages
- Selection rules: the rules which determine the local/remote network interface pair used to communicate between a node and a peer
  - Default is weighted round-robin
The following can be configured statically or discovered dynamically:
- Peer network interfaces: the remote network interfaces of peer nodes to which a node sends messages
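The default selection rule named above, weighted round-robin over local/remote interface pairs, can be sketched in a few lines of Python. This is an illustration of the general technique, not LNet's selection code: the function name `weighted_round_robin`, the NIDs, and the weights are hypothetical, and the "smooth" credit-based variant used here is one of several ways to implement weighted round-robin.

```python
from itertools import islice


def weighted_round_robin(pairs):
    """Yield (local, remote) pairs forever, each in proportion to its weight.

    `pairs` maps a (local_nid, remote_nid) tuple to a positive integer weight.
    On every turn each pair gains its weight in credit, the pair with the
    most credit is chosen, and the chosen pair is charged the total weight.
    """
    if not pairs or any(w <= 0 for w in pairs.values()):
        raise ValueError("need at least one pair, all with positive weights")
    total = sum(pairs.values())
    credit = {pair: 0 for pair in pairs}
    while True:
        for pair, weight in pairs.items():
            credit[pair] += weight
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        yield chosen


# Hypothetical NIDs: two local interfaces talking to one peer interface.
pairs = {
    ("10.0.0.1@o2ib", "10.0.0.9@o2ib"): 2,  # assumed weight 2
    ("10.0.0.2@o2ib", "10.0.0.9@o2ib"): 1,  # assumed weight 1
}
schedule = list(islice(weighted_round_robin(pairs), 6))
```

Over six messages the pair with weight 2 is picked four times and the pair with weight 1 twice, and the picks are interleaved rather than sent in bursts.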

8 Dynamic Configuration
Enable dynamic peer discovery to have LNet configure peers automatically.
LNet can dynamically discover a peer's NIDs.
On a node:
- Peers are discovered as messages are sent and received
- An LNet ping is used to get a list of the peer's NIDs
- A feature bit indicates whether the peer supports Multi-Rail
- The node pushes a list of its NIDs to Multi-Rail peers
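The discovery steps on this slide can be modelled as a small Python sketch. It is a toy model under stated assumptions, not the LNet protocol: the `Node` class, its method names, the NIDs, and the value of the `MULTI_RAIL` feature bit are all hypothetical, and a direct method call stands in for the ping and push messages.

```python
from dataclasses import dataclass, field

MULTI_RAIL = 0x1  # hypothetical feature bit; the real value is not given here


@dataclass
class Node:
    name: str
    nids: list
    features: int = 0
    peers: dict = field(default_factory=dict)  # peer name -> known NIDs

    def ping_reply(self):
        """What this node returns when pinged: its NIDs and feature bits."""
        return list(self.nids), self.features

    def receive_push(self, sender_name, sender_nids):
        """Record the NID list pushed by a peer."""
        self.peers[sender_name] = list(sender_nids)

    def discover(self, peer):
        """Ping `peer`, record its NIDs, and push ours if it is Multi-Rail."""
        peer_nids, peer_features = peer.ping_reply()
        self.peers[peer.name] = peer_nids
        multi_rail = bool(peer_features & MULTI_RAIL)
        if multi_rail:
            peer.receive_push(self.name, self.nids)
        return multi_rail


client = Node("client", ["10.0.0.1@o2ib", "10.0.0.2@o2ib"], MULTI_RAIL)
server = Node("server", ["10.0.0.9@o2ib", "10.0.0.10@o2ib"], MULTI_RAIL)
legacy = Node("legacy", ["10.0.0.20@o2ib"])  # no Multi-Rail feature bit

server_is_mr = client.discover(server)
legacy_is_mr = client.discover(legacy)
```

After the two calls, the client knows both server NIDs and the server has received the client's NID list, while the legacy node has been pinged but never receives a push.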

9 Use Cases
- Improved performance
- Improved resiliency
- Better usage of large clients
  - The Multi-Rail code is NUMA aware
- Fine-grained control of traffic
- Simplify multi-network file system access

10 Example Configurations

11 Single Fabric With One LNet Network
Without Multi-Rail LNet
[Diagram labels: MGS, MGT, MDS, MDT, Big, Congestion]
This is a small Lustre cluster with a single big client node.
All nodes are connected to a single fabric (physical network).
There is one LNet network connecting the nodes.
The big client node has a single connection to this network. It has the same network bandwidth available to it as the small clients.

12 Single Fabric With Multiple LNet Networks
Without Multi-Rail LNet
[Diagram labels: MGS, MGT, MDS, MDT, Big]
Additional interfaces have been added to the big client node to increase its bandwidth.
Without Multi-Rail LNet we must configure multiple LNet networks.
Each lives on a separate LNet network, within the single fabric.
Each interface on the big client node connects to one of these LNet networks.
On the other client nodes, aliases are used to connect a single interface to multiple LNet networks.

13 Single Fabric With One Multi-Rail LNet Network
[Diagram labels: MGS, MDS, MGT, MDT, Big]
Multi-Rail LNet allows for the LNet network configuration to match the fabric.
The fabric is the same as in the previous slide.
The configuration is much simpler.
The network bandwidth to the big client node is increased to match its size.

14 Dual Fabric With Dual Multi-Rail LNet Networks
[Diagram labels: MGS, MDS, MGT, MDT, Big]
In this example there are two fabrics, each with an LNet network on top.
The server nodes connect to both fabrics.
The big client node connects with multiple interfaces to both fabrics.
The other client nodes connect to only one fabric.

15 Complex Environments
Without Multi-Rail LNet
[Diagram labels: 1, 2, MGS, MDS, 1, 2, MGT, MDT]
In this example there is a single fabric with a bottleneck.
1 and 2 can be configured to avoid sending traffic over the red connection.

16 Resiliency
Without Multi-Rail LNet
[Diagram labels: 1, MGS, MDS, MGT, MDT]
The link from the top half to 1 is down.
Now traffic from 1 to 1 does flow over the red link.

17 Fine-Grained Control
[Diagram labels: 1, Big, 2, MGS, MDS, 1, 2, MGT, MDT]
In this example a big client is connected to both halves of the single fabric.
The big client can still be configured to avoid the red link.

18 Project Status

19 Project Status
- Public project wiki page:
- Code development is done on the multi-rail branch of the Lustre master repo.
- Patches to enable static configuration are under review.
  - Initial unit testing and system testing have completed.
- Patches for selection rules are under development.
- Patches for dynamic peer discovery are under development.
- Estimated project completion time: end of CY 2016
- Master landing date: Lustre 2.10
Speak to SGI for early access today!

20 Initial Results


21 Initial Results: Test Hardware
Older hardware, used for functionality testing, not performance.
[Diagram labels: UV 2000, FDR, MGS, MDS, FC8, FC8, MGT, MDT]
FDR InfiniBand 160-CPU SGI UV nodes 4 legs on fabric 8 1 leg on fabric each 30 5 * SGI IS5500 FC8 connections to

22 Initial Results: The Numbers
[Diagram labels: UV 2000, FDR, MGS, MDS, FC8, FC8, MGT, MDT]
At 16.5 GB/s, performance is approaching the theoretical limit of the configured filesystem.
This is almost 3 * FDR single-rail speed.

23 Q & A


Olaf Weber, Senior Software Engineer, SGI Storage Software
Amir Shehata, Lustre Network Engineer, Intel High Performance Data Division

Intel and the Intel logo are trademarks or registered trademarks of Intel