Olaf Weber, Senior Software Engineer, SGI Storage Software. Amir Shehata, Lustre Network Engineer, Intel High Performance Data Division


Olaf Weber, Senior Software Engineer, SGI Storage Software. Amir Shehata, Lustre Network Engineer, Intel High Performance Data Division.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright 2016, Intel Corporation.

Multi-Rail LNet
Multi-Rail is a long-standing wish list item known under a variety of names:
- Multi-Rail
- Interface Bonding
- Channel Bonding
The various names do imply some technical differences. This implementation is a collaboration between SGI and Intel. Our goal is to land multi-rail support in Lustre*.

What Is Multi-Rail?
Multi-Rail allows nodes to communicate across multiple interfaces:
- Using multiple interfaces connected to one network
- Using multiple interfaces connected to several networks
These interfaces are used simultaneously.

Why Multi-Rail: Big Clients
We want to support big Lustre client nodes.
- SGI UV 300: 32-socket NUMA system
- SGI UV 3000: 256-socket NUMA system
A system with multiple TB of memory needs a lot of bandwidth. NUMA systems benefit when memory buffers and interfaces are close in the system's topology.

Why Multi-Rail: Increasing Server Bandwidth
In big clusters, bandwidth to the server nodes becomes a bottleneck.
- Adding faster interfaces implies replacing much or all of the network.
  - Adding faster interfaces only to the servers does not work.
- Adding more interfaces to the servers increases the bandwidth.
  - Using those interfaces requires a redesign of the LNet networks.
  - Without Multi-Rail, each interface connects to a separate LNet network.
  - Clients must be distributed across these networks.
It should be simpler than this.

The Multi-Rail Project
- Add basic multi-rail capability
  - Multiplexing across interfaces, as opposed to striping across them
  - Multiple data streams are therefore needed to profit
- Extend peer discovery to simplify configuration
  - Discover peer interfaces
  - Discover peer multi-rail capability
- Configuration can be changed at runtime
  - This includes adding or removing interfaces
  - lnetctl is used for configuration
- Compatible with non-multi-rail nodes
- Add resiliency by using alternate paths in case of failures

Single Fabric With One LNet Network
[Diagram: MGS/MGT, MDS/MDT, three OSS nodes, several small clients, and one big UV client, all on a single fabric.]
This is a small Lustre cluster with a single big client node. All nodes are connected to a single fabric (physical network). There is one LNet network connecting the nodes. The big UV node has a single connection to this network. It has the same network bandwidth available to it as the small clients.

Single Fabric With Multiple LNet Networks
[Diagram: the same cluster, with the UV client now using several interfaces, each connected to a different LNet network within the single fabric.]
Additional interfaces have been added to the UV to increase its bandwidth. Without Multi-Rail LNet we must configure multiple LNet networks. Each OSS lives on a separate LNet network, within the single fabric. Each interface on the UV connects to one of these LNet networks. On the other client nodes, aliases are used to connect a single interface to multiple LNet networks.

Single Fabric With One Multi-Rail LNet Network
[Diagram: the same fabric carrying a single LNet network; the UV client connects to it with multiple interfaces.]
Multi-Rail LNet allows the LNet network configuration to match the fabric. The fabric is the same as in the previous slide. The configuration is much simpler. The network bandwidth to the UV node is increased to match its size.

Dual Fabric With Dual Multi-Rail LNet Networks
[Diagram: two fabrics, each carrying an LNet network; the servers and the UV client connect to both fabrics, the other clients to one.]
In this example there are two fabrics, each with an LNet network on top. The server nodes connect to both fabrics. The UV client connects with multiple interfaces to both fabrics. The other client nodes connect to only one fabric.

Use Cases
- Improved performance
- Improved resiliency
- Better usage of large clients
  - The Multi-Rail code is NUMA aware
- Fine grained control of traffic
- Simplified multi-network file system access

Two Configuration Methods
Multi-Rail can be configured statically with lnetctl.
The following must be configured statically:
- Local network interfaces: the network interfaces by which a node sends messages
- Selection rules: the rules which determine the local/remote network interface pair used to communicate between a node and a peer; the default is weighted round-robin
The following may be configured statically, but can also be discovered dynamically:
- Peer network interfaces: the remote network interfaces of peer nodes to which a node sends messages; enable dynamic peer discovery to have LNet configure peers automatically

Static Configuration: Basic Concepts
On a node:
- Configure local network interfaces.
  Example: tcp(eth0,eth1) gives the node the NIDs <eth0 IP>@tcp and <eth1 IP>@tcp.
- Configure remote network interfaces.
  Specify the peer's Network Interface IDs (NIDs): <peerX primary NID>, <peerX NID2>, ...
- Configure selection rules.
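As a concrete sketch of the static steps above, using the lnetctl command set as it eventually shipped with multi-rail-capable Lustre (the interface names and NIDs below are placeholders, not values from this talk):

    # Configure two local network interfaces on the tcp network.
    lnetctl net add --net tcp --if eth0,eth1

    # Statically configure a multi-rail peer by listing its NIDs
    # (placeholder addresses; use the peer's real NIDs).
    lnetctl peer add --prim_nid 192.168.1.10@tcp --nid 192.168.2.10@tcp

    # Inspect the resulting configuration.
    lnetctl net show
    lnetctl peer show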

Dynamic Configuration: Basic Concepts
LNet can dynamically discover a peer's NIDs. On a node:
- Local network interfaces must be configured as before
- Selection rules must be configured as before
- Peers are discovered as messages are sent and received
  - An LNet ping is used to get a list of the peer's NIDs
  - A feature bit indicates whether the peer supports Multi-Rail
  - The node pushes a list of its NIDs to Multi-Rail peers
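A sketch of how this looks from the command line, assuming the discovery controls that lnetctl gained when this work landed (the NID is a placeholder):

    # Enable dynamic peer discovery; peers are then discovered as traffic flows.
    lnetctl set discovery 1

    # Optionally trigger discovery of a specific peer via one of its NIDs.
    lnetctl discover 192.168.1.10@tcp

    # The discovered peer and all of its NIDs are listed here.
    lnetctl peer show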

Selection Rules: Basic Concepts
Selection rules work by adding priorities; higher priorities are preferred. The rules specify patterns. The following types of rules are supported:
- Network
- Local NID
- Peer NID
- Local NID / peer NID pair: these rules have two NID patterns, one for the local NID and one for the peer NID, and are used to specify preferred paths
A combined sketch in the YAML notation of the later example slides follows below.
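Putting the rule types side by side in the YAML notation used on the following slides (a sketch only; the NID patterns are placeholders, and the selection-rule syntax was still under development at the time of this talk):

    selection:
        - type: net                  # prefer everything on one LNet network
          net: o2ib
          priority: 0                # highest priority
        - type: nid                  # prefer NIDs matching a pattern (local or peer)
          nid: 10.10.10.*@o2ib
          priority: 1
        - type: peer                 # prefer a specific local/peer NID pair (a path)
          local: 10.10.10.2@o2ib
          remote: 10.10.10.4@o2ib
          priority: 0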

Example: Improved Performance
[Diagram: MGS/MGT, MDS/MDT, clients, and an OSS with two interfaces, all on the o2ib0 network.]
Configuring the OSS:
    net:
        - net type: o2ib
          local NI(s):
              - interfaces:
                    0: ib0
                    1: ib1
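The same OSS configuration can also be created at runtime with a single command; a sketch, assuming the interfaces are named ib0 and ib1 as on the slide:

    # Attach both InfiniBand interfaces to the o2ib network on the OSS.
    lnetctl net add --net o2ib --if ib0,ib1

    # Dump the running configuration as YAML for reuse elsewhere.
    lnetctl export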

Example: Improved Resiliency
[Diagram: MGS/MGT, MDS/MDT, OSS, and clients connected to two fabrics: an IB fabric carrying o2ib0 and an OPA fabric carrying o2ib1.]

Example: Improved Resiliency
Configuring the clients and OSS:
    net:
        - net type: o2ib
          local NI(s):
              - interfaces:
                    0: ib0
        - net type: o2ib1
          local NI(s):
              - interfaces:
                    0: ib1
    selection:
        - type: net
          net: o2ib
          priority: 0   # highest priority
        - type: net
          net: o2ib1
          priority: 1
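A configuration like this is normally kept in a YAML file and loaded in one step; a sketch with a placeholder file name, noting that the selection: section depends on the rule syntax proposed in this talk:

    # Load the combined net + selection configuration from a YAML file.
    lnetctl import < /etc/lnet-multirail.yaml

    # Verify that both o2ib networks came up.
    lnetctl net show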

Example: Multi Network Filesystem Access
[Diagram: MGS/MGT, MDS/MDT, and OSS reachable over two networks, o2ib0 and tcp0; some clients connect over o2ib0, others over tcp0.]

Example: Multi Network Filesystem Access
IB clients:
    net:
        - net type: o2ib
          local NI(s):
              - interfaces:
                    0: ib0
    peer:
        - primary nid: <mgs-ib-ip>@o2ib0
          peer ni:
              - nid: <mgs-tcp-ip>@tcp0
    selection:
        - type: nid
          nid: *.*.*.*@o2ib*
          priority: 0   # highest priority
TCP clients:
    net:
        - net type: tcp0
          local NI(s):
              - interfaces:
                    0: eth0
    peer:
        - primary nid: <mgs-ib-ip>@o2ib0
          peer ni:
              - nid: <mgs-tcp-ip>@tcp0
    selection:
        - type: nid
          nid: *.*.*.*@tcp*
          priority: 0   # highest priority
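On a multi-rail-capable lnetctl, the peer entry above can also be created from the command line; a sketch that keeps the slide's placeholders for the MGS addresses:

    # Tell this client that the MGS is a single peer with both an IB and a TCP NID.
    lnetctl peer add --prim_nid <mgs-ib-ip>@o2ib0 --nid <mgs-tcp-ip>@tcp0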

Example: Fine Grained Traffic Control
[Diagram: Client1, Client2, MGS/MGT, MDS/MDT, OSS1, and OSS2 on a single fabric with a bottleneck between its two halves.]
This is a single fabric with a bottleneck. The NIDs are:
    Client1:  10.10.10.2@o2ib
    Client2:  10.10.10.3@o2ib
    MGS-1:    10.10.10.4@o2ib
    MGS-2:    10.10.10.5@o2ib
    MDS-1:    10.10.10.6@o2ib
    MDS-2:    10.10.10.7@o2ib
    OSS1-1:   10.10.10.8@o2ib
    OSS1-2:   10.10.10.9@o2ib
    OSS2-1:   10.10.10.10@o2ib
    OSS2-2:   10.10.10.11@o2ib

Example: Fine Grained Traffic Control
This rule makes Client1 avoid the red path:
    selection:
        - type: peer
          local: 10.10.10.2@o2ib
          remote: 10.10.10.[4-10/2]@o2ib
          priority: 0   # highest priority
The range 10.10.10.[4-10/2]@o2ib expands to 10.10.10.4, .6, .8, and .10, i.e. the first interface of each server. Client1 will only use the red path if there is no other option.

Project Status
Public project wiki page: http://wiki.lustre.org/multi-rail_lnet
Code development is done on the multi-rail branch of the Lustre master repo.
- Patches to enable static configuration are under review; unit testing and system testing are underway
- Patches for selection rules are under development
- Patches for dynamic peer discovery are under development
Estimated project completion time: end of this year. Master landing date: TBD.

Legal Notices and Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel disclaims all express and implied warranties, including, without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing or usage in trade.
Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation.