Olaf Weber, Senior Software Engineer, SGI Storage Software

Olaf Weber, Senior Software Engineer, SGI Storage Software. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright 2016, Intel Corporation.

Multi-Rail LNet. Multi-Rail is a long-standing wish-list item known under a variety of names: Multi-Rail, Interface Bonding, Channel Bonding. The various names do imply some technical differences. This implementation is a collaboration between SGI and Intel, first discussed at the LAD 15 Lustre Developer Summit and presented at LUG 16. Our goal is to land Multi-Rail support in Lustre* 2.10.

What Is Multi-Rail? Multi-Rail allows nodes to communicate across multiple interfaces: using multiple interfaces connected to one network, or using multiple interfaces connected to several networks. These interfaces are used simultaneously.

Big Clients. We want to support big Lustre* nodes: the SGI UV 300, a 32-socket NUMA system, and the SGI UV 3000, a 256-socket NUMA system. These systems contain hardware equivalent to an entire cluster of smaller systems. For example, 16 dual-socket nodes have 16 connections to the cluster's network; a single 32-socket system should have more than one.

Server Bandwidth. In big clusters, bandwidth to the server nodes becomes a bottleneck. Adding faster interfaces implies replacing much or all of the network, so adding faster interfaces only to the servers does not work. Adding more interfaces to the servers increases the bandwidth, but using those interfaces requires a redesign of the LNet networks: without Multi-Rail, each interface connects to a separate LNet network, and clients must be distributed across these networks. It should be simpler than this.

The Multi-Rail Project. Add basic multi-rail capability: multiplexing across interfaces, as opposed to striping across them, so multiple data streams are needed to profit. Extend peer discovery to simplify configuration: discover peer interfaces and discover peer multi-rail capability. Configuration can be changed at runtime, including adding or removing interfaces; lnetctl is used for configuration (see the sketch below). Remain compatible with non-multi-rail nodes. Add resiliency by using alternate paths in case of failures.
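The slides do not show the runtime commands themselves; the following is a minimal sketch, assuming the lnetctl net syntax shipped with the Multi-Rail work and hypothetical interface names ib0/ib1:

    # add a second InfiniBand interface to the o2ib0 network at runtime
    lnetctl net add --net o2ib0 --if ib1
    # remove that interface again
    lnetctl net del --net o2ib0 --if ib1
    # show the resulting local network configuration
    lnetctl net show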

Example Cluster. [Diagram: a cluster with MGS/MGT and MDS/MDT servers, smaller client nodes, and the UV.] A big UV node has just been added to the cluster. It has the same bandwidth available to it as the smaller client nodes.

Multi-Rail Big Client. [Diagram: the UV node connected to the MGS/MGT and MDS/MDT servers through multiple interfaces.] The UV node has been given multiple interfaces. The UV node is the only node that must run a Multi-Rail aware version of Lustre*.

Configuring the Big Client: /etc/sysconfig/lnet.conf

Interface list:

    ip2nets:
      - net-spec: o2ib0
        interfaces:
          0: ib0[0]
          1: ib1[4]
          2: ib2[8]
          3: ib3[12]

Extra configuration on the big node. Big nodes benefit from setting up CPTs: use cpu_patterns=n to map the structure of the machine to CPTs, and bind the interfaces to their CPTs. The example gives 4 interfaces evenly spaced on a 16-node system. We're looking at ways to simplify this by improving the default settings. The special configuration is limited to the big node.
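A note on the CPT setup, as a sketch only: the cpu_patterns=n setting above most likely corresponds to the libcfs cpu_pattern module parameter (an assumption, not stated on the slide), which on a NUMA machine could be set like this:

    # /etc/modprobe.d/lustre.conf (sketch; parameter name assumed)
    # "N" asks libcfs to create one CPT per NUMA node of the machine.
    options libcfs cpu_pattern="N"

The bracketed numbers in the interface list (ib0[0], ib1[4], ...) then bind each interface to one of those CPTs, which is what spaces the 4 interfaces evenly across the 16-node system.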

Multi-Rail Big Client: LNet traffic flow. [Diagram: the UV node and the MGS/MGT and MDS/MDT servers with the traffic paths marked.] A Multi-Rail node selects a single interface to talk to a non-Multi-Rail peer node. Different interfaces are used to talk to different non-Multi-Rail peers. In a big node, a process may have to use a remote interface; this effectively adds another network hop.

Multi-Rail OSS Nodes. [Diagram: the MGS/MGT and MDS/MDT servers, the OSS nodes, and the UV.] The OSS nodes run Multi-Rail aware Lustre*. Extra interfaces have been added to give them additional bandwidth. Bear in mind that the OSTs of an OSS may not be able to fill that bandwidth.

Configuring the OSS nodes: /etc/sysconfig/lnet.conf

Extra configuration on the OSS nodes.

Interface list:

    ip2nets:
      - net-spec: o2ib0
        interfaces:
          0: ib0
          1: ib1

Pattern rule:

    ip2nets:
      - net-spec: o2ib0
        ip-range:
          0: 10.0.0.*

An OSS node likely does not need CPTs defined. As with the big client, we can list all interfaces. A pattern rule is simpler: it adds all interfaces on the 10.0.0.* subnet. Peer discovery takes care of finding which NIDs belong to a peer node.
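For completeness, a minimal sketch of loading and checking such a file, assuming the standard lnetctl import interface and the file path used on the slides:

    # load the YAML configuration into LNet
    lnetctl import /etc/sysconfig/lnet.conf
    # display the local networks and interfaces that resulted
    lnetctl net show
    # display the peers found by peer discovery
    lnetctl peer show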

Multi-Rail OSS Nodes: LNet traffic flow. [Diagram: the UV and the MGS/MGT and MDS/MDT servers, with the possible paths marked in red.] When both nodes run Multi-Rail aware Lustre*, all interfaces on both nodes can be used, provided that there is a route between them. The big client node is free to choose the interface pair: any of the red paths can be used. In this example, the closest interface was used.

Multi-Rail Cluster. [Diagram: the MGS/MGT and MDS/MDT servers and the UV, all with multiple interfaces.] All nodes run Multi-Rail aware Lustre*. The server nodes have also been given multiple interfaces. Configuration of the other nodes is similar to how the OSS nodes are configured.

Multi-Rail Cluster: Multiple LNets. [Diagram: the cluster connected by two networks, green and red.] When there are multiple networks, all interfaces on all networks are used. This is a case where specifying the NIDs of the MGS/MDS/OSS can be useful. The MGT/MDT/OST was built with the IP address on the green network; the red network is added. A client that only connects to the red network needs these peers configured. A client that connects to the green network can use peer discovery.

Configuring Peer Nodes: /etc/sysconfig/lnet.conf

Peer configuration:

    peer:
      - primary nid: 10.0.0.1@o2ib0
        peer ni:
          - nid: 10.0.1.1@o2ib1
      - primary nid: 10.0.0.2@o2ib0
        peer ni:
          - nid: 10.0.1.2@o2ib1
      [...]
      - primary nid: 10.0.0.100@o2ib0
        peer ni:
          - nid: 10.0.0.101@o2ib0
          - nid: 10.0.1.100@o2ib1
          - nid: 10.0.1.101@o2ib1

Explicit peer configuration example: the colors refer to the networks in the previous slide. The Primary NID identifies a peer; LNet maps the NIDs to the Primary NID. The Primary NID shows up in log files, and LNet logs both the Primary and the actual NID. If peer discovery is also enabled, it will complain if it finds a discrepancy, but the explicit configuration is used.
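The same peers can also be added at runtime; a minimal sketch, assuming the lnetctl peer syntax introduced with Multi-Rail and using the illustrative NIDs from the slide:

    # declare a peer by its Primary NID and attach a second NID to it
    lnetctl peer add --prim_nid 10.0.0.1@o2ib0 --nid 10.0.1.1@o2ib1
    # list the configured peers, including any found by discovery
    lnetctl peer show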

SGI's Multi-Rail Test Cluster. [Diagram: the UV, the MGS and MDS with their MGT/MDT, and the FDR fabric.] SGI's test cluster is built around a 16-socket/160-core SGI UV 2000, with 12 OSS serving 52 OSTs built from a gaggle of different hardware. The fabric is FDR InfiniBand. Using older hardware enables us to dedicate a system of this size to development.

Increasing Performance by Adding Interfaces. [Plot: IOR read and IOR write throughput, 0-20 GiB/s, against 1-4 IB interfaces.] Older hardware does not lend itself to the latest in performance. The plot gives I/O throughput in GiB/s against the number of IB interfaces. Performance increases almost linearly with the number of interfaces.

Multi-Rail for Resiliency. [Diagram: the UV and the MGS/MGT and MDS/MDT servers, connected with multiple interfaces.] Nodes are connected with multiple interfaces. If an interface or connection fails, the other interfaces are used to carry the load. On failure, a message will be resent from and to a different interface, if one is available.

Connecting Multiple LNets for Resiliency. [Diagram: two networks joined by an LNet router, with the UV and the MGS/MGT and MDS/MDT servers.] When there are multiple networks, they can be connected with routers. The resiliency carries over to this configuration as well. The router ensures that a singly connected client can still talk to the servers if a server interface goes down.
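As a sketch of such a routed setup (the gateway NID and network names here are hypothetical, not taken from the slides), a client on o2ib0 would reach servers on o2ib1 through the router like this:

    # use the router whose o2ib0-side NID is 10.0.0.5@o2ib0 to reach o2ib1
    lnetctl route add --net o2ib1 --gateway 10.0.0.5@o2ib0
    # list the configured routes
    lnetctl route show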

Prioritizing LNet Networks. [Diagram: the UV and the MGS/MGT and MDS/MDT servers on an active and a standby network.] An active/standby network setup: the standby network may be slower or of smaller capacity, or traffic may be load-balanced across networks. The nodes that can use the active network will do so. Nodes that can only use the standby network will fall back to using it.

One LNet with a Bottleneck. [Diagram: two fabrics joined by a single link into one LNet network, with the UV and the MGS/MGT and MDS/MDT servers.] Instead of a router, a link connects two fabrics into one LNet network. A link like this is a bottleneck. Advanced routing makes it possible to avoid the bottleneck during normal operation; the link will be used when a server loses one of its connections to the network.

Status of the Multi-Rail Project. Code development is done on the multi-rail branch of the master repository. Basic multi-rail capability: committed to the multi-rail branch. Extended peer discovery to simplify configuration: in review on the multi-rail branch. Added resiliency: under development. Advanced routing rules: under development. Estimated project completion time: end of this year.

Project Wiki Page. The public wiki page for the project is http://wiki.lustre.org/multi-rail_lnet. Here you can find design documents, previous presentations, and other documentation.