LNET MULTI-RAIL RESILIENCY


13th ANNUAL WORKSHOP 2017
LNET MULTI-RAIL RESILIENCY
Amir Shehata, Lustre Network Engineer, Intel Corp
March 29th, 2017

OUTLINE
- Multi-Rail Recap
- Base Multi-Rail
- Dynamic Discovery
- Multi-Rail performance on SGI system
- LNet Resiliency
- LNet Messages
- Message failures
- Failure Handling
- Interface Health Tracking
- Reporting
- Summary

LUSTRE NETWORK MULTI-RAIL (MR)
Prior to MR:
- Only one Network Interface (NI) per LNet network.
- Multiple NIs == multiple LNet networks.
- This resulted in a complex configuration.

WITHOUT MULTI-RAIL

DRIVER
- Add support for big Lustre nodes:
  SGI UV 300: 32-socket NUMA system
  SGI UV 3000: 256-socket NUMA system
- Large systems need lots of bandwidth.
- NUMA systems benefit when memory buffers and interfaces are close in the system's topology.
- Support different HW (e.g. OPA, MLX, ETH).
- Simplify configuration.

WITH MULTI-RAIL

BENEFITS OF MULTI-RAIL
Multi-Rail implemented at the LNet layer allows:
- Multiple interfaces connected to one network.
- Multiple interfaces connected to different networks.
- Interfaces to be used simultaneously.
The LNet-level implementation also allows use of interfaces from different vendors. For example, OPA and MLX are incompatible: with OPA on o2ib0 and MLX on o2ib1, MR can use both simultaneously to communicate with peers on the same networks.

BASIC CONFIGURATION
Two MR configuration steps:
1. Configure the local networks and the interfaces on each network. EX: o2ib0 with ib0, ib1 and ib2.
2. Configure peers. In this step each node is configured with all the Multi-Rail peers it needs to know about. Peers are configured by identifying, on the node, the peer's primary NID and all of its other NIDs. (See the sketch below.)
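
As a sketch of these two steps with the lnetctl utility (flag names as in Multi-Rail capable Lustre releases, though the exact syntax may differ by version):

# Step 1: create network o2ib0 with three local interfaces
lnetctl net add --net o2ib0 --if ib0,ib1,ib2

# Step 2: declare a Multi-Rail peer by its primary NID and its other NIDs
lnetctl peer add --prim_nid 192.168.122.30@tcp --nid 192.168.122.31@tcp,192.168.122.32@tcp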

EXAMPLE CONFIGURATION
Peer B's configuration on Peer A:
peer:
    - primary nid: 192.168.122.30@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.30@tcp
          state: NA
        - nid: 192.168.122.31@tcp
          state: NA
        - nid: 192.168.122.32@tcp
          state: NA
Peer A's configuration on Peer B:
peer:
    - primary nid: 192.168.122.10@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.10@tcp
          state: NA
        - nid: 192.168.122.11@tcp
          state: NA
        - nid: 192.168.122.12@tcp
          state: NA

DYNAMIC DISCOVERY
- Manual configuration can be error-prone.
- Dynamic Discovery reduces the configuration burden: peers exchange their lists of NIDs when they first communicate, so peer entries do not have to be configured by hand.
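
Discovery can also be triggered manually with lnetctl (the discover command exists in Multi-Rail capable releases; output format varies):

# Ask the local node to discover all the interfaces of a peer
lnetctl discover 192.168.122.30@tcp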

LNET MR SUMMARY
- MR allows multiple interfaces, increasing bandwidth.
- Good for servers that need more bandwidth to serve more clients, with the ability to add more interfaces to the server.
- Useful for large clients with many interfaces, like the UV.

MULTI-RAIL PERFORMANCE: THE SYSTEM
- SGI UV 300: 32 sockets of Xeon processors, 16 TB of memory, 8 Omni-Path network interfaces
- 8 C2112-GP2-EX Object Storage Systems (OSS)
- Per OSS: 4 P3700 NVMe devices, one LDISKFS Object Storage Target (OST)

MULTI-RAIL PERFORMANCE
Theoretical maximum performance of the system (P3700):
- Sequential Write: 34560 MB/s
- Sequential Read: 86400 MB/s
Measured Multi-Rail performance:
- Sequential Write: 31990.18 MB/s
- Sequential Read: 68593.35 MB/s

LNET RESILIENCY
But what about resiliency?
- Lustre error handling is expensive. It can involve evicting clients (depending on which RPC was lost).
- What can LNet do to address network failures?
- Other interfaces can be tried for the same peer before giving up on a message.

LNET MESSAGES
LNet has four main message types: PUT, ACK, GET, REPLY.
- ACK is an optional response to a PUT.
- REPLY is the response to a GET.
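
For reference, these message types appear in the LNet headers roughly as the following enum (a sketch of lnet/types.h from memory; the exact form, and the additional internal HELLO type, vary by release):

/* Approximate rendering of the LNet message type enum. */
enum lnet_msg_type {
	LNET_MSG_ACK = 0,
	LNET_MSG_PUT,
	LNET_MSG_GET,
	LNET_MSG_REPLY,
	LNET_MSG_HELLO,	/* internal handshake type, not listed on the slide */
};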

LNET MESSAGE TRANSACTIONS
That gives us the following combinations to handle:
- PUT
- PUT + ACK
- GET + REPLY
An RPC can be composed of one or more of the above combinations.

GOAL
- For LNet to be resilient, it must try all local interfaces and all remote interfaces before it declares a message failed.
- But it must not go on resending messages indefinitely: there has to be an upper time limit after which LNet declares a message failed.
- Resiliency logic needs to be implemented at the LNet level so that it can fail over across different network types if needed: OPA, MLX, ETH.

STRATEGY
- An LNet network can have directly connected nodes, or LNet routers can introduce extra hops between the source and the destination.
- LNet resiliency is concerned with ensuring that an LNet message is delivered to the next hop.
- EX: if the source and the destination are separated by multiple hops, each hop is responsible for ensuring that a message is received, and potentially acknowledged, by the immediate next hop.

PUT
- A PUT message is used to indicate that data is about to be RDMAed from the node to the peer. This is the simplest LNet message.
- Since there is no response to this message, if it is dropped on the remote end there is nothing that LNet can do.
- Upper layers that send a PUT on its own need to implement their own timeouts.

PUT+ACK
- The sender of a PUT can explicitly request an ACK from the receiver.
- The ACK message is generated by the LNet layer as soon as it receives the PUT; it does not wait for the upper layer to process the PUT.
- Since LNet on the sender node expects an ACK, if the ACK is not received LNet can deliver a timeout failure event to the upper layer.

GET+REPLY
- GET semantics mean that the sender node is requesting data from the peer.
- A GET always expects a REPLY. LNet does not wait for the upper layer to process the GET.
- Since LNet knows that a REPLY is expected, if it is not received LNet can deliver a timeout event to the upper layer.
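
As a sketch of how an upper layer selects among these transactions, the classic LNet API distinguishes a bare PUT from PUT+ACK via the ack argument, while a GET always implies a REPLY. The signatures below are approximate and have changed across Lustre versions:

/* Sketch using the classic LNet API; signatures approximate. */
static void example_transactions(lnet_nid_t self, struct lnet_handle_md mdh,
				 struct lnet_process_id target)
{
	/* Bare PUT: fire and forget; the upper layer needs its own timeout. */
	LNetPut(self, mdh, LNET_NOACK_REQ, target,
		0 /* portal */, 0 /* match bits */, 0 /* offset */,
		0 /* hdr data */);

	/* PUT+ACK: LNet itself can time out waiting for the ACK. */
	LNetPut(self, mdh, LNET_ACK_REQ, target, 0, 0, 0, 0);

	/* GET: always paired with a REPLY, so LNet can time out on the REPLY. */
	LNetGet(self, mdh, target, 0 /* portal */, 0 /* match bits */,
		0 /* offset */);
}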

LNET RESILIENCY SUMMARY
- LNet resiliency is concerned with ensuring that a message is delivered to the next hop.
- In the PUT+ACK and GET+REPLY cases LNet can maintain a timeout.

FAILURES TO HANDLE
There are three classes of failure that LNet needs to handle:
1. Local interface failure: there is some issue with the local interface that prevents it from sending or receiving messages.
2. Remote interface failure: there is some issue with the remote interface that prevents it from sending or receiving messages.
3. Path failure: the local interface is working properly, but messages never reach the peer interface.

LOCAL INTERFACE FAILURE
- The LND reports failures it gets from the hardware to LNet. EX: IB_EVENT_DEVICE_FATAL, IB_EVENT_PORT_ERR.
- LNet refrains from using that interface for any traffic and retransmits on other available interfaces.
- It will keep attempting to retransmit the message until a configurable peer timeout is reached.
- On peer timeout, a failure is propagated to the higher layers.

REMOTE INTERFACE FAILURE
There are cases when a remote interface can be considered failed:
- Wrong-address error.
- No route to host.
- A connection cannot be established.
- The connection was rejected due to incompatible parameters (needs further investigation).
In these cases the local interface is working, but is unable to establish a connection with the remote interface. The LND reports this error to the LNet layer, and LNet attempts to resend the message to a different remote interface. If none are available, the message is failed. (A sketch of this resend logic follows.)
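
A minimal sketch of the resend behavior described in the last two slides, assuming hypothetical helpers (pick_next_local_ni, pick_next_remote_ni and send_via are illustrative names, not the actual LNet implementation):

#include <stdbool.h>
#include <time.h>

struct ni;	/* stands in for a local or remote network interface */
struct msg;	/* stands in for a queued LNet message */

/* Illustrative helpers, not the real LNet API. */
extern struct ni *pick_next_local_ni(struct msg *m);
extern struct ni *pick_next_remote_ni(struct msg *m);
extern bool send_via(struct msg *m, struct ni *local, struct ni *remote);

/*
 * Keep retrying across local and remote interfaces until the configurable
 * peer timeout expires, then report failure to the upper layer.
 */
static bool resend_until_peer_timeout(struct msg *m, time_t peer_timeout)
{
	time_t deadline = time(NULL) + peer_timeout;

	while (time(NULL) < deadline) {
		struct ni *local = pick_next_local_ni(m);
		struct ni *remote = pick_next_remote_ni(m);

		if (!local || !remote)
			break;		/* no interfaces left to try */
		if (send_via(m, local, remote))
			return true;	/* delivered to the next hop */
	}
	return false;			/* propagate failure upward */
}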

PATH FAILURE
- The LNDs currently implement a protocol which ensures that a message is received by the remote LND within a transmit timeout.
- If an LND message is not acknowledged, the transmit timeout on that message expires.
- Where the timeout expires tells us where the failure resides: either a problem with the local interface, or somewhere on the network.

PATH FAILURE BREAKDOWN
LND timeouts occur for the following reasons:
1. The message is on the sender's queue but is not posted within the transmit timeout.
2. The message is posted but the transmit never completes.
3. The message is posted and the transmit completes, but the remote never acknowledges it.
If it's a local or remote interface issue, it is dealt with as previously described. Otherwise an entirely new pathway between the peers is attempted.

INTERFACE HEALTH TRACKING
- A relative health value is kept per local and remote interface.
- On soft errors, such as timeouts, the interface's health value is decremented, and that interface is selected at a lower priority for future messages.
- The health value recovers over time.
- The result is that consistently unhealthy interfaces are preferred less.
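
A toy sketch of such a health mechanism; the names, penalty size and recovery rate are illustrative assumptions, not LNet's actual values:

#include <stddef.h>

#define HEALTH_MAX	1000	/* assumed full-health value */
#define HEALTH_PENALTY	100	/* assumed decrement on a soft error */
#define HEALTH_RECOVERY	1	/* assumed recovery per interval */

struct iface {
	const char *name;
	int health;		/* relative health, 0..HEALTH_MAX */
};

/* Soft error (e.g. a timeout): make the interface less attractive. */
static void iface_soft_error(struct iface *ni)
{
	ni->health -= HEALTH_PENALTY;
	if (ni->health < 0)
		ni->health = 0;
}

/* Called periodically: health slowly recovers over time. */
static void iface_recover(struct iface *ni)
{
	ni->health += HEALTH_RECOVERY;
	if (ni->health > HEALTH_MAX)
		ni->health = HEALTH_MAX;
}

/* Selection prefers the healthiest interface; earlier entries win ties. */
static struct iface *pick_iface(struct iface *nis, size_t n)
{
	struct iface *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (best == NULL || nis[i].health > best->health)
			best = &nis[i];
	return best;
}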

REPORTING
- lnetctl is the utility currently used to configure the different LNet parameters.
- lnetctl will also extract LNet resiliency information and show it on demand, enhancing the ability to debug a cluster when needed.
- This information can also be displayed in a GUI. EX: Intel Manager for Lustre (IML).
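
For illustration, this kind of information is exposed through lnetctl's verbose show commands (these commands exist; which health counters appear in the output depends on the Lustre release):

# Show local network interfaces; verbose output includes statistics
lnetctl net show -v

# Show configured and discovered peers and their per-NI state
lnetctl peer show -v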

SUMMARY
- Multi-Rail increases a node's bandwidth.
- LNet resiliency builds on top of the Multi-Rail infrastructure to allow resending of LNet messages over different interfaces on failure.
- The LNet-level implementation allows messages to be sent over different networks and HW.

13th ANNUAL WORKSHOP 2017
THANK YOU
Amir Shehata, Lustre Network Engineer, Intel Corp