Active-Active LNET Bonding Using Multiple LNETs and Infiniband partitions


LUG13, April 15th - 19th, 2013
Active-Active LNET Bonding Using Multiple LNETs and Infiniband Partitions
Shuichi Ihara, DataDirect Networks, Japan

Today's H/W Trends for Lustre

Powerful server platforms are emerging
- 16 CPU cores and growing, high memory bandwidth, PCIe Gen3, etc.
- The number of OSSs is important in order to obtain high throughput performance, but power and management cost are also critical.

Fast disks and next-generation devices are coming
- New 2.5-inch form factor with a 12 Gbps mixed SAS/PCIe connector.
- Prices for high-speed devices are decreasing rapidly.

High-bandwidth networking is available
- Infiniband provides high bandwidth to storage and is now the most common interconnect for Lustre networking, with efficient bit-encoding rates.

Conclusion: the performance of all components is increasing drastically; achieving optimal H/W and S/W performance is challenging!

LND/LNET Bonding

Lustre LNET performance
- 6 GB/sec on a single Infiniband FDR link.
- More bandwidth would help if the storage system is very powerful.
- Configurations with fewer Lustre servers become possible.

Channel bonding
- LND active/active channel bonding is not supported in mainstream Lustre today.
- Infiniband multi-rail configuration is supported.
- Lustre supports active/standby bonding with Infiniband.

Active/Standby IB LND Configuration

[Diagram: clients on the Infiniband fabric reach the OSS over a single LNET (LNET1); HCA0 is the active port and HCA1 the standby, with failover from one to the other; the OSS serves its OSTs.]

Failover capability, but no performance advantage!
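For reference, a minimal sketch of this active/standby setup on a CentOS 6 OSS, assuming an active-backup bond over the two IB ports and a single LNET on top of it (interface names and the IP address are illustrative):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=active-backup miimon=50 primary=ib0"

    # /etc/sysconfig/network-scripts/ifcfg-ib0 (ifcfg-ib1 is identical except DEVICE)
    DEVICE=ib0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/modprobe.d/lustre.conf -- a single LNET (o2ib0) over the bond
    options lnet networks="o2ib0(bond0)"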

Infiniband Multi-Rail Configuration

[Diagram: client group 1 reaches the OSS over LNET1 via HCA0, client group 2 over LNET2 via HCA1; both HCAs are active, giving 2x aggregated bandwidth; the OSS serves its OSTs.]

Double bandwidth, but no failover!
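A minimal sketch of such a multi-rail layout via LNET module options, assuming the OSS has both HCA ports configured as plain IPoIB interfaces and each client group is told to use only one of the two networks (interface names are illustrative):

    # OSS: both ports active, one LNET per port
    options lnet networks="o2ib0(ib0),o2ib1(ib1)"

    # Clients in group 1 use the first LNET only
    options lnet networks="o2ib0(ib0)"

    # Clients in group 2 use the second LNET only
    options lnet networks="o2ib1(ib0)"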

Novel Approach: Active/Active LND/LNET Configuration

2 x Active/Standby = Active/Active
- IB partitions create virtual (child) interfaces on a HCA.
- Multiple LNETs with o2iblnd are created on one IB fabric.
- The o2iblnd LND layer provides Lustre failover capability for Infiniband.

What are the advantages?
- No Lustre modification necessary: simply enable IB partitions on the SM (Subnet Manager), bonding, and the LNET configuration.
- Basically no additional hardware on clients and servers (more hardware increases performance further!).
- NUMA-aware, optimized OST access.
- Auto failback; manual control of the active network link is possible.

Infiniband Partitions

- Partitioning enforces isolation among systems sharing an Infiniband fabric; the concept is similar to VLANs (802.1Q).
- Enforced on hosts and switches.
- Partitions are represented by a P_Key; the Subnet Manager creates P_Key tables for the HCAs and switches in the network.
- Two membership configurations are available: full access and limited access.
- IPoIB uses P_Keys to create child interfaces associated with each P_Key.

/etc/opensm/partitions.conf:
    Default=0x7fff,ipoib : ALL=full;
    LNET0=0x8001,ipoib : ALL=full;
    LNET1=0x8002,ipoib : ALL=full;
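Once the Subnet Manager distributes the partitions above, the matching IPoIB child interfaces can be created on each host through the standard IPoIB sysfs interface; a sketch, assuming the stock ib_ipoib driver and interface ib0:

    # Create child interfaces bound to the two P_Keys
    # (the 0x8000 bit marks full membership, so P_Key 0x8001 -> ib0.8001)
    echo 0x8001 > /sys/class/net/ib0/create_child
    echo 0x8002 > /sys/class/net/ib0/create_child

    # Verify which P_Key a child interface is bound to
    cat /sys/class/net/ib0.8001/pkey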

Active/Active LNET Configuration

- Same physical H/W configuration as before.
- Two P_Keys are created for IPoIB child interfaces on the OSSs and clients.
- Two bond interfaces are enabled over the IPoIB child interfaces, e.g. bond0 is active on HCA0 and bond1 is active on HCA1.
- Two LNETs with o2iblnd are created using the bond interfaces:
    oss:    options lnet networks=o2ib0(bond0),o2ib1(bond1)
    client: options lnet networks=o2ib0(ib0.8001),o2ib1(ib0.8002)
- OST access is restricted per LNET: mkfs.lustre ost .. --network=o2ib

[Diagram: a Lustre client with one HCA exposes child interfaces ib0.8001 and ib0.8002, one per LNET; the OSS exposes ib0.8001/ib0.8002 on HCA0 and ib2.8001/ib2.8002 on HCA1, bonded into bond0 (LNET0) and bond1 (LNET1); the OSS serves its OSTs.]
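Putting the pieces together, a configuration sketch for the OSS side, assuming CentOS 6 style ifcfg files, with bond0 primary on HCA0 (ib0.8001) and bond1 primary on HCA1 (ib2.8002); names and addresses are illustrative:

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    # (bond1 mirrors this with its own IP and primary=ib2.8002)
    DEVICE=bond0
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=active-backup miimon=50 fail_over_mac=active primary=ib0.8001"

    # /etc/sysconfig/network-scripts/ifcfg-ib0.8001
    # (same pattern for ib2.8001, ib0.8002, ib2.8002)
    DEVICE=ib0.8001
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/modprobe.d/lustre.conf on the OSS: one o2ib LNET per bond
    options lnet networks="o2ib0(bond0),o2ib1(bond1)"

    # /etc/modprobe.d/lustre.conf on a client: one o2ib LNET per child interface
    options lnet networks="o2ib0(ib0.8001),o2ib1(ib0.8002)"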

LNET Round-robin for OST Access

Even and odd OSTs are only accessible through either o2ib0 (LNET0) or o2ib1 (LNET1), using the restricted network option.

[Diagram: clients reach the OSS over LNET0 (network=o2ib0, client interface ib0.8001) and LNET1 (network=o2ib1, client interface ib0.8002). On the OSS, o2ib0(bond0) is served by Lustre threads on CPU#0 with HCA0 and o2ib1(bond1) by Lustre threads on CPU#1 with HCA1, across the QPI link; LNET0 serves OST0, OST2, ..., OSTn and LNET1 serves OST1, OST3, ..., OSTn+1.]
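The per-OST restriction can be expressed at format time with the --network option shown on the previous slide; a sketch, assuming illustrative device paths, fsname, and MGS NIDs:

    # Even-numbered OSTs are exported on LNET0 (o2ib0) only
    mkfs.lustre --ost --fsname=lustre --index=0 --network=o2ib0 \
        --mgsnode=192.168.1.1@o2ib0 /dev/sda

    # Odd-numbered OSTs are exported on LNET1 (o2ib1) only
    mkfs.lustre --ost --fsname=lustre --index=1 --network=o2ib1 \
        --mgsnode=192.168.1.1@o2ib1 /dev/sdb

    # The restriction can also be changed later on an existing OST
    tunefs.lustre --network=o2ib1 /dev/sdb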

Benchmark and Failover Testing: Test Configuration

Storage:  1 x SFA12K-40 with 160 x 15K rpm SAS disks
Servers:  1 x MDS, 1 x OSS; each with 2 x 2.6 GHz E5-2670, 64 GB memory, 2 x FDR IB dual-port HCAs (2 ports for LNET and 2 ports for storage)
Clients:  16 x clients; each with 1 x 2.0 GHz E5-2650, 16 GB memory, 1 x QDR IB single-port HCA
Software: CentOS 6.3, Lustre 2.3.63, Mellanox OFED 1.5.3

[Diagram: the 16 clients attach to the FDR IB switch over 16 x QDR IB links; the MDS and OSS each attach with 2 x FDR IB links; the OSS connects to the SFA12K-40 storage.]

Active/Active LNET Bonding Performance

[Charts: single OSS throughput (MB/sec) for write and read vs. number of clients (1, 2, 4, 8, 16), comparing Nobonding, AA-bonding, and AA-bonding (CPU bind). Reference lines mark the single-FDR LNET bandwidth of 6.0 GB/sec and the two-FDR active-active LNET bandwidth of 12.0 GB/sec.]

Performance Evaluation on an Enhanced Hardware Configuration

Additional FDR HCAs are added for the storage connection.

[Diagram: as before, clients use o2ib0(ib0.8001) and o2ib1(ib0.8002); the OSS serves o2ib0(bond0) on HCA0/CPU#0 and o2ib1(bond1) on HCA1/CPU#1 across the QPI link, with the additional HCA2 and HCA3 connecting each CPU to the storage; LNET0 serves OST0, OST2, ..., OSTn and LNET1 serves OST1, OST3, ..., OSTn+1.]

Active/Active LNET Bonding Performance (4 x HCA Configuration)

[Charts: single OSS throughput (MB/sec) for write and read, comparing the 2-HCA and 4-HCA configurations across Nobonding, AA-bonding, and AA-bonding (CPU bind). With 4 HCAs, AA-bonding reaches about 11.3 GB/sec write and 11 GB/sec read, a 1.8x improvement and roughly 94% of the 12.0 GB/sec two-FDR LNET maximum bandwidth.]

Failover Testing and Behavior (1)

Failover test during IOR from 16 clients to the OSS.

[Chart: throughput (MB/sec) over time. Throughput drops when an FDR cable is removed, recovers once the IB failover completes, and after the cable is re-inserted the IB failback completes and performance returns to its original level.]
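One way to observe the failover described above from the OSS and a client; a sketch, with an illustrative server NID:

    # On the OSS: watch the bonding driver promote the standby slave
    watch -n 1 cat /proc/net/bonding/bond0

    # Kernel log shows the link-down event and the slave change
    dmesg | tail

    # From a client: verify the server NID answers again once failover completes
    lctl ping 192.168.1.10@o2ib0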

Failover Testing and Behavior (2)

Administrator-controlled active IB links
- AA LNET bonding can be configured on clients as well as servers.
- The ifenslave command switches the active interface, e.g.:

    # ifenslave bond0 -c ib2.8001
    # cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
    Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
    Primary Slave: ib0.8001 (primary_reselect always)
    Currently Active Slave: ib2.8001
    MII Status: up
    MII Polling Interval (ms): 50
    Up Delay (ms): 5000
    Down Delay (ms): 0
    ...

- Lustre OSTs try to reconnect once the active slave interface is changed.
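After the failed link is repaired, the administrator can move the active slave back and check that the Lustre imports have reconnected; a sketch, assuming the interface names above:

    # Switch the active slave back to the primary interface
    ifenslave bond0 -c ib0.8001

    # Confirm which slave is currently active
    grep "Currently Active Slave" /proc/net/bonding/bond0

    # On a client, check the OSC import/connection state
    lctl get_param osc.*.import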

Conclusion

- Demonstrated an LNET active/active configuration using IB partitions; it performed well.
- A 2 x active/standby bonding configuration works for LNET.
- Achieved more than 94% of the 2 x FDR LNET bandwidth from a single OSS. Max performance: 11.3 GB/sec (write), 11.0 GB/sec (read).
- Failover and failback work well once the IB and bonding drivers detect the link failure/up status.
- User-controlled failover and client-side active/active configurations are possible.
- Application/job-aware network control might be possible using this approach.

LUG13. DataDirect Networks, Information in Motion, Silicon Storage Appliance, S2A, Storage Fusion Architecture, SFA, Storage Fusion Fabric, SFX, Web Object Scaler, WOS, EXAScaler, GRIDScaler, xSTREAMScaler, NAS Scaler, ReAct, ObjectAssure, and In-Storage Processing are all trademarks of DataDirect Networks. Any unauthorized use is prohibited.