Introduction to High-Speed InfiniBand Interconnect


2 What is InfiniBand?
- Industry standard defined by the InfiniBand Trade Association; originated in 1999
- The InfiniBand specification defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems
- InfiniBand is a pervasive, low-latency, high-bandwidth interconnect with low processing overhead, well suited to carrying multiple traffic types (clustering, communications, storage, management) over a single connection
- As a mature and field-proven technology, InfiniBand is used in thousands of data centers, high-performance compute clusters and embedded applications, scaling from small to very large deployments
Source: InfiniBand Trade Association (IBTA) www.infinibandta.org

3 The InfiniBand Architecture
- Industry standard defined by the InfiniBand Trade Association
- Defines a System Area Network architecture
- Comprehensive specification: from the physical layer up to applications
- Architecture supports: Host Channel Adapters (HCA), Target Channel Adapters (TCA), switches, routers
- Facilitates hardware designs for low latency / high bandwidth and transport offload
[Diagram: an InfiniBand subnet of switches connecting processor nodes (HCAs), a storage subsystem and RAID (TCAs), consoles, a Subnet Manager, and gateways to Ethernet and Fibre Channel]

4 InfiniBand Highlights
- Serial high-bandwidth links: SDR 10 Gb/s, DDR 20 Gb/s, QDR 40 Gb/s, FDR 56 Gb/s, EDR 100 Gb/s, HDR 200 Gb/s
- Memory exposed to remote node access: RDMA read, RDMA write, atomic operations
- Quality of Service: independent I/O channels at the adapter level, Virtual Lanes at the link level
- Ultra-low latency: under 1 us application to application
- Reliable, lossless, self-managing fabric: link-level flow control, congestion control to prevent head-of-line (HOL) blocking
- Full CPU offload: hardware-based reliable transport protocol, kernel bypass (user-level applications get direct access to the hardware)
- Cluster scalability/flexibility: up to 48K nodes in a subnet, up to 2^128 in a network; parallel routes between end nodes; multiple cluster topologies possible
- Simplified cluster management: centralized route manager, in-band diagnostics and upgrades

5 InfiniBand Network Stack
[Diagram: the layered InfiniBand network stack. On an InfiniBand node, an application buffer in user code maps through the transport, network, link and physical layers implemented in hardware; an InfiniBand switch relays packets at the link layer, a router relays packets at the network layer, and a legacy node is reached through the same layered path.]

6 InfiniBand Components Overview
- Host Channel Adapter (HCA): device that terminates an IB link, executes transport-level functions and supports the verbs interface
- Switch: a device that routes packets from one link to another within the same IB subnet
- Router: a device that transports packets between IBA subnets
- Bridge: bridges InfiniBand to Ethernet
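Because the HCA is exposed to applications through the verbs interface, a simple way to see the HCAs present on a host is to enumerate them with libibverbs. The program below is only an illustrative sketch (not part of the original slides); it assumes a Linux host with libibverbs installed and is compiled with -libverbs.

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    /* Ask libibverbs for all RDMA devices (HCAs) visible on this host. */
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (!ibv_query_device(ctx, &attr))
            printf("%s: %d port(s), max %d QPs\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);
        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}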

7 Physical Layer - Link Rate
- InfiniBand uses a serial stream of bits for data transfer
- Link width:
  - 1x - one differential pair per Tx/Rx
  - 4x - four differential pairs per Tx/Rx
  - 12x - twelve differential pairs per Tx and per Rx
- Link speed:
  - Single Data Rate (SDR) - 2.5 Gb/s per lane (10 Gb/s for 4x)
  - Double Data Rate (DDR) - 5 Gb/s per lane (20 Gb/s for 4x)
  - Quad Data Rate (QDR) - 10 Gb/s per lane (40 Gb/s for 4x)
  - Fourteen Data Rate (FDR) - 14 Gb/s per lane (56 Gb/s for 4x)
  - Enhanced Data Rate (EDR) - 25 Gb/s per lane (100 Gb/s for 4x)
- Link rate: the product of link width and link speed (see the sketch below)
- Most common shipping today is 4x ports
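The link rate is simply the per-lane signaling rate multiplied by the number of lanes. A minimal sketch of that arithmetic, using the lane speeds listed on the slide (the program itself is illustrative, not from the original material):

#include <stdio.h>

int main(void)
{
    /* Per-lane signaling rates in Gb/s, as listed on the slide. */
    struct { const char *name; double gbps_per_lane; } speeds[] = {
        { "SDR",  2.5 }, { "DDR",  5.0 }, { "QDR", 10.0 },
        { "FDR", 14.0 }, { "EDR", 25.0 },
    };
    int widths[] = { 1, 4, 12 };   /* link widths: 1x, 4x, 12x */

    for (size_t i = 0; i < sizeof speeds / sizeof speeds[0]; i++)
        for (size_t j = 0; j < sizeof widths / sizeof widths[0]; j++)
            /* Link rate = link width x per-lane link speed. */
            printf("%s %2dx: %6.1f Gb/s\n", speeds[i].name, widths[j],
                   widths[j] * speeds[i].gbps_per_lane);
    return 0;
}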

8 Physical Layer - Cables
- Media types:
  - PCB: several inches
  - Passive copper: 20 m SDR, 10 m DDR, 7 m QDR, 3 m FDR, 3 m EDR, 3 m HDR
  - Fiber: 300 m SDR, 150 m DDR, 100/300 m QDR
- Link encoding:
  - SDR, DDR, QDR: 8b/10b encoding
  - FDR, EDR, HDR: 64b/66b encoding
- Industry standard components: copper cables/connectors, optical cables, backplane connectors
[Images: 4x QSFP copper cable, 4x QSFP fiber cable, FR4 PCB]
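The encoding determines how much of the signaled rate carries data: 8b/10b transmits 8 data bits in every 10 signaled bits (80% efficient), while 64b/66b transmits 64 in 66 (about 97% efficient). A small illustrative calculation under those standard ratios (not part of the original slides):

#include <stdio.h>

int main(void)
{
    /* 4x signaling rates in Gb/s and the line encoding each generation uses. */
    struct { const char *name; double signal_gbps; double efficiency; } links[] = {
        { "SDR 4x",  10.0,  8.0 / 10.0 },  /* 8b/10b  */
        { "QDR 4x",  40.0,  8.0 / 10.0 },  /* 8b/10b  */
        { "FDR 4x",  56.0, 64.0 / 66.0 },  /* 64b/66b */
        { "EDR 4x", 100.0, 64.0 / 66.0 },  /* 64b/66b */
    };

    for (size_t i = 0; i < sizeof links / sizeof links[0]; i++)
        printf("%s: %.1f Gb/s signaled -> %.1f Gb/s of data\n",
               links[i].name, links[i].signal_gbps,
               links[i].signal_gbps * links[i].efficiency);
    return 0;
}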

9 Link Layer - Flow Control
- Credit-based link-level flow control
- Link flow control ensures no packet loss within the fabric, even in the presence of congestion
- Link receivers grant packet receive buffer space credits per Virtual Lane
- Flow control credits are issued in 64-byte units
- Separate flow control per Virtual Lane provides:
  - Alleviation of head-of-line blocking
  - Virtual fabrics: congestion and latency on one VL do not impact traffic with guaranteed QoS on another VL, even though they share the same physical link
- A sketch of the credit accounting follows below
[Diagram: per-VL receive buffers behind a demux on the receiver; the transmitter's arbitration/mux sends packets only when credits returned via link control packets indicate available buffer space]
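A toy model of the credit accounting, assuming the 64-byte credit units mentioned above; the structure and function names here are invented for illustration and are not taken from the specification.

#include <stdio.h>
#include <stdbool.h>

#define CREDIT_BYTES 64   /* credits are issued in 64-byte units (per the slide) */

/* Hypothetical per-VL credit state kept by a transmitter. */
struct vl_credits { int credits; };

/* The transmitter may send only if the packet fits within the advertised credits. */
static bool try_send(struct vl_credits *vl, int packet_bytes)
{
    int needed = (packet_bytes + CREDIT_BYTES - 1) / CREDIT_BYTES;
    if (vl->credits < needed)
        return false;          /* back-pressure: wait, never drop */
    vl->credits -= needed;
    return true;
}

/* The receiver returns credits as it frees buffer space on that VL. */
static void return_credits(struct vl_credits *vl, int freed_bytes)
{
    vl->credits += freed_bytes / CREDIT_BYTES;
}

int main(void)
{
    struct vl_credits vl0 = { .credits = 4 };   /* 4 x 64 B of buffer granted */

    printf("send 256B: %s\n", try_send(&vl0, 256) ? "ok" : "blocked");
    printf("send 128B: %s\n", try_send(&vl0, 128) ? "ok" : "blocked");
    return_credits(&vl0, 128);                  /* receiver freed 128 B */
    printf("send 128B: %s\n", try_send(&vl0, 128) ? "ok" : "blocked");
    return 0;
}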

10 Transport Layer - Using Queue Pairs
- QPs come in pairs of work queues (Send/Receive)
- The work queue is the consumer/producer interface to the fabric
- The consumer/producer initiates a Work Queue Element (WQE)
- The Channel Adapter executes the work request
- The Channel Adapter notifies completion or errors by writing a Completion Queue Element (CQE) to a Completion Queue (CQ)
[Diagram: the local QP's transmit queue paired with the remote QP's receive queue, and vice versa, with WQEs flowing between them]
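In the OpenFabrics verbs API this maps directly onto ibv_post_send() (queue a WQE) and ibv_poll_cq() (reap the CQE). A minimal sketch, assuming a connected QP, completion queue and registered memory region (qp, cq, mr, buf) have already been set up elsewhere; error handling is abbreviated and the function is illustrative only.

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one send WQE and wait for its CQE. */
int send_one(struct ibv_qp *qp, struct ibv_cq *cq,
             struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,    /* local buffer described by the WQE */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,          /* local key of the registered region */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,             /* returned in the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,   /* plain SEND: the receiver decides placement */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad;

    if (ibv_post_send(qp, &wr, &bad))    /* hand the WQE to the HCA */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) /* busy-poll for the CQE */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}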

11 Transport Layer - Transfer Operations
- SEND
  - Reads the message from the HCA's local system memory
  - Transfers data to the responder HCA's receive queue logic
  - Does not specify where the data will be written in remote memory
  - Immediate data option available
- RDMA Read
  - Responder HCA reads its local memory and returns it to the requesting HCA
  - Requires remote memory access rights, memory start address and message length
- RDMA Write
  - Requester HCA sends data to be written into the responder HCA's system memory
  - Requires remote memory access rights, memory start address and message length
(A sketch of an RDMA Write work request follows below.)
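Continuing the verbs sketch above, an RDMA Write WQE differs from a SEND only in that the requester supplies the remote start address and remote key (the access rights mentioned on the slide). Here remote_addr and rkey are assumed to have been exchanged out of band; the helper below is an illustrative fragment, not a complete program.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one RDMA Write: place 'len' bytes from 'buf' directly into the
 * responder's memory at 'remote_addr'. Assumes qp and mr already exist. */
int rdma_write_one(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                   uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = { 0 };
    struct ibv_send_wr *bad;

    wr.wr_id      = 2;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;        /* requester chooses placement */
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;     /* start address in remote memory */
    wr.wr.rdma.rkey        = rkey;            /* remote access rights */

    return ibv_post_send(qp, &wr, &bad);      /* 0 on success */
}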

12 Management Model
- IBA management defines a common management infrastructure
- Subnet management
  - Provides methods for a subnet manager to discover and configure IBA devices and manage the fabric
- General management services
  - Subnet administration: provides nodes with information gathered by the SM; provides a registrar for nodes to register the general services they provide
  - Communication establishment and connection management between end nodes
  - Performance management: monitors and reports well-defined performance counters
  - And more

13 Management Model
- General Service Interface (GSI)
  - QP1 (virtualized per port), uses any VL except 15
  - MADs called GMPs: LID-routed, subject to flow control
  - Agents: SNMP tunneling, application-specific, vendor-specific, device management, performance management, communication management (manager/agent), baseboard management, subnet administration
- Subnet Management Interface (SMI)
  - QP0 (virtualized per port), always uses VL15
  - MADs called SMPs: LID-routed or direct-routed, no flow control
  - Subnet Manager (SM) and SM Agent

14 Subnet Management
- The Subnet Manager performs topology discovery, fabric maintenance and initialization
- Each subnet must have a Subnet Manager (SM); additional SMs may run in standby
- Every entity (CA, switch, router) must support a Subnet Management Agent (SMA)
- Initialization uses directed-route packets: LID route / directed-route vector / LID route
- Multipathing: the LMC allows a port to be addressed by multiple LIDs
[Diagram: an active SM (with standby SMs) running on a host, discovering HCAs, TCAs and IB switches through their SMAs]
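With an LMC of n, a port answers to 2^n consecutive LIDs, which is what gives the SM multiple paths to the same endpoint. A small sketch that reads the LID and LMC the SM assigned to a local port via ibv_query_port, assuming an already-opened device context ctx (illustrative only):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Print the LID range the SM assigned to port 1 of an opened HCA context. */
int show_lid_range(struct ibv_context *ctx)
{
    struct ibv_port_attr port;

    if (ibv_query_port(ctx, 1, &port))   /* port numbers start at 1 */
        return -1;

    unsigned int nlids = 1u << port.lmc; /* LMC n -> 2^n LIDs for this port */
    printf("base LID %u, LMC %u -> LIDs %u..%u (%u addressable paths)\n",
           (unsigned)port.lid, (unsigned)port.lmc,
           (unsigned)port.lid, (unsigned)port.lid + nlids - 1, nlids);
    return 0;
}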

15 Cluster Topologies
- Topologies mainly in use for large clusters: fat-tree, 3D torus, mesh, dragonfly
- Fat-tree (also known as CBB)
  - Flat network that can be built oversubscribed or not, in other words blocking or non-blocking (see the sketch below)
  - Typically the lowest-latency network
- 3D torus
  - An oversubscribed network that is easier to scale
  - Fits applications with communication locality
- Dragonfly
  - A generic concept of connecting groups or virtual routers together in a full-graph (all-to-all) way
[Diagram: a 3x3 torus/mesh of nodes labeled (0,0) through (2,2)]
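"Oversubscribed or not" is commonly quantified as the ratio of node-facing downlinks to spine-facing uplinks at each edge switch of the fat-tree. A toy calculation of that ratio; the 36-port switch split used below is an invented example, not taken from the slides.

#include <stdio.h>

int main(void)
{
    /* Example: a 36-port edge switch with 24 ports to nodes and 12 to spines. */
    int downlinks = 24;   /* ports facing compute nodes */
    int uplinks   = 12;   /* ports facing the next tier */

    printf("oversubscription %d:%d (%.1f:1) -> %s\n",
           downlinks, uplinks, (double)downlinks / uplinks,
           downlinks > uplinks ? "blocking" : "non-blocking");
    return 0;
}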

16 OpenFabrics Linux/Windows Software Stack
- OpenFabrics is an open-source software development organization
- OpenFabrics develops the InfiniBand software stack for Linux and Windows
- Contains low-level drivers, core, Upper Layer Protocols (ULPs), tools and documentation
- Available on the OpenFabrics.org web site

17 Software Stack - Upper Layer Protocol Examples
- IPoIB - IP over IB (TCP/UDP over InfiniBand)
- EoIB - Ethernet over IB
- RDS - Reliable Datagram Sockets
- MPI - Message Passing Interface
- iSER - iSCSI for InfiniBand
- SRP - SCSI RDMA Protocol
- uDAPL - User Direct Access Programming Library
- NetworkDirect (Windows only)

Thank You
www.hpcadvisorycouncil.com
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.