Introduction to InfiniBand

Introduction to InfiniBand. FRNOG 22, April 4th 2014. Yael Shenhav, Sr. Director, EMEA & APAC FAE, Application Engineering

The InfiniBand Architecture
- Industry standard defined by the InfiniBand Trade Association
- Defines a System Area Network architecture
- Comprehensive specification, from the physical layer up to applications
- Specification timeline: Rev 1.0 (2000), Rev 1.0a (2001), Rev 1.1 (2002), Rev 1.2 (2004), Rev 1.2.1 (2007), followed by Rev 1.2.1 annexes: XRC (2009), RoCE (2010), FDR/EDR (2011)
- Architecture supports Host Channel Adapters (HCA), Target Channel Adapters (TCA), switches, and routers
[Diagram: an InfiniBand subnet of switches with a subnet manager, connecting processor nodes (HCAs), storage subsystems and RAID (TCAs), consoles, and gateways to Ethernet and Fibre Channel]

InfiniBand Highlights
- Performance: 56 Gb/s per link shipping today; down to 0.6 us application-to-application latency; aggressive roadmap
- Reliable and lossless fabric: link-level flow control; congestion control to prevent head-of-line (HOL) blocking
- Efficient transport: offload, kernel bypass, RDMA and atomic operations, QoS
- Virtualization and application acceleration
- Scalable beyond Petascale computing
- I/O consolidation
- Simplified cluster management: centralized route manager, in-band diagnostics and upgrades
[Figure: InfiniBand roadmap. Source: www.infinibandta.org]

InfiniBand Topologies
- Back-to-back, 2-level fat tree, 3D torus, dual star, Dragonfly, and hybrids
- Scalability to 1,000s and 10,000s of nodes (see the sketch below)
- Full/configurable CBB ratio
- Multi-pathing
- Modular switches are internally based on a fat-tree architecture
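A back-of-the-envelope check of the node counts above: a non-blocking fat tree built from k-port switches supports roughly k*k/2 end nodes with two switch levels and k*k*k/4 with three. The 36-port radix below is an illustrative assumption, not a figure from the slide.

```c
/* Sketch: maximum end-node count of a non-blocking fat tree built from
 * radix-k switches. The radix of 36 is an assumption for illustration. */
#include <stdio.h>

static long fat_tree_nodes(long k, int levels) {
    if (levels == 2) return k * k / 2;       /* 2-level fat tree */
    if (levels == 3) return k * k * k / 4;   /* 3-level fat tree */
    return -1;
}

int main(void) {
    long k = 36;                             /* ports per switch (assumed radix) */
    printf("2-level fat tree, %ld-port switches: %ld nodes\n", k, fat_tree_nodes(k, 2));
    printf("3-level fat tree, %ld-port switches: %ld nodes\n", k, fat_tree_nodes(k, 3));
    return 0;
}
```

With a 36-port radix this gives 648 and 11,664 end nodes, which matches the "1,000s and 10,000s of nodes" scaling claim.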

InfiniBand Protocol Layers
- End nodes implement the full stack: application, ULP, transport layer, network layer, link layer, physical layer
- InfiniBand switches relay packets at the link layer; InfiniBand routers relay packets at the network layer
[Diagram: two InfiniBand nodes communicating through a switch and a router, showing which layers each device implements]

Physical Layer - Cables
- Media types: PCB (several inches); passive copper (20 m SDR, 10 m DDR, 7 m QDR, 3 m FDR); fiber (300 m SDR, 150 m DDR, 100/300 m QDR and FDR)
- Link encoding: SDR, DDR, QDR use 8b/10b encoding; FDR, EDR use 64b/66b encoding (see the data-rate sketch below)
- Industry standard components: copper cables/connectors, optical cables, backplane connectors
[Photos: 4x QSFP copper cable, 4x QSFP fiber cable, FR4 PCB]
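To make the encoding bullet concrete, a small sketch that derives the usable data rate of a 4x link from the per-lane signaling rate and the line encoding. The per-lane rates are the publicly documented IBTA values, not numbers taken from this slide.

```c
/* Sketch: effective data rate of a 4x InfiniBand link from the per-lane
 * signaling rate and the line encoding named on the slide. */
#include <stdio.h>

struct ib_speed { const char *name; double lane_gbd; double enc; };

int main(void) {
    const struct ib_speed speeds[] = {
        { "SDR",  2.5,      8.0 / 10.0 },   /* 8b/10b encoding */
        { "DDR",  5.0,      8.0 / 10.0 },
        { "QDR", 10.0,      8.0 / 10.0 },
        { "FDR", 14.0625,  64.0 / 66.0 },   /* 64b/66b encoding */
        { "EDR", 25.78125, 64.0 / 66.0 },
    };
    const int lanes = 4;                    /* 4x link width */
    for (size_t i = 0; i < sizeof speeds / sizeof speeds[0]; i++) {
        double data_gbps = speeds[i].lane_gbd * lanes * speeds[i].enc;
        printf("%s 4x: %.1f Gb/s usable data rate\n", speeds[i].name, data_gbps);
    }
    return 0;
}
```

The 64b/66b step is why FDR's 4x 14 Gb/s "56 Gb/s" link delivers roughly 54.5 Gb/s of payload bandwidth rather than losing 20% to encoding as the earlier speeds do.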

Link Layer - Addressing and Switching
- Local Identifier (LID) addressing: 48K unicast LIDs, up to 16K multicast LIDs, efficient linear lookup
- Cut-through switching supported
- Multi-pathing support through LMC
- Independent virtual lanes (VLs): flow control (lossless fabric), service levels, VL arbitration for QoS (a WRR sketch follows below)
- Congestion control: Forward/Backward Explicit Congestion Notification (FECN/BECN), per-QP/VL injection rate control
- Data integrity: invariant CRC and variant CRC
[Diagrams: VL arbitration using high/low-priority weighted round robin (WRR); efficient FECN/BECN-based congestion control between HCAs across a switch]
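The arbitration pictured on the slide can be sketched as a two-priority weighted round robin. The tables, weights, and queue depths below are illustrative assumptions, not the exact arbitration state machine from the IBTA specification.

```c
/* Sketch of two-priority weighted round-robin (WRR) VL arbitration:
 * the high-priority table is served first, then the low-priority table
 * is walked in order with each entry sending up to `weight` packets. */
#include <stdio.h>

#define NUM_VLS 4

struct arb_entry { int vl; int weight; };

static struct arb_entry high_tbl[] = { { 0, 4 } };                     /* VL0: high priority */
static struct arb_entry low_tbl[]  = { { 1, 4 }, { 2, 2 }, { 3, 1 } };

static int pending[NUM_VLS] = { 2, 6, 6, 6 };                          /* packets queued per VL */

/* Walk one arbitration table in order; each entry may send up to `weight`
 * packets per pass. Returns the number of packets transmitted. */
static int serve_table(const struct arb_entry *tbl, int entries) {
    int sent = 0;
    for (int i = 0; i < entries; i++) {
        for (int w = 0; w < tbl[i].weight && pending[tbl[i].vl] > 0; w++) {
            pending[tbl[i].vl]--;
            printf("  tx from VL%d\n", tbl[i].vl);
            sent++;
        }
    }
    return sent;
}

int main(void) {
    for (int pass = 1; ; pass++) {
        printf("arbitration pass %d\n", pass);
        int sent = serve_table(high_tbl, 1);      /* high-priority table served first */
        sent += serve_table(low_tbl, 3);          /* then weighted service of the other VLs */
        if (sent == 0)
            break;                                /* all VL queues are empty */
    }
    return 0;
}
```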

Partitions
- Logically divide the fabric into isolated domains
- Partial and full membership per partition
- Partition filtering at switches
- Similar to FC zoning and 802.1Q VLANs
[Diagram: Host A and Host B with I/O devices A-D on one InfiniBand fabric; Partition 1 is inter-host, Partition 2 is private to Host B, Partition 3 is private to Host A, Partition 4 is shared]

Fabric Consolidation with InfiniBand/RoCE
- One wire: a high-bandwidth pipe for capacity provisioning
- Dedicated I/O channels enable convergence of networking, storage, and management traffic while preserving application compatibility
- QoS differentiates the traffic types
- Partitions provide logical fabrics and isolation
- Virtualization with bare-metal performance
- Flexibility: soft servers / fabric repurposing
[Diagram: storage, networking, and management applications on one OS sharing a single InfiniBand HCA]

Transport - Host Channel Adapter (HCA) Model
- Asynchronous interface (verbs): the consumer posts work requests, the HCA processes them, the consumer polls for completions (see the sketch below)
- Transport executed by the HCA
- I/O channel exposed to the application
- Polling and interrupt models supported
[Diagram: queue pairs (send and receive queues) with the consumer posting WQEs and polling CQEs from a completion queue; the HCA's transport and RDMA offload engine feeds the per-port virtual lanes]
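A minimal sketch of the post/poll model in libibverbs terms, assuming a connected QP, a CQ, and a registered MR already exist (connection setup is omitted). Compile with -libverbs.

```c
/* Post one send work request (WQE) and busy-poll the completion queue. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_and_poll(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, size_t len)
{
    /* Consumer posts a work request to the send queue. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion */
    };
    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* The HCA executes the transport; the consumer polls the CQ. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```

The slide also mentions an interrupt model; in libibverbs terms that corresponds to requesting completion events (ibv_req_notify_cq) instead of busy-polling the CQ.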

What is RDMA? The 3 Remote Direct Memory Access (RDMA) goodies
- Transport offload
- Kernel bypass
- RDMA and atomic operations (a one-sided RDMA write sketch follows below)
[Diagram: with TCP/IP, data moves from Application 1's buffer through the OS and NIC in Rack 1 to the NIC, OS, and Application 2's buffer in Rack 2; with RDMA over InfiniBand/Ethernet, the HCAs move data directly between the two application buffers, bypassing the OS]
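A sketch of the third goodie, a one-sided RDMA write, using libibverbs. The remote buffer address and rkey would normally be exchanged out of band; here they are simply parameters, and the surrounding QP/MR setup is assumed.

```c
/* One-sided RDMA write: data is placed directly into the remote
 * application buffer without involving the remote CPU. Assumes a
 * connected RC QP and a registered local MR. */
#include <stdint.h>
#include <infiniband/verbs.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
               void *local_buf, uint32_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,   /* target buffer on the peer */
        .wr.rdma.rkey        = rkey,          /* peer's memory registration key */
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```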

I/O Offload Frees Up CPU for Application Processing
- Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle
- With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle
[Chart: user-space vs. system-space CPU utilization for both cases]

Upper Layer Protocols
- ULPs connect InfiniBand to common interfaces and are supported on mainstream operating systems
- Clustering: MPI (Message Passing Interface), RDS (Reliable Datagram Sockets)
- Network: IPoIB/EoIB (IP/Ethernet over InfiniBand), SDP (Sockets Direct Protocol), VMA user-mode socket accelerator (kernel bypass); unmodified socket applications run over IPoIB, see the sketch below
- Block storage: SRP (SCSI RDMA Protocol), iSER (iSCSI Extensions for RDMA); file storage: NFS over RDMA
[Diagram: socket-based, clustering, and storage applications layered over their ULPs, the InfiniBand core services, and the device driver, split across user space and kernel space above the InfiniBand hardware]
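Because IPoIB/EoIB present ordinary IP interfaces, unmodified socket code runs over the InfiniBand fabric. The plain TCP client below illustrates this; the port and the address (assumed to belong to an IPoIB interface such as ib0) are placeholders, not values from the slide.

```c
/* Plain TCP client; nothing here is InfiniBand-specific, which is the point
 * of the IPoIB/EoIB network ULPs. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5001);                         /* assumed port */
    inet_pton(AF_INET, "192.168.100.2", &peer.sin_addr);   /* assumed IPoIB address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) == 0) {
        const char msg[] = "hello over IPoIB\n";
        write(fd, msg, sizeof msg - 1);
    } else {
        perror("connect");
    }
    close(fd);
    return 0;
}
```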

RoCE - RDMA over Converged Ethernet
- InfiniBand transport over Ethernet, API compatible
- Efficient, light-weight transport, layered directly over Ethernet
- FCoE equivalent for high-performance IPC traffic
- Takes advantage of DCB Ethernet: PFC, ETS, and QCN
- InfiniBand frame: LRH (L2 header) | GRH (L3 header) | BTH+ (L4 header) | IB payload | ICRC | VCRC
- RoCE frame: Ethernet MAC header | EtherType | GRH | BTH+ | IB payload | ICRC | FCS
[Diagram: kernel sockets/TCP/IP over the Ethernet link layer alongside the RDMA transport over the IB/Ethernet link layer, each with its own management]

Latency: RoCE 40GbE vs. QDR vs. FDR (ConnectX-3)
- Ping-pong measurement between nodes A and B: Latency = T_roundtrip / 2
[Chart: latency (us) vs. message size (2 B to 1 KB) for FDR InfiniBand, FDR10 InfiniBand, QDR InfiniBand, and 40GbE RoCE; lower is better]

Throughput: FDR vs. QDR vs. RoCE 40GbE (ConnectX-3)
- N messages are posted from node A to node B and the time T to the last completion is measured: BW = N * MessageSize / T (see the sketch below)
[Chart: bandwidth (Gb/s) vs. message size (2 B to 512 KB) for FDR InfiniBand, FDR10 InfiniBand, QDR InfiniBand, and 40GbE RoCE; higher is better]
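The two measurement formulas from the benchmark slides, written out as code. The input values are illustrative placeholders, not the measured results shown in the charts.

```c
/* Latency = T_roundtrip / 2 and BW = N * MessageSize / T, as on the slides. */
#include <stdio.h>

int main(void) {
    double t_roundtrip_us = 1.3;       /* example ping-pong round-trip time */
    printf("one-way latency: %.2f us\n", t_roundtrip_us / 2.0);

    double n_msgs   = 100000.0;        /* messages posted */
    double msg_size = 65536.0;         /* bytes per message */
    double t_sec    = 1.0;             /* time from first post to last completion */
    double gbps     = n_msgs * msg_size * 8.0 / t_sec / 1e9;
    printf("throughput: %.1f Gb/s\n", gbps);
    return 0;
}
```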

Summary - The Goodies
- RDMA goodies: transport offload, kernel bypass, RDMA and atomic operations
- InfiniBand goodies: true SDN, cheaper, lower latency, higher density (roadmap)

RDMA Inside - Best Kept Secret in the Data Center
- Many of the world's largest Web 2.0 data centers are running on Mellanox interconnects
- InfiniBand and Ethernet (RoCE) RDMA interconnects are connecting millions of servers

50% CAPEX Reduction for Bing Maps
- High-performance system to support map image processing
- 10X performance improvement compared to previous systems
- Half the cost compared to 10GbE
- Mellanox end-to-end InfiniBand 40Gb/s interconnect solutions
- Cost-effective, accelerated Web 2.0 services

ProfitBricks Public Cloud Solution
- InfiniBand-based IaaS
- Deploy more VMs per physical server, saving on CapEx and OpEx

  Provider        Cost per hour (Config 1)   Cost per hour (Config 2)
  ProfitBricks    $0.07                      $0.26
  Amazon EC2      $0.16                      $0.45
  Rackspace       $0.12                      $0.48

  Config 1: 1 core, 2 GB RAM, 50 GB HDD instance
  Config 2: 1 core, 8 GB RAM, 100 GB HDD instance

- ProfitBricks cloud architecture is based on InfiniBand

NoSQL Database Acceleration
- No change to the application required
- Up to 400% increase in the number of clients without adding servers
- Client- and server-side installation
- Up to 400% acceleration over RDMA

Mellanox InfiniBand Connected Petascale Systems
- Connecting half of the world's Petascale systems
[Figure: examples of Mellanox-connected Petascale systems]

Thank You