Introduction to InfiniBand
FRNOG 22, April 4th, 2014
Yael Shenhav, Sr. Director of EMEA/APAC FAE, Application Engineering
The InfiniBand Architecture
- Industry standard defined by the InfiniBand Trade Association
- Defines a System Area Network architecture
- Comprehensive specification, from the physical layer up to applications
- The architecture supports Host Channel Adapters (HCA), Target Channel Adapters (TCA), switches, and routers
- Revision timeline: Rev 1.0 (2000), Rev 1.0a (2001), Rev 1.1 (2002), Rev 1.2 (2004), Rev 1.2.1 (2007), Rev 1.2.1 + XRC (2009), RoCE (2010), FDR/EDR (2011)
[Diagram: an InfiniBand subnet of switches connecting processor nodes (HCAs), a storage subsystem and RAID (TCAs), consoles, a subnet manager, and gateways to Ethernet and Fibre Channel]
InfiniBand Highlights
- Performance: 56Gb/s per link shipping today; down to 0.6us application-to-application latency; aggressive roadmap
- Reliable and lossless fabric: link-level flow control; congestion control to prevent head-of-line (HOL) blocking
- Efficient transport: offload, kernel bypass, RDMA and atomic operations, QoS, virtualization
- Application acceleration; scalable beyond petascale computing; I/O consolidation
- Simplified cluster management: centralized route manager; in-band diagnostics and upgrades
[Chart: InfiniBand roadmap; source: www.infinibandta.org]
InfiniBand Topologies
- Back to back, 2-level fat tree, dual star, 3D torus, DragonFly, and hybrids
- Scalability to 1,000s and 10,000s of nodes
- Full/configurable CBB (constant bisectional bandwidth) ratio
- Multi-pathing
- Modular switches are internally based on a fat-tree architecture
InfiniBand Protocol Layers
- End nodes implement the full stack: application, upper layer protocol (ULP), transport, network, link, and physical layers
- InfiniBand switches relay packets at the link layer
- InfiniBand routers relay packets at the network layer
[Diagram: two InfiniBand nodes communicating through a switch and a router, showing which layers each device terminates]
Physical Layer
- Media types: PCB traces (several inches); passive copper (20m SDR, 10m DDR, 7m QDR, 3m FDR); fiber (300m SDR, 150m DDR, 100/300m QDR & FDR)
- Link encoding: SDR, DDR, and QDR use 8b/10b encoding; FDR and EDR use 64b/66b encoding
- Industry standard components: copper cables/connectors, optical cables, backplane connectors
[Photos: 4X QSFP copper cable, 4X QSFP fiber cable, FR4 PCB]
Link Layer
- Addressing and switching: Local Identifier (LID) addressing; 48K unicast LID addresses; up to 16K multicast LID addresses; efficient linear lookup; cut-through switching supported; multi-pathing support through the LID Mask Control (LMC)
- Independent Virtual Lanes (VLs): flow control (lossless fabric); service levels; VL arbitration for QoS using high- and low-priority weighted round robin (WRR) tables
- Congestion control: Forward/Backward Explicit Congestion Notification (FECN/BECN); per-QP/VL injection rate control
- Data integrity: invariant CRC; variant CRC
[Diagram: VL arbitration with high/low-priority WRR tables and a priority selector at the egress port; efficient FECN/BECN-based congestion control between switches and HCAs]
Partitions
- Logically divide the fabric into isolated domains
- Partial and full membership per partition
- Partition filtering at switches
- Similar to FC zoning or 802.1Q VLANs
[Diagram: Hosts A and B and I/O devices A-D grouped into Partition 1 (inter-host), Partition 2 (private to Host B), Partition 3 (private to Host A), and Partition 4 (shared)]
Fabric Consolidation with InfiniBand/RoCE
- One wire: a high-bandwidth pipe for capacity provisioning
- Dedicated I/O channels enable convergence of networking, storage, and management traffic, with application compatibility
- QoS differentiates traffic types
- Partitions provide logical fabrics and isolation
- Virtualization with bare-metal performance
- Flexibility: soft servers / fabric repurposing
[Diagram: storage, networking, and management applications sharing one IB HCA under a single OS]
Transport: the Host Channel Adapter (HCA) Model
- Asynchronous interface (Verbs): the consumer posts work requests, the HCA processes them, and the consumer polls for completions
- Transport executed by the HCA
- An I/O channel exposed directly to the application
- Polling and interrupt models supported
[Diagram: Queue Pairs (QPs) of send and receive queues; the consumer posts WQEs and polls CQEs from the completion queue; the HCA's transport and RDMA offload engine feeds the VLs on each port]
What is RDMA?
The three Remote Direct Memory Access (RDMA) goodies:
- Transport offload
- Kernel bypass
- RDMA and atomic operations
[Diagram: with TCP/IP, data is copied between application buffers and OS buffers across user space, kernel, and hardware on both ends; with RDMA over InfiniBand/Ethernet, the HCAs move data directly between the two applications' buffers, bypassing the OS]
I/O Offload Frees Up CPU for Application Processing
- Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle
- With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle
[Chart: CPU time split between system space and user space in both cases]
Upper Layer Protocols
- ULPs connect InfiniBand to common interfaces
- Supported on mainstream operating systems
- Clustering: MPI (Message Passing Interface), RDS (Reliable Datagram Socket)
- Network: IPoIB/EoIB (IP/Ethernet over InfiniBand), SDP (Socket Direct Protocol), VMA (user-mode socket accelerator)
- Storage (file/block interfaces): SRP (SCSI RDMA Protocol), iSER (iSCSI Extensions for RDMA), NFS over RDMA
- Kernel bypass: VMA-accelerated socket applications and native IB applications reach the InfiniBand core services directly from user space
[Diagram: the software stack from socket-based, clustering, storage, and IB applications through the ULPs and InfiniBand core services down to the device driver and hardware]
RoCE: RDMA over Converged Ethernet
- InfiniBand transport over Ethernet; API compatible
- Efficient, lightweight transport layered directly over Ethernet
- The FCoE equivalent for high-performance IPC traffic
- Takes advantage of DCB Ethernet: PFC, ETS, and QCN
- InfiniBand packet: LRH (L2 header) | GRH (L3 header) | BTH+ (L4 header) | IB payload | ICRC | VCRC
- RoCE packet: Ethernet MAC + EtherType | GRH | BTH+ | IB payload | ICRC | FCS
[Diagram: the RDMA transport running over either the InfiniBand link layer or the Ethernet link layer, alongside the kernel sockets/TCP/IP path, each with its own management plane]
Latency: RoCE 40GbE vs. QDR vs. FDR (ConnectX-3)
- Ping-pong measurement between nodes A and B: one side posts, the other polls; latency = T_roundtrip / 2
- Lower is better
[Chart: latency (us, 0 to 2) vs. message size (2B to 1KB) for FDR InfiniBand, FDR10 InfiniBand, QDR InfiniBand, and 40GbE RoCE]
Throughput: FDR vs. QDR vs. RoCE 40GbE (ConnectX-3)
- Stream of N messages from A to B: BW = N * MessageSize / T, where T runs from the first post until the last completion is polled
- Higher is better
[Chart: bandwidth (Gb/s, 0 to 60) vs. message size (2B to 512KB) for FDR InfiniBand, FDR10 InfiniBand, QDR InfiniBand, and 40GbE RoCE]
Summary: The Goodies
- RDMA goodies: transport offload, kernel bypass, RDMA and atomic operations
- InfiniBand goodies: true SDN, cheaper, lower latency, higher density (roadmap)
RDMA Inside: the Best Kept Secret in the Data Center
- Many of the world's largest Web 2.0 data centers are running on Mellanox interconnects
- InfiniBand and Ethernet (RoCE) RDMA interconnects are connecting millions of servers
50% CAPEX Reduction for Bing Maps
- High-performance system to support map image processing
- 10X performance improvement compared to previous systems
- Half the cost compared to 10GbE
- Mellanox end-to-end InfiniBand 40Gb/s interconnect solutions
- Cost-effective, accelerated Web 2.0 services
ProfitBricks Public Cloud Solution
- InfiniBand-based IaaS
- Deploy more VMs per physical server, saving on CapEx and OpEx
- ProfitBricks cloud architecture based on InfiniBand

Cost per hour:

  Provider       Config 1   Config 2
  ProfitBricks   $0.07      $0.26
  Amazon EC2     $0.16      $0.45
  Rackspace      $0.12      $0.48

Config 1: 1 core, 2GB RAM, 50GB HDD instance
Config 2: 1 core, 8GB RAM, 100GB HDD instance
NoSQL Database Acceleration
- No change to the application required
- Up to 400% increase in the number of clients without adding servers
- Client- and server-side installation
- Up to 400% acceleration over RDMA
Mellanox InfiniBand Connected Petascale Systems
- Connecting half of the world's petascale systems
[Examples of Mellanox-connected petascale systems]
Thank You