InfiniBand Fast Interconnect


Yuan Liu
Institute of Information and Mathematical Sciences, Massey University
May 2009

Abstract

InfiniBand is a new-generation fast interconnect that provides bandwidth both inside the box and out of the box. The InfiniBand Architecture (IBA) is mainly used by HPC clusters and data centers. This report discusses the architecture's elements and its advantages in the areas of bandwidth, CPU utilization, latency and RAS.

1 Introduction

Amdahl's law states that to gain the best overall speed-up, the different parts of a system should be improved evenly, while Moore's law states that semiconductor performance doubles roughly every 18 months. The resulting imbalance between I/O speed and computation power made the industry and the market yearn for a new-generation interconnect. Two major competing efforts were then born. One was Future I/O (FIO), an input/output architecture developed by IBM, Compaq and HP; it aimed to increase server I/O throughput, the key problem anticipated for the next generation of computing [1]. The other was Next Generation I/O (NGIO), developed by Intel, Microsoft and Sun, whose main focus was enabling relatively rapid PCI replacement in volume segments [2]. FIO and NGIO merged in 2000 as System I/O (SIO), combining the best technologies from each side. The SIO name did not last long and was soon changed to InfiniBand.

InfiniBand is an industry-standard, switch-based, serial I/O interconnect architecture operating at a base speed of 2.5 Gb/s per link in each direction [3]. It is designed to provide a high-speed, low-latency, cost-efficient interconnect with enhanced reliability, and it is mainly used in clusters and data centers, although it also provides bandwidth inside the box to replace the shared bus structure as well as bandwidth out of the box. This report focuses on the InfiniBand architecture elements and on InfiniBand's advantages.
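As a rough illustration of the balanced-system argument, suppose 20% of a job's time is spent in I/O (the fraction is assumed for the example); Amdahl's law then caps the overall speed-up no matter how much faster the processors get. A minimal sketch:

    # Illustration of Amdahl's law with an assumed 20% I/O fraction: if only
    # the compute part speeds up while I/O stays fixed, overall gains flatten.

    def overall_speedup(io_fraction: float, compute_speedup: float) -> float:
        """Amdahl's law: the I/O fraction is the part we do not accelerate."""
        return 1.0 / (io_fraction + (1.0 - io_fraction) / compute_speedup)

    for s in (2, 4, 8, 16, 1000):
        print(f"compute x{s:>4}: overall x{overall_speedup(0.2, s):.2f}")
    # With 20% of the time spent in I/O, even a 1000x faster processor yields
    # less than a 5x overall speed-up; the interconnect has to improve too.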

2 InfiniBand Architecture

2.1 InfiniBand elements

Figure 2.1.1 InfiniBand structure [3]

As figure 2.1.1 shows, the key elements in an InfiniBand fabric are switches, adapters and wiring.

InfiniBand switches. Since InfiniBand is switch based, InfiniBand switches connect end nodes and forward packets from one port to another. They do not generate or consume packets. Links support 1X, 4X and 12X widths, and each width can run at single, double or quad data rate, giving the effective theoretical throughputs shown in figure 2.1.2.

Figure 2.1.2 Effective theoretical throughput in different configurations [6]
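The throughput figures quoted for these configurations follow from a simple calculation: each lane signals at 2.5, 5 or 10 Gb/s for SDR, DDR and QDR respectively, and the 8b/10b line code leaves 80% of that as usable data. A minimal sketch of the arithmetic:

    # Illustrative sketch: effective throughput of an InfiniBand link,
    # assuming 8b/10b encoding so only 80% of the signalling rate carries data.

    LANE_SIGNALLING_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}
    ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line code

    def effective_throughput_gbps(width: int, rate: str) -> float:
        """Effective data rate for a link of `width` lanes (1, 4 or 12)."""
        return width * LANE_SIGNALLING_GBPS[rate] * ENCODING_EFFICIENCY

    for rate in ("SDR", "DDR", "QDR"):
        for width in (1, 4, 12):
            print(f"{width}X {rate}: {effective_throughput_gbps(width, rate):.0f} Gb/s")
    # 12X QDR, for example, signals at 120 Gb/s but carries 96 Gb/s of data.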

HCAs. Host Channel Adapters are very intelligent, capable of handling large numbers of concurrent connections, and typically have a large number of send/receive buffers. They connect processor nodes to switches; they can have one or more ports and attach through a local system bus interface, PCI Express for example.

TCAs. Target Channel Adapters, compared with HCAs, are not as intelligent because of the limited scope of their function. They only need to handle a small number of concurrent connections, so they do not have as much send/receive buffer space as HCAs [3]. They connect I/O nodes to switches; an I/O node can be a single storage device or a group of them.

Wiring. Over twisted-pair copper wires an InfiniBand link can span up to 30 meters, while ordinary fiber cables allow distances of up to 10 km. Figure 2.1.3 shows that the physical transport medium may consist of 1, 4 or 12 lanes, each carrying 2.5 Gb/s and representing a separate physical interface [4].

Figure 2.1.3 InfiniBand physical link [3]

2.2 InfiniBand for clusters: why InfiniBand is better than the alternatives

Before InfiniBand was invented, Ethernet was widely used as the cluster interconnect. In the nine years since InfiniBand's birth, many supercomputers have adopted it because it provides high bandwidth, low CPU utilization, low latency and scalability as well as RAS. This section discusses the benefits of InfiniBand compared with Ethernet and the classic Transmission Control Protocol (TCP).

2.2.1.1 High bandwidth: the TCP bottleneck

Figure 2.2.1 TCP congestion flow control

TCP uses the slow-start algorithm. Slow start sends a single packet, awaits an ACK, then doubles the number of packets sent and awaits the corresponding ACKs, repeating until the congestion window reaches the slow-start threshold. Flow control then switches from slow start to congestion avoidance, which increases the number of packets sent at once by one per successful round, until the router drops a packet, i.e. when the router buffer is full [7]. Figure 2.2.1 shows how flow control and congestion control are achieved. This is a bottleneck for high bandwidth in two ways. First, the slow-start algorithm works well for slow connections and/or continuous transmission, but it does not suit the message-passing behaviour of clusters, which need to send messages around as fast as possible; the time spent in parallel computing should be computation time. With slow start there is a high chance that a message finishes sending before the connection ever reaches the bandwidth limit. Second, congestion avoidance makes the case worse because of its slow, one-packet-at-a-time increase in the number of packets that may be sent at once.
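The window growth described above can be modelled in a few lines. The sketch below is an idealised model (fixed slow-start threshold, no loss handling shown), not a real TCP implementation:

    # Idealised growth of a TCP congestion window: doubling during slow start,
    # growing by one segment per round trip during congestion avoidance.

    def congestion_window_growth(ssthresh: int, rounds: int):
        """Yield the congestion window (in segments) for each round trip."""
        cwnd = 1
        for _ in range(rounds):
            yield cwnd
            if cwnd < ssthresh:
                cwnd *= 2          # slow start: exponential growth
            else:
                cwnd += 1          # congestion avoidance: linear growth
            # A real stack would also shrink cwnd and ssthresh on packet loss.

    print(list(congestion_window_growth(ssthresh=16, rounds=10)))
    # [1, 2, 4, 8, 16, 17, 18, 19, 20, 21] -- many round trips pass before a
    # short cluster message ever reaches the available bandwidth.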

2.2.1.2 InfiniBand eliminates the TCP bottleneck, and more

Figure 2.2.2 InfiniBand flow control [9]

The InfiniBand link layer uses credit-based flow control to avoid congestion: the sender keeps track of the number of free buffer slots in the receiver, so there is no slow start, and by knowing how many buffer slots are free it can use the available bandwidth more efficiently. It does not lose data, because data is not transmitted unless the receiver has advertised credits indicating that receive buffer space is available. Furthermore, it is more powerful than an XON/XOFF scheme (roughly twice as efficient) or the CSMA/CD protocol used by Ethernet [8].
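A minimal model of the credit mechanism, assuming one credit per receive buffer slot (an illustrative simplification, not the IBA wire format):

    # Sketch of credit-based link-level flow control: the sender transmits only
    # while it holds credits, so the receiver's buffer can never overflow and
    # no data is dropped.
    from collections import deque

    class CreditedLink:
        def __init__(self, receive_slots: int):
            self.credits = receive_slots        # advertised free receive buffers
            self.rx_buffer = deque()

        def send(self, packet) -> bool:
            if self.credits == 0:
                return False                    # no credit, no transmission
            self.credits -= 1
            self.rx_buffer.append(packet)       # packet is guaranteed a slot
            return True

        def consume(self):
            packet = self.rx_buffer.popleft()   # receiver drains a buffer slot
            self.credits += 1                   # and returns a credit to the sender
            return packet

    link = CreditedLink(receive_slots=2)
    print([link.send(p) for p in ("a", "b", "c")])  # [True, True, False]
    link.consume()                                  # freeing a slot restores a credit
    print(link.send("c"))                           # True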

2.2.2.1 Head-of-line (HoL) blocking

Figure 2.2.3 Head-of-line blocking

HoL blocking occurs in buffered switches. Because of the first-in-first-out nature of an input buffer, a switch can only forward the packet at the head of the buffer. In the figure 2.2.3 case, packets arriving on input port 1 that are destined for output port 3 cannot be switched even though that port is free, because they have to wait for the packet ahead of them to be consumed by output port 4 first; port 4 is busy serving input ports 1 and 4 at the same time, so the packets for port 3 are blocked. Ethernet and every other network that uses switches has this problem unless the switch builds in a feature to avoid HoL blocking, and nothing in the Ethernet standard specifies one.

2.2.2.2 InfiniBand Virtual Lanes (VLs) alleviate HoL blocking

Each InfiniBand channel adapter, whether HCA or TCA (shown in figure 2.2.2), has one or more ports, and each port has multiple pairs of dedicated send and receive buffers, so it can send and receive concurrently. A Virtual Lane (VL) manages one pair of buffers and has its own flow-control characteristics and priority. VL15 has the highest priority of all VLs and is dedicated to management messages. Each port therefore has a minimum of 2 VLs (VL0 and VL15) and up to 16. By giving traffic its own lanes and by using alternative paths, virtual lanes not only alleviate HoL blocking but also provide load balancing [9] as well as QoS (see Appendix).
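A toy comparison makes the buffering point: with a single FIFO per input port, a packet for a free output waits behind a packet for a busy output, while giving each virtual lane its own queue lets it through. The simulation below is illustrative only; the port numbers and the "busy output" are assumed for the example:

    # Toy illustration of why separate per-virtual-lane buffers alleviate
    # head-of-line blocking in an input-queued switch.
    from collections import deque

    busy_outputs = {4}                          # output port 4 is currently occupied

    def forwardable(queues):
        """Return the destinations that can be forwarded this cycle."""
        sent = []
        for q in queues:
            if q and q[0] not in busy_outputs:
                sent.append(q.popleft())        # only the head of each queue may move
        return sent

    single_fifo = [deque([4, 3])]               # one buffer for the whole port
    per_vl      = [deque([4]), deque([3])]      # one buffer per virtual lane

    print("single FIFO forwards:", forwardable(single_fifo))  # [] -> port 3 blocked
    print("per-VL queues forward:", forwardable(per_vl))      # [3] -> not blocked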

2.2.3 CPU utilization and scalability

In parallel computing we expect the system to scale as needed and the processors to spend their cycles on computation rather than on anything else. This is not entirely true with TCP. Ethernet is not an inherently reliable network [12], so it runs TCP to ensure reliable communication. The downside is that TCP adds a lot of overhead to the CPU and the network: bandwidth leaks away carrying protocol overhead, and valuable CPU cycles are stolen to process it. As a cluster scales this becomes more and more noticeable and makes Ethernet an impediment to building large clusters. At the network level, InfiniBand provides at least ten times better CPU utilization than Gigabit Ethernet clusters by implementing the communications stack in hardware and by taking advantage of RDMA capabilities.

2.2.4 RDMA

A significant advantage of InfiniBand is its Remote Direct Memory Access (RDMA) capability, which allows processor nodes to access memory without going through a heavy, slow protocol like TCP. Over Ethernet (figure 2.2.4.1) it takes three copies across the local system bus to move data from the network to the application through TCP, whereas figure 2.2.4.2 shows the HCA reading and writing directly to and from the memory subsystem in the InfiniBand Architecture. Because RDMA is built into the lowest levels of the network interface, there is no need for a high-overhead protocol driver to verify integrity and demultiplex messages to applications; this frees up CPU cycles and local bus cycles [10].

Figure 2.2.4.1 IPC using TCP over Ethernet [10]

Figure 2.2.4.2 IPC using uDAPL (see Appendix) or SDP over InfiniBand [10]
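The copy-count difference can be caricatured as follows. This is a toy model of the data path, not real RDMA code; the labels in the comments are illustrative rather than an exact description of any particular stack:

    # Toy model: count CPU-driven copies when delivering a message to an
    # application, contrasting a TCP-style path with an RDMA-style path where
    # the HCA writes directly into pre-registered application memory.

    MESSAGE = bytes(64 * 1024)

    def tcp_style_delivery(msg):
        nic_buffer = bytes(msg)             # copy 1: NIC ring -> kernel buffer
        kernel_buffer = bytes(nic_buffer)   # copy 2: protocol processing
        app_buffer = bytes(kernel_buffer)   # copy 3: copy out to the application
        return app_buffer, 3

    def rdma_style_delivery(msg):
        app_buffer = msg                    # HCA places data straight into the
        return app_buffer, 0                # registered application buffer

    for name, fn in (("TCP over Ethernet", tcp_style_delivery),
                     ("RDMA over InfiniBand", rdma_style_delivery)):
        _, copies = fn(MESSAGE)
        print(f"{name}: {copies} CPU copies")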

2.2.5 Latency, scalability, and RAS

A very important factor for HPC (see Appendix) is latency. In message passing we expect messages to be sent and received as fast as possible; to achieve this the system needs high bandwidth, as discussed in section 2.2.1, together with low latency and efficient use of the available bandwidth. Latency directly affects the scalability of a cluster. In clusters using MPI, synchronization between nodes is required in most cases, and the faster the nodes can synchronize, the larger the system can scale. InfiniBand latency is at least 10 times lower than Ethernet according to figure 2.2.5.1; the benchmark results are from the Pallas MPI Benchmarks.

RAS stands for Reliability, Availability, and Serviceability. InfiniBand supports reliability with guaranteed reliable services that deliver packets in order, protected by two CRCs (see Appendix) to ensure data integrity: the 16-bit Variant CRC (VCRC), which covers the whole packet and is recalculated hop by hop, and the 32-bit Invariant CRC (ICRC), which covers the fields that do not change at the link layer, whereas Ethernet uses only a single CRC. Availability is provided because InfiniBand enables redundancy and supports fail-over by switching to an alternative path should a link or device fail. Serviceability is achieved through hot swappability and special management functions, which can be either in-band or out-of-band [12].

Figure 2.2.5.1 Latency results; source: http://vmi.ncsa.uiuc.edu/performance/pmb_lt.php [11]
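The two-CRC scheme can be sketched as follows. The polynomials below are common textbook choices used purely for illustration; the IBA specification defines its own exact polynomials and coverage rules:

    # Illustrative sketch of the two-CRC idea: a 16-bit CRC recomputed hop by
    # hop over the whole packet (like the VCRC) and a 32-bit CRC over the
    # fields that never change end to end (like the ICRC).
    import zlib

    def crc16(data: bytes, poly: int = 0x1021, crc: int = 0xFFFF) -> int:
        """Bitwise CRC-16 (CCITT polynomial, used here for illustration)."""
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
        return crc

    invariant_fields = b"route-header payload"       # unchanged source to destination
    variant_fields = b"hop-specific link header"     # may be rewritten at each switch

    icrc = zlib.crc32(invariant_fields)              # checked once, end to end
    vcrc = crc16(variant_fields + invariant_fields)  # recomputed at every hop
    print(f"ICRC32=0x{icrc:08X}  VCRC16=0x{vcrc:04X}")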

2.3 Current research and interconnects of the future

When NGIO and FIO were born, the goal was to eliminate the I/O bottleneck. On the bandwidth "out of the box" side, according to the InfiniBand roadmap (figure 2.3), InfiniBand will push its bandwidth to nearly 1000 Gb/s within the next three years [15].

Figure 2.3 InfiniBand roadmap

On the other hand, for bandwidth "inside the box", the latest Intel Core i7 family processors integrate the memory controller into the CPU and replace the front-side bus (FSB) with the QuickPath Interconnect (QPI), which provides up to 6.4 GT/s. Research and development in these areas show that eliminating the system I/O bottleneck is a clear trend, as important as improving processor computing power, and thus Amdahl's law stands.

3 Conclusions

The InfiniBand Architecture is an industry-standard fast interconnect. It can coexist with current networks and provides high bandwidth, low latency, high scalability, minimal CPU utilization, RAS, and more. It can also replace the shared bus structure inside a local system with a switched link architecture, although few such boards have been manufactured so far. It is used as the interconnect in many Top 500 supercomputers, including the fastest, IBM Roadrunner, so its main usage today is the "bandwidth out of the box" it provides to connect clusters and data centers. Judging by the configurations of the fastest computers currently available, we can expect more and more clusters and data centers to adopt the InfiniBand Architecture.

4 References

[1] J. Cowan, C. Madison, G. Still, D. Garcia, M. Bradley and K. Potter, Proceedings of the 6th International Conference on Parallel Interconnects, p. 238, 1999.
[2] T. Heil, "InfiniBand Adoption Challenges", InfiniBand: A Paradigm Shift From PCI, 1 Jun. 2000.
[3] Mellanox Technologies Inc., "Introduction to InfiniBand", White Paper, 2003.
[4] T. C. Jepsen, "InfiniBand", Distributed Storage Networks: Architecture, Protocols and Management, pp. 159-174, 2003.
[5] T. M. Ruwart, "InfiniBand: The Next Paradigm Shift in Storage", 18th IEEE Symposium on Mass Storage Systems and 9th NASA Goddard Conference on Mass Storage Systems and Technologies, 17 Apr. 2001.
[6] Wikipedia, "InfiniBand", http://en.wikipedia.org/wiki/infiniband, 18 Apr. 2009.
[7] G. Huston, "TCP Performance", The Internet Protocol Journal, Vol. 3, No. 2.
[8] H. T. Kung and R. Morris, "Credit-Based Flow Control for ATM Networks", IEEE Network Magazine, March 1995.
[9] C. Eddington, "InfiniBridge: An Integrated InfiniBand Switch and Channel Adapter", http://mellanox.com/, 19 Apr. 2009.
[10] Oracle, "Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks", An Oracle White Paper, April 2003.
[11] NCSA, "Latency Results from Pallas MPI Benchmarks", Virtual Machine Interface 2.1, 23 Mar. 2005.
[12] Mellanox Technologies Inc., "InfiniBand Frequently Asked Questions", White Paper, 2003.
[13] S. Shelvapille and V. Puri, "Encapsulation Methods for Transport of InfiniBand over MPLS Networks", Internet-Draft, 12 Mar. 2009.
[14] Cisco Systems, "Quality of Service", Internetworking Technology Handbook, http://www.cisco.com/en/US/docs/internetworking/technology/handbook/QoS.html, 21 Apr. 2009.
[15] InfiniBand Trade Association, "InfiniBand Roadmap", http://www.infinibandta.org/content/pages.php?pg=technology_overview, 27 May 2009.

Appendix A: Definitions

Invariant CRC: A CRC covering the fields of a packet that do not change from the source to the destination. [13]

Variant CRC: A CRC covering all the fields of a packet, including those that may be changed by switches. [13]

DAPL: The Direct Access Programming Library, a high-performance Remote Direct Memory Access API.

High-performance computing (HPC): A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software: programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors.

QoS: Quality of Service refers to the capability of a network to provide better service to selected network traffic over various technologies, including Frame Relay, Asynchronous Transfer Mode (ATM), Ethernet and 802.1 networks, SONET, and IP-routed networks that may use any or all of these underlying technologies. The primary goal of QoS is to provide priority, including dedicated bandwidth, controlled jitter and latency (required by some real-time and interactive traffic), and improved loss characteristics, while ensuring that providing priority for one or more flows does not make other flows fail. QoS technologies provide the elemental building blocks that will be used for future business applications in campus, WAN, and service provider networks. [14]