Myri-10G Myrinet Converges with Ethernet

Myri-10G Myrinet Converges with Ethernet
David PeGan, VP, Sales, dave@myri.com (substituting for Tom Leinberger)
4 October 2006, Oklahoma Supercomputing Symposium

New Directions for Myricom
- Although Myricom has done well (and we hope some good) with Myrinet in HPC, this market is limited
  - We foresee little future for specialty networks
- Technical convergence/standardization and business consolidation have been evident in the computer industry over the past decade
  - Ethernet will prevail over all of the specialty ("Ethernot") networks
- Myricom has great technology for 10-Gigabit Ethernet
  - Myricom products have always installed like Ethernet, carried Ethernet traffic, and been interoperable with Ethernet
- Thus, Myri-10G, our new generation of high-performance networking products, is converging with Ethernet
- Diversification strategy: dual-use 10G Ethernet & 10G Myrinet
  - Programmable NICs are a feature crucial to both modes of operation

Myri-10G is the 4th generation of Myricom products: a convergence that brings 10-Gigabit Ethernet technology into the HPC world, and HPC techniques into the Ethernet world.
- Based on 10G Ethernet PHYs (layer 1); 10 Gbit/s data rates
- NICs support both Ethernet and Myrinet network protocols at the data-link level (layer 2)
- 10-Gigabit Ethernet products from Myricom
  - High performance, low cost, fully compliant with IEEE 802.3ae, interoperable with 10G Ethernet products of other companies
- 4th-generation Myrinet
  - A complete, low-latency cluster-interconnect solution: NICs, software, and switches, software-compatible with Myrinet-2000
  - Switches retain the efficiency and scalability of layer-2 Myrinet switching internally, but may have a mix of 10-Gigabit Myrinet and 10-Gigabit Ethernet ports externally

Myri-10G NICs
- PHY options: 10GBase-CX4, 10GBase-R, and XAUI over ribbon fiber
- These programmable NICs are PCI Express x8
- As protocol-offload 10-Gigabit Ethernet NICs: use them with Myricom's bundled driver and a 10G Ethernet switch, and you'll see near-wire-speed TCP/IP or UDP/IP data rates (Linux 2.6, netperf benchmark, jumbo frames)
- As 10-Gigabit Myrinet NICs: use them with Myricom's Myrinet Express (MX) software and a 10G Myrinet switch, and you'll see (a generic measurement sketch follows this slide):
  - 2.3 µs MPI latency
  - 1.2 GBytes/s data rate
  - Very low host-CPU utilization
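As an illustration of how figures like the 2.3 µs MPI latency above are typically measured, here is a minimal ping-pong sketch in generic MPI C. It is not Myricom's MPICH-MX test harness; the message size and iteration count are assumptions.

```c
/* Minimal MPI ping-pong latency sketch (generic; not Myricom's benchmark code).
 * Rank 0 sends a small message to rank 1 and waits for the echo; half the
 * measured round-trip time is reported as the one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int iters = 10000;          /* assumed iteration count */
    char buf[8] = {0};                /* small message, latency-dominated */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```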

Myri-10G Switches
- Switches are 10-Gigabit Myrinet
  - Cut-through switching with source routing
  - Low latency, economical, and scalable
- Full-bisection Clos networks, based on 16-port and 32-port single-chip crossbar switches
  - These networks are scalable to thousands of nodes using efficient layer-2 switching
- Connection of a 10-Gigabit Myrinet switch fabric to interoperable 10-Gigabit Ethernet ports requires only simple protocol conversion
  - Now available in special switch line cards

Family of Modular Myri-10G Switches
- Up to 128 host ports in the Clos configuration
- 64 host ports + 64 interswitch ports in the Leaf configuration
- Mixed PHYs OK
- Enterprise features
  - Redundant hot-swap power supplies and fans
  - Hot-swap line cards
  - Functional and physical monitoring via dual 10/100 Ethernet ports and a TFT display

Myri-10G Software
- Driver and firmware for 10-Gigabit Ethernet operation are included (bundled) with the NIC
- The broader software support for 10-Gigabit Myrinet and low-latency 10-Gigabit Ethernet is MX (Myrinet Express)
  - MX-10G is the message-passing system for low-latency, low-host-CPU-utilization, kernel-bypass operation of Myri-10G NICs over either 10-Gigabit Myrinet or 10-Gigabit Ethernet
  - MX-2G for Myrinet-2000 PCI-X NICs was released in June 2005; Myricom software support always spans two generations of NICs
  - Includes TCP/IP, UDP/IP, MPICH-MX, and Sockets-MX; MPICH2-MX coming soon. Also available: Open MPI, HP-MPI
- MX-2G and MX-10G are fully compatible at the application level
  - MX-2G applications don't even need to be recompiled; just change LD_LIBRARY_PATH to point to the MX-10G libraries

What's New: MX over Ethernet
- Myricom recently extended MX-10G to operate over 10-Gigabit Ethernet as well as 10-Gigabit Myrinet
- MXoE works with Myri-10G NICs (kernel bypass) and standard 10-Gigabit Ethernet switches
  - 2.4-2.8 µs MPI latency, 1.2 GByte/s one-way data rate (Pallas/IMB benchmarks with low-latency, layer-2, 10-Gigabit Ethernet switches)
  - Nearly on par with MX over Myrinet (MXoM) results
- MXoE uses Ethernet as a layer-2 network, with an MX EtherType to identify MX packets (frames); see the frame-header sketch after this slide
- The Myri-10G NICs can carry IP traffic (IPoE) together with MX (MXoE) traffic
- Myricom is making the MXoE protocols open
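To make the "MX EtherType" idea concrete, here is a minimal C sketch: an MXoE frame is identified purely by the EtherType field of the standard 14-byte Ethernet header, which is how IP (IPoE) and MX (MXoE) traffic can share the same NIC and switch. The slides do not give the actual MX EtherType value, so the constant below is a stand-in (the IEEE local-experimental EtherType 0x88B5), and the structure and names are hypothetical.

```c
/* Sketch of how a layer-2 MXoE frame is identified: a standard Ethernet
 * header whose EtherType field marks the payload as MX rather than IP.
 * The real MX EtherType is not given in the slides; the constant below
 * is a placeholder for illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>   /* htons(), ntohs() */

#define ETHERTYPE_MX_PLACEHOLDER 0x88B5  /* stand-in value, not the actual MX EtherType */

struct eth_header {
    uint8_t  dst[6];        /* destination MAC */
    uint8_t  src[6];        /* source MAC */
    uint16_t ethertype;     /* 0x0800 = IPv4 (IPoE); MX value = MXoE */
} __attribute__((packed));

int main(void)
{
    uint8_t frame[1514] = {0};
    struct eth_header hdr;

    memset(hdr.dst, 0xff, sizeof hdr.dst);            /* example addresses */
    memset(hdr.src, 0x02, sizeof hdr.src);
    hdr.ethertype = htons(ETHERTYPE_MX_PLACEHOLDER);  /* tag the frame as MX */

    memcpy(frame, &hdr, sizeof hdr);                  /* MX payload would follow */
    printf("header is %zu bytes, EtherType 0x%04x\n",
           sizeof hdr, ntohs(hdr.ethertype));
    return 0;
}
```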

Open MXoE Invitation
- Myricom is publishing the "on the wires" protocols for MX over Ethernet
- Myricom invites proposals from universities to produce interoperable, open-source implementations of MXoE using other Ethernet NICs, e.g., Intel or Broadcom GbE NICs, or other 10GbE NICs
  - Myricom will support such research with Myri-10G components and a modest research stipend
- Given the low system-call overhead with Linux 2.6, we expect the performance of MXoE to be excellent even with ordinary GbE NICs
  - Not kernel bypass, but it bypasses the host-OS TCP/IP stack
  - Performance boost for Beowulf clusters

MX Software Interfaces
[Block diagram: applications use MPI, Sockets, or other middleware. TCP/UDP/IP traffic passes through the host-OS kernel's IP stack and the Ethernet driver, while MX traffic takes a kernel-bypass path through the MX driver to the MX firmware in the Myri NIC. The NIC's Ethernet side handles initialization and IP traffic; the ports may be Myrinet-2000, 10G Myrinet, or 10G Ethernet.]

Myri-10G Software Matrix
- Myri10GE driver & NIC firmware:
  - Host APIs and protocols: IP Sockets; TCP/IP and UDP/IP via the host-OS network stack
  - Network protocols: IPoE (IP over Ethernet)
  - Network: Ethernet
- MX-10G driver & NIC firmware:
  - Host APIs and protocols: IP Sockets, Sockets over MX, and MPI over MX; TCP/IP and UDP/IP via the host-OS network stack, plus MX kernel bypass
  - Network protocols: IPoE (IP over Ethernet), MXoE (MX over Ethernet), IPoM (IP over Myrinet), MXoM (MX over Myrinet)
  - Network: Ethernet or Myrinet
- For MX, we have 2 APIs (Sockets, MPI) × 2 protocols (IP, MX) × 2 networks (Ethernet, Myrinet) = 8 possible combinations, all supported and all useful

MX MPI Performance Measurements

  MPI Benchmark                  MX over Myrinet        MX over Ethernet              OpenIB with Intel MPI
                                 (Myricom 128-port      (Fujitsu XG700 12-port or     (Mellanox InfiniBand)
                                 10G Myrinet switch)    MB8AA3020 20-port 10G
                                                        Ethernet switch)
  PingPong latency               2.3 µs                 2.80 µs (12-port),            4.0 µs
                                                        2.63 µs (20-port)
  One-way data rate (PingPong)   1204 MByte/s           1201 MByte/s                  964 MByte/s
  Two-way data rate (SendRecv)   2397 MByte/s           2387 MByte/s                  1902 MByte/s

Notes: The MPI benchmarks for MX are the standard Pallas (now Intel) MPI benchmarks. The data rates are converted from the mebibyte-per-second (2^20 bytes/s) figures reported to the standard MByte/s measure. The MPI benchmarks for OpenIB (with Intel MPI) are from a published OSU Benchmark Comparison, May 11, 2006; the numbers cited are typical of the best of the 45 benchmarks reported. The reported latency does not include the latency of an InfiniBand switch, so the actual in-system latency will be higher. The data rates are from streaming tests, which are less demanding than PingPong tests and produce better throughput numbers.
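To illustrate the MiB/s-to-MByte/s conversion described in the notes (the raw figure here is assumed, for the arithmetic only): a run reported at 1148 MiB/s corresponds to 1148 × 2^20 / 10^6 ≈ 1204 MByte/s.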

Communication/Computation Overlap
- The ability of application programs to make progress while communication takes place concurrently is crucial to the performance of many production computing tasks (a minimal code sketch of this pattern follows this slide)
- [Graphs shown on the slide are from "Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces", Doerfler et al. (Sandia), EuroPVM/MPI, September 2006]
- Systems compared:
  - Red Storm: Cray SeaStar/Portals
  - Tbird: Cisco (Mellanox) InfiniBand/MVAPICH
  - CBC-B: QLogic InfiniPath
  - Odin: Myricom Myri-10G/MPICH-MX
  - Squall: Quadrics QsNetII
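A minimal sketch, in generic MPI C, of the overlap pattern these measurements characterize: nonblocking operations are posted, the host computes, and completion is waited on afterward. How much of the compute loop truly overlaps with the transfer depends on the interconnect and MPI stack making progress without host involvement, which is what the cited study measures; the message size and ring exchange below are assumptions, not the Sandia benchmark.

```c
/* Minimal communication/computation overlap sketch (generic MPI, not the
 * Sandia benchmark): post nonblocking sends/receives, compute while the
 * network (ideally) makes progress, then wait for completion. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                      /* assumed 1M-element message */
    double *sendbuf = malloc(n * sizeof *sendbuf);
    double *recvbuf = malloc(n * sizeof *recvbuf);
    for (int i = 0; i < n; i++) sendbuf[i] = rank + i;

    int right = (rank + 1) % size;              /* simple ring exchange */
    int left  = (rank - 1 + size) % size;
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Computation that does not touch the buffers in flight; with a NIC
     * that progresses transfers independently, this runs concurrently
     * with the communication. */
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += (double)i * 1e-6;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("overlap phase done, acc=%g, recvbuf[0]=%g\n", acc, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```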

MXoE Host-CPU Utilization
- For example, a 1 MByte message:
  - Transmission time: < 1000 µs
  - Host utilization (send or receive): ~10 µs
  - Host-CPU utilization: ~1%
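Spelling out the arithmetic behind these figures (illustrative, using the ~1.2 GByte/s MX data rate cited earlier): 1 MByte takes about 10^6 / 1.2×10^9 s ≈ 833 µs on the wire, while the host CPU is involved for only ~10 µs of that, i.e. roughly 10 / 833 ≈ 1%.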

Ethernet or Myrinet Switching?
- For small clusters, up to the size that can be supported from a single switch, 10-Gigabit Ethernet with MXoE and Myri-10G NICs is capable of performance formerly associated only with specialty cluster interconnects
  - Not only low latency, but also low host-CPU utilization, thanks to MX's kernel-bypass mode of operation
- The Ethernet solutions are limited to smaller clusters that can be served with a single 10-Gigabit Ethernet switch
  - There are performance losses in building larger Ethernet networks by connecting multiple Ethernet switches
- Inasmuch as there are no high-port-count, low-latency, full-bisection 10-Gigabit Ethernet switches on the market today, MX over Myrinet with 10-Gigabit Myrinet switches will continue to be preferred for large clusters because of the economy and scalability of Myrinet switching

Scalability of Myrinet Switching
- MareNostrum cluster in Barcelona: the central switch has 2560 host ports. (Photo courtesy of IBM.)

Extras
- More good things about MX
- 10-Gigabit Ethernet software and performance
- High-performance interoperability

MX over Myrinet: Adaptive Dispersive Routing
- MX takes advantage of the multiple paths through large Myrinet networks to spread packet traffic
  - MX mapping provides multiple routes (usually 8) to each other host
  - Flow-control backpressure sensed on packet injection indicates contention on the route
  - MX changes route when contention is sensed in the network
  - Note: dispersive, multi-path routing may cause packets to be received out of order, but MX can reorder packets on receive; matching (message level) is always in order (a simplified sketch of these ideas follows this slide)
- Eliminates "hot spots" in switch networks
- Adapts the routes to the communication patterns of the application
- Fault tolerance on the time scale of route selection rather than remapping
- Extremely valuable for large switch networks
- Only possible with source-routed networks
[Diagram: Clos network for 128 hosts]
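The following C sketch only illustrates the ideas named above: several source routes per destination, switching routes when backpressure is sensed, and sequence-number reordering on receive. It is not Myricom's MX implementation; the route-selection policy, window size, and data structures are assumptions.

```c
/* Illustrative sketch of adaptive dispersive routing with in-order delivery:
 * keep NUM_ROUTES source routes per destination, rotate away from a route
 * when backpressure (contention) is sensed on injection, and restore packet
 * order on the receive side using sequence numbers. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_ROUTES 8        /* "usually 8" routes per destination host */
#define WINDOW     16       /* assumed receive-side reorder window */

struct dest_state {
    int      current_route;          /* route used for the next packet */
    uint32_t next_seq;               /* next sequence number to assign */
};

/* Sender side: if backpressure was sensed on the last injection, move to
 * another route for subsequent packets. */
static int pick_route(struct dest_state *d, bool backpressure_sensed)
{
    if (backpressure_sensed)
        d->current_route = (d->current_route + 1) % NUM_ROUTES;
    return d->current_route;
}

/* Receiver side: packets may arrive out of order across routes, so they are
 * handed to message matching only when their sequence number is next. */
struct reorder_buf {
    uint32_t expected;               /* next sequence number to deliver */
    bool     present[WINDOW];
};

static void rx_packet(struct reorder_buf *rb, uint32_t seq)
{
    rb->present[seq % WINDOW] = true;
    while (rb->present[rb->expected % WINDOW]) {   /* deliver in order */
        rb->present[rb->expected % WINDOW] = false;
        printf("deliver seq %u to message matching\n", rb->expected);
        rb->expected++;
    }
}

int main(void)
{
    struct dest_state d = {0};
    struct reorder_buf rb;
    memset(&rb, 0, sizeof rb);

    /* Send 4 packets; pretend contention is sensed on the third injection. */
    for (uint32_t i = 0; i < 4; i++)
        printf("send seq %u on route %d\n", d.next_seq++,
               pick_route(&d, /* backpressure_sensed = */ i == 2));

    /* Packets 0..3 arrive out of order; delivery is still in order. */
    uint32_t arrival[4] = {1, 0, 3, 2};
    for (int i = 0; i < 4; i++)
        rx_packet(&rb, arrival[i]);
    return 0;
}
```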

Fabric Management System (FMS)
- An all-in-one Myrinet fabric-management package
  - Switch, host, and link monitoring
  - Fabric control
  - Has largely replaced the earlier mapping software
- A Fabric Management Agent (FMA) process on each host reports to the Fabric Management System (FMS) process
  - FMS maintains a database in order to compare the current state of the fabric with the desired state
- Increased uptime
  - Proactive problem diagnosis and reporting
  - Tools allow monitoring and maintenance during production
- See the documentation on the web: http:///scs/fms/

The Lustre File System over MX
- New software package over MX-2G and MX-10G
  - Runs natively on Myrinet-2000 and Myri-10G networks
- Leverages MX advantages
  - Two-sided communication; a simple kernel library mirrors the user API
  - Lower memory requirements than IB: no queue pairs (IB suffers from N² memory growth); MX needs only ~128 bytes per peer
- Excellent performance
  - Currently with MX-10G: 1165 MB/s read, 1175 MB/s write, and 36K metadata ops/s
  - With IB: reads and writes at 900 MB/s and 30K metadata ops/s
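For scale (illustrative arithmetic, not from the slides): at ~128 bytes per peer, a host in a 1024-node cluster holds roughly 1023 × 128 B ≈ 128 KB of per-peer state, and this grows only linearly with cluster size, whereas per-connection queue-pair state across N hosts grows on the order of N².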

Myri-10G in the 10GbE Market
- Highest-throughput 10-Gigabit Ethernet NICs on the market
  - First PCI Express 10GbE NICs
  - Near wire speed TCP/IP or UDP/IP with 9 KByte jumbo frames
- Demonstrated interoperability with 10-Gigabit Ethernet switches from Foundry, Extreme, HP, Quadrics, Fujitsu, SMC, Force10, Cisco (more to come)
- The Myri10GE driver and firmware are currently available for Linux, Windows, Solaris, Mac OS X, and FreeBSD
  - The driver was contributed to and accepted in the Linux kernel; included in the 2.6.18 and later kernels
- Also an iSCSI target and initiator; see http:///scs/iscsi/
- Bell Micro is Myricom's distributor for 10-Gigabit Ethernet NICs in the US and Europe; OEMs and cluster integrators continue to buy direct from Myricom

10-Gigabit Ethernet Performance
- http:///scs/performance/myri10ge/ includes current performance measurements with Linux, Windows, and Solaris
- The performance is quite good, both in throughput and in host-CPU load, with 9 KByte jumbo frames; our current development projects aim at improving the performance with 1500-byte frames
- The netperf benchmark results below are with Linux:

  Netperf Test    MTU   BW (Mbit/s)  TX_CPU %  RX_CPU %
  -------------   ----  -----------  --------  --------
  UDP_STREAM_TX   9000      9915.38     38.73     00.00
  UDP_STREAM_RX   9000      9908.09     00.00     44.26
  TCP_STREAM      9000      9565.49     49.95     49.97
  TCP_SENDFILE    9000      9496.91     12.40     50.67

10GbE Offloads
- We implement zero-copy on the send side with all OSes and, depending on the OS, use a variety of stateless offloads in the NIC, including:
  - Interrupt coalescing
  - IP and TCP checksum offload, send and receive
  - TSO (TCP Segmentation Offload, also known as Large Send Offload)
  - RSS (Receive-Side Scaling)
  - LRO (Large Receive Offload)
  - Multicast filtering
- We do not currently implement any stateful offloads; these NICs and their software are not TOEs
  - Note that TCP offload generally requires OS-kernel patches

DAS-3: High-Performance Interoperability
- Being installed in the Netherlands; operational by the end of 2006
- Five supercomputing clusters connected by a private DWDM fiber network
- Seamless cluster and grid operation thanks to Myri-10G
  - Myrinet protocols within each cluster; IP protocols between clusters