Chelsio 10G Ethernet Open MPI OFED iWARP with Arista Switch


PERFORMANCE BENCHMARKS
Chelsio 10G Ethernet Open MPI OFED iWARP with Arista Switch
Chelsio Communications | www.chelsio.com | sales@chelsio.com | +1-408-962-3600

Executive Summary

Ethernet provides a reliable and ubiquitous networking protocol for high-performance computing (HPC) environments. When used in conjunction with the Internet Wide Area RDMA Protocol (iWARP), Ethernet delivers an ultra-low-latency, high-performance, and highly scalable interconnect solution for application-layer environments ranging from dozens to thousands of compute nodes. In tests conducted at the Chelsio facility, the results demonstrate successful interoperability between Chelsio's latest SFP+ 10GbE adapter and Arista's 10GbE SFP+ switch. RDMA iWARP MPI performance results show that the application can sustain line rate with approximately 4 us latency.

What is iWARP?

iWARP, also called RDMA over Ethernet, is a low-latency solution for supporting high-performance computing over TCP/IP. Developed by the Internet Engineering Task Force (IETF) and supported by the industry's leading 10GbE adapters, iWARP works with existing Ethernet switches and routers to deliver a low-latency fabric technology for high-performance data centers. In addition to providing all of the total cost of ownership (TCO) benefits of Ethernet, iWARP delivers several distinct advantages for use with Ethernet in HPC environments:

- It is a multivendor solution that works with legacy switches.
- It is an established IETF standard.
- It is built on top of IP, making it routable and scalable from just a few to thousands of collocated or geographically dispersed endpoints.
- It is built on top of TCP, making it highly reliable.
- It allows RDMA and MPI applications to be ported from an InfiniBand (IB) interconnect to an IP/Ethernet interconnect in a seamless fashion.

Chelsio's iWARP and TCP Offload Engine Solutions

Chelsio's T420-LL-CR 10GbE iWARP adapters improve HPC application performance by leveraging an embedded TCP Offload Engine (TOE), a technology that offloads TCP/IP stack processing from the host to the NIC. A TOE frees up server memory, bandwidth, and valuable CPU cycles to improve the performance of applications that are sensitive to these parameters. When used with higher-speed interfaces such as 10GbE, the performance improvement enabled by a TOE is even more dramatic: 10GbE delivers data rates so high that host-based TCP processing can quickly overwhelm even the fastest servers. With the rapid adoption of 10GbE, and the resulting increase in data flow into and out of multi-core servers, TOEs have become a requirement for delivering the high throughput and low latency needed by HPC applications while leveraging Ethernet's ubiquity, scalability, and cost-effectiveness.
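Because iWARP is driven through the same OFED verbs and RDMA CM (librdmacm) interfaces as InfiniBand, the connection-management code an application uses is identical on both fabrics, which is what makes the porting seamless. The minimal client sketch below illustrates this point only; it is not code from the test setup described in this report, the server address 192.168.1.10 and port 7471 are placeholders, and the build line is an assumption.

/* Minimal RDMA CM client sketch: the same code path runs over an iWARP
 * adapter such as the Chelsio T420-LL-CR or over an InfiniBand HCA, because
 * both are reached through the OFED verbs and librdmacm interfaces.
 * Build (assumption): gcc rdma_connect_sketch.c -lrdmacm -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>

int main(int argc, char **argv)
{
    const char *server = (argc > 1) ? argv[1] : "192.168.1.10"; /* placeholder */
    struct rdma_addrinfo hints, *res = NULL;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;          /* reliable, connected service */

    if (rdma_getaddrinfo(server, "7471", &hints, &res)) {  /* port is arbitrary */
        perror("rdma_getaddrinfo");
        return 1;
    }

    memset(&attr, 0, sizeof(attr));
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 4;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.qp_type = IBV_QPT_RC;

    /* Creates the CM id and a queue pair on whatever RDMA device routes to
     * the destination: an iWARP port here, an IB HCA on an IB cluster. */
    if (rdma_create_ep(&id, res, NULL, &attr)) {
        perror("rdma_create_ep");
        return 1;
    }

    if (rdma_connect(id, NULL)) {
        perror("rdma_connect");
        rdma_destroy_ep(id);
        return 1;
    }

    printf("connected to %s via device %s\n",
           server, ibv_get_device_name(id->verbs->device));

    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}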

Test Configuration

Test tool: Intel MPI Benchmark Suite V3.2

Host systems (two, identically configured):
- CPU: 1 x quad-core Intel Core i7-2600K @ 3.4 GHz
- Root complex: Intel Sandy Bridge chipset
- Memory: 8 GB
- OS: RHEL 5.5 (Linux 2.6.18-194.el5), x86_64
- Software: OFED 1.5.3.1, Open MPI 1.4.3
- Adapter: Chelsio T420-LL-CR

Network: the two hosts are connected through an Arista 10G switch.

Results, Chelsio T420-LL-CR with Arista switch:
- One-way, 1-byte latency: 4.27 us
- Unidirectional bandwidth: 1111.42 MB/s
- Bidirectional bandwidth: 2224.11 MB/s
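As a rough sanity check on the line-rate claim, assuming only standard Ethernet/IP/TCP/iWARP framing overhead: 10 Gbit/s divided by 8 gives 1250 MB/s of raw link capacity, so the measured 1111.42 MB/s of MPI payload is roughly 89% of the raw rate, i.e. effectively line rate once per-packet headers are accounted for. The bidirectional figure of 2224.11 MB/s is almost exactly twice the unidirectional figure, as expected on a full-duplex link.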

Tests Performed

PingPing: Measures the startup (latency) and throughput of a single message sent between two processes, with each node issuing the commands concurrently.

Sendrecv: Creates a chain of sending and receiving between upstream and downstream neighbors.

Exchange: Creates a chain of sending and receiving between upstream and downstream neighbors, with each node issuing the send/receive commands concurrently.

Allreduce: Measures the time to perform the allreduce collective operation, where each process in a group owns a vector whose elements are combined with the corresponding elements owned by the other processes, producing a new vector of the same length on every process.

Reduce: Measures the time to perform the reduce collective operation, where each process in a group owns a vector whose elements are combined with the corresponding elements owned by the other processes, producing a new vector of the same length at a root process, with the root of the operation changed cyclically.

Reduce_scatter: Measures the time to perform the reduce_scatter collective operation, where each process in a group owns a vector whose elements are combined with the corresponding elements owned by the other processes, and the resulting vector is split between all processes.

Allgather: Measures the time for every process to send X bytes and gather X*(#processes) bytes.

Allgatherv: Measures the time for every process to send X bytes and gather X*(#processes) bytes, and shows whether MPI produces additional overhead due to the more complicated (vector-length) situation.

Alltoall: Measures the time for every process to send X*(#processes) bytes (X to each process) and receive X*(#processes) bytes (X from each process).

Bcast: Measures the time for a root process to broadcast X bytes to all processes, with the root of the operation changed cyclically.
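For reference, the sketch below shows, in plain MPI C, the style of timing loop behind these measurements. It is a simplified illustration of the PingPing pattern (both ranks send and receive concurrently), not the IMB source; the message size, iteration count, and build/run line are arbitrary choices for illustration.

/* Simplified PingPing-style latency loop in plain MPI C.  Both ranks issue a
 * nonblocking send and a blocking receive concurrently, as in the IMB
 * PingPing pattern.  IMB sweeps many message sizes; this sketch uses one.
 * Build/run (assumption): mpicc pingping.c -o pingping && mpirun -np 2 ./pingping
 */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE 1        /* bytes per message */
#define REPS     1000     /* timed iterations  */

int main(int argc, char **argv)
{
    char sbuf[MSG_SIZE] = {0}, rbuf[MSG_SIZE];
    int rank, size, peer, i;
    MPI_Request req;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    peer = 1 - rank;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        /* Both ranks send and receive at the same time. */
        MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(rbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("PingPing %d-byte latency: %.2f us per message\n",
               MSG_SIZE, (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}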
