
RDMA over Ethernet - A Preliminary Study
Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
HPIDC '09

Outline:
- Introduction
- Problem Statement
- Approach
- Performance Evaluation and Results
- Conclusions and Future Work

- Ethernet and InfiniBand account for the majority of interconnects in high performance distributed computing
- End users want InfiniBand-like latencies on their existing Ethernet infrastructure; this can be achieved if the two networks converge
- Existing options carry overheads or performance tradeoffs
- No solution exists that efficiently combines the ubiquity of Ethernet with the high performance offered by InfiniBand
- RDMA over Ethernet (RDMAoE) currently seems to provide a good option

- Allows running the IB transport protocol over Ethernet frames
- RDMAoE packets are standard Ethernet frames with an IEEE-assigned Ethertype, a GRH, unmodified IB transport headers, and payload
- The InfiniBand HCA takes care of translating InfiniBand addresses to Ethernet addresses and back; it encodes IP addresses into its GIDs and resolves MAC addresses using the host IP stack
- GIDs are used for establishing connections instead of LIDs
- No SM/SA; standard Ethernet management practices are used (see the sketch below)
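Because RDMAoE carries the IP address inside the GID and resolves MAC addresses through the host IP stack, an application can set up a verbs connection with nothing more than an IP address and port. The following is a minimal, hedged sketch of such a client using librdmacm (not code from the talk); the address 192.168.1.10 and port 7471 are placeholders, and error handling and memory registration are reduced to the bare minimum.

```c
/* Hedged sketch: connection setup usable over RDMAoE via librdmacm.
 * The application deals only with an ordinary IP address and port;
 * no SM/SA or LIDs are involved. */
#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_addrinfo hints = { 0 }, *res;
    struct ibv_qp_init_attr qp_attr = { 0 };
    struct rdma_cm_id *id;

    hints.ai_port_space = RDMA_PS_TCP;            /* reliable connected QP */
    if (rdma_getaddrinfo("192.168.1.10", "7471", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }

    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 16;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
    qp_attr.qp_type = IBV_QPT_RC;

    /* Resolve the destination and create the QP: with RDMAoE the IP maps
     * to a GID and the MAC is resolved via the host IP stack rather than
     * through an SA path query. */
    if (rdma_create_ep(&id, res, NULL, &qp_attr)) {
        perror("rdma_create_ep");
        return 1;
    }

    if (rdma_connect(id, NULL)) {                 /* blocks until connected */
        perror("rdma_connect");
        return 1;
    }

    printf("connected; QP ready for verbs send/recv or RDMA operations\n");

    rdma_disconnect(id);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```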

- An industry standard for low latency, high bandwidth System Area Networks
- Multiple features:
  - Two communication types: channel semantics and memory semantics (the RDMA mechanism), illustrated in the sketch below
  - Multiple virtual lanes
  - Quality of Service (QoS) support
- Double Data Rate (DDR) with 20 Gbps bandwidth has been available for some time; Quad Data Rate (QDR) with 40 Gbps bandwidth has become available recently
- Multiple generations of InfiniBand adapters are available now; the latest ConnectX DDR adapters support both IB and RDMAoE modes
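The two communication types map directly onto verbs work requests: channel semantics correspond to two-sided send/receive, while memory semantics correspond to one-sided RDMA operations that name the remote virtual address and rkey. The fragment below is an illustrative sketch (not from the talk), assuming a connected RC queue pair, a registered local buffer, and a peer-advertised remote address and rkey.

```c
/* Illustrative only: assumes an established RC QP (qp), a registered local
 * buffer (buf, len, mr->lkey) and, for the RDMA write, the peer's virtual
 * address and rkey obtained out of band. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int post_example(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = (uint32_t) len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;

    /* Channel semantics: two-sided send, matched by a receive the peer
     * must have posted in advance. */
    memset(&wr, 0, sizeof wr);
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Memory semantics: one-sided RDMA write directly into the peer's
     * registered memory; no receive is consumed on the remote side. */
    memset(&wr, 0, sizeof wr);
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```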

[Figure: Communication stacks compared in this study. Applications run over (1) TCP/IP through an Ethernet switch, (2) IPoIB through an IB switch, (3) native IB verbs (with RDMA) through an IB switch, and (4) OpenFabrics verbs (with RDMA) over RDMAoE through an Ethernet switch; all four paths use ConnectX adapters.]


How do the different communication protocols stack up against each other in terms of:
- Raw sockets / verbs level performance
- Performance for MPI applications
- Performance for data center applications
Does RDMAoE bring us a step closer to the goal of network convergence?


- Protocol-level benchmarks to evaluate very basic performance
- MPI-level benchmarks to evaluate basic MPI performance at both point-to-point and collective levels
- Application-level benchmarks to evaluate the performance of real-world applications
- Evaluation using common data center applications


Compute Platform: Intel Nehalem (Intel Xeon E5530), dual quad-core processors operating at 2.40 GHz, 12 GB RAM, 8 MB cache, PCIe 2.0 interface
Host Channel Adapter: dual-port ConnectX DDR adapter, configured in either RDMAoE mode or IB mode
Network Switches: 24-port Mellanox IB DDR switch; 24-port Fulcrum FocalPoint 10GigE switch
OFED version: OFED-1.4.1 for IB and IPoIB; pre-release version of OFED-1.5 for RDMAoE and TCP/IP
MPI version: MVAPICH-1.1 and MPICH-1.2.7p1

- High performance MPI library for IB and 10GE: MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
- Used by more than 960 organizations in 51 countries
- More than 32,000 downloads from the OSU site directly
- Empowering many TOP500 clusters, including the 8th-ranked 62,976-core cluster (Ranger) at TACC
- Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
- Also supports a uDAPL device to work with any network supporting uDAPL
- http://mvapich.cse.ohio-state.edu/

- OSU Microbenchmarks (OMB), version 3.1.1: http://mvapich.cse.ohio-state.edu/benchmarks/
- Intel MPI Benchmarks (IMB, collective micro-benchmarks), version 3.2: http://software.intel.com/en-us/articles/intel-mpi-benchmarks/
- NAS Parallel Benchmarks (NPB), version 3.3: http://www.nas.nasa.gov/
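As a hedged illustration of what the point-to-point latency numbers on the following slides measure, the sketch below implements a basic ping-pong latency test of the kind performed by osu_latency. This is not the OMB code itself; the message size, iteration count, and warm-up handling are simplified.

```c
/* Simplified ping-pong latency sketch (not the actual OSU benchmark).
 * Rank 0 sends a message to rank 1 and waits for the reply; half the
 * round-trip time averaged over many iterations is the one-way latency. */
#include <stdio.h>
#include <mpi.h>

#define MSG_SIZE 1       /* bytes; osu_latency sweeps over message sizes */
#define ITERS    10000

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("%d-byte one-way latency: %.2f us\n", MSG_SIZE,
               elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```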

[Figure: Latency (us) vs. message size for Native IB, RDMAoE, TCP/IP and IPoIB, small and large messages.]
For small messages, Native IB verbs offers the best latency of 1.66 us; RDMAoE comes very close to this at 3.03 us, while the IP-based options (TCP/IP and IPoIB) show latencies of 25.5 us and 29.9 us.

[Figure: Latency (us) vs. message size for Native IB, RDMAoE, TCP/IP and IPoIB, small and large messages.]
For small messages, Native IB verbs offers the best latency of 1.8 us; RDMAoE comes very close to this at 3.6 us, while the IP-based options show latencies of 45.4 us and 47.9 us.

[Figure: Multi-pair latency (us) vs. message size for Native IB, RDMAoE, TCP/IP and IPoIB, small and large messages.]
4 pairs of processes communicating simultaneously. For small messages, Native IB verbs offers the best latency of 1.66 us; RDMAoE comes very close to this at 3.51 us, while the IP-based options show latencies of 39.4 us and 64.7 us.

[Figure: Latency (us) vs. message size for Native IB, RDMAoE, TCP/IP and IPoIB, small and large messages.]
For small messages, Native IB verbs offers the best latency of 12.97 us; RDMAoE comes very close to this at 22.71 us, while the IP-based options show latencies of 280.7 us and 330.6 us.

[Figure: Latency (us) vs. message size for Native IB, RDMAoE, TCP/IP and IPoIB, small and large messages.]
For small messages, Native IB verbs offers the best latency of 10.05 us; RDMAoE comes very close to this at 12.78 us, while the IP-based options show latencies of 285.4 us and 329.1 us.

[Figure: Normalized execution time for the NAS Parallel Benchmarks (LU, CG, EP, FT, MG) over Native IB, RDMAoE, TCP/IP and IPoIB.]
32 processes, Class C; numbers normalized to Native IB. The performance of Native IB and RDMAoE is very close, with Native IB giving the best performance.

[Figure: File transfer time (s) vs. message size (MB) for Native IB, RDMAoE, TCP/IP and IPoIB.]
We evaluate FTP, a common data center application. We use our own version of FTP over native IB verbs (FTP-ADTS [2]) to evaluate RDMAoE and Native IB; GridFTP [1] is used to evaluate the performance of TCP/IP and IPoIB. RDMAoE shows performance comparable to Native IB.
[1] http://www.globus.org/grid_software/data/gridftp.php
[2] FTP Mechanisms for High Performance Data-Transfer over InfiniBand. Ping Lai, Hari Subramoni, Sundeep Narravula, Amith Mamidala, D. K. Panda. ICPP '09.


- We performed a comprehensive evaluation of all possible modes of communication (Native IB, RDMAoE, TCP/IP, IPoIB) using verbs-, MPI-, application-, and data center-level experiments
- Native IB gives the best performance, followed by RDMAoE
- RDMAoE provides a high performance solution to the problem of network convergence
- As part of future work, we plan to:
  - Perform large-scale evaluations, including studies into the effect of network contention on the performance of these protocols
  - Study these protocols in a comprehensive manner for file systems

{subramon, laipi, luom, panda}@cse.ohio-state.edu Network-Based Computing Laboratory http://mvapich.cse.ohio-state.edu/