Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms. Sayantan Sur, Matt Koop, Lei Chai, and Dhabaleswar K. Panda. Network Based Computing Lab, The Ohio State University.

Presentation Outline: Introduction and Motivation; Problem Statement and Approach Used; Overview of ConnectX Architecture; Micro-benchmark Level Evaluation; Application Level Evaluation; Conclusions and Future Work.

Introduction: InfiniBand is a popular interconnect used for High-Performance Computing. The advent of multi-core computing is changing overall system architecture and performance parameters. Trend: an increasing number of cores per network interface. Older systems had 2-4 processors and 1 network card; most current systems have 8 cores and 1 network card, and the number of cores will keep increasing as per Moore's Law. The next generation of network interfaces may need to be redesigned to improve performance on multi-core platforms.

InfiniBand Overview: InfiniBand is an emerging HPC interconnect; it is an industry standard with open-source software. It offers very good performance with many features: minimum latency ~1.2 us, peak bandwidth ~1500 MB/s, RDMA, atomic operations, Shared Receive Queue, hardware multicast, and Quality of Service. Several generations of InfiniBand hardware exist: third generation, InfiniHost III; fourth generation, ConnectX. Several transport mechanisms are supported: Reliable Connection (RC), Unreliable Datagram (UD), and the new Scalable Reliable Connection (SRC), for which initial API discussion has started.
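To ground the verbs-level operations referred to throughout the talk, here is a minimal sketch (not taken from the paper) of posting a single RDMA Write and busy-polling for its completion with the OpenFabrics Gen2 verbs API. The queue pair qp, completion queue cq, memory region mr, local buffer buf, and the peer's remote_addr/rkey are assumed to have been set up by the usual connection-establishment steps, which are omitted.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post one signaled RDMA Write of `len` bytes and spin until it
 * completes. qp, cq, mr, buf, remote_addr and rkey are set up elsewhere. */
static int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                         /* busy-poll for completion */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```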

Increasing Communication Volume in Multi-Core Clusters [Diagram: multi-core nodes with CPUs, memory, and PCI-Express / PCI-Express Gen2 network interfaces] The amount of communication handled per network interface is increasing exponentially, as Moore's law allows for more CPU cores to be fabricated.

Presentation Outline: Introduction and Motivation; Problem Statement and Approach Used; Overview of ConnectX Architecture; Micro-benchmark Level Evaluation; Application Level Evaluation; Conclusions and Future Work.

Problem Statement: Can network interfaces be designed to offer greater levels of performance for multi-core system architectures? Can we experimentally ascertain the performance improvements offered by the next-generation ConnectX architecture?

Approach Used: Micro-benchmarks: design suitable experiments to study performance at the lowest (network) level, covering RDMA Write/Read latency, RDMA Write bandwidth, and multi-pair performance. Application-level benchmarks: the communication-specific benchmark HALO and the molecular dynamics simulation benchmark LAMMPS.
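As a rough illustration of what a micro-benchmark at this level looks like, the sketch below times a loop of synchronous RDMA Writes and reports the average per-operation time. It reuses the hypothetical rdma_write_sync() helper sketched earlier; the OpenFabrics/Gen2-level tests behind the results that follow are the standard test programs, not this code.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <time.h>

#define ITERS 10000

/* Sketch of a timing harness in the spirit of an RDMA Write latency test.
 * rdma_write_sync() is the helper from the earlier sketch. Returns the
 * average time per signaled RDMA Write, in microseconds. */
double time_rdma_writes(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                        void *buf, uint32_t msg_size,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        rdma_write_sync(qp, cq, mr, buf, msg_size, remote_addr, rkey);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return usec / ITERS;
}
```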

Presentation Outline: Introduction and Motivation; Problem Statement and Approach Used; Overview of ConnectX Architecture; Micro-benchmark Level Evaluation; Application Level Evaluation; Conclusions and Future Work.

ConnectX Overview [Figure: ConnectX architecture, courtesy Mellanox] Fourth-generation silicon: DDR (Double Data Rate) with PCI-Express Gen1 and QDR (Quad Data Rate) with PCI-Express Gen2. Flexibility to configure each individual port to either InfiniBand or 10G. Hardware support for virtualization, Quality of Service, and stateless offloads. In this presentation, we focus on the InfiniBand device.

Performance Improvements in ConnectX: Designed to improve the processing rate of incoming packets; scheduling is done in hardware, with no firmware involvement in the critical path. Improved WQE processing capabilities: the latency difference between RDMA and Send/Receive is 0.2 us, whereas for the earlier-generation InfiniHost III it was 1.15 us. Use of PIO for very small message sizes: inline data allows the payload to be encapsulated in the work request (WR), and PIO mode allows the entire WR to be sent to the HCA without any DMA operations.
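A sketch of how software asks for this path: with the verbs API, a sender sets IBV_SEND_INLINE on a small work request so that the payload is copied into the WQE itself and the HCA needs no DMA read for the data. The helper below is illustrative only; max_inline is whatever inline size the QP was created with (see the QP-creation sketch after the next slide), not a ConnectX-specific constant.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: same RDMA Write post as before, but requesting the inline (PIO)
 * path when the payload fits within the QP's negotiated inline size. */
static int post_write_maybe_inline(struct ibv_qp *qp, struct ibv_mr *mr,
                                   void *buf, uint32_t len, uint32_t max_inline,
                                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    if (len <= max_inline)
        wr.send_flags |= IBV_SEND_INLINE;  /* data copied into the WQE: no DMA fetch of buf */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```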

PIO vs DMA: In DMA mode, the processor writes a small command over the I/O bus and the HCA arranges for the DMA transfer: lower CPU usage, but more I/O bus transactions. In PIO mode, the processor writes the data over the I/O bus directly into an HCA command buffer: increased CPU usage, but fewer I/O bus transactions. For small messages, the CPU usage is negligible, while the number of transactions can be significantly reduced. [Figure: I/O bus transactions for the DMA and PIO modes, showing doorbell, CPUs, memory, and network interface.]
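For completeness, a minimal sketch of where the inline capability is negotiated: the max_inline_data field of the QP capabilities requested at creation time. The specific numbers (128 work requests, 128 bytes of inline data) are illustrative assumptions, and pd, send_cq, and recv_cq are assumed to exist already.

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Sketch: request inline (PIO) capability when creating an RC queue pair.
 * The driver writes the values it can actually support back into attr.cap. */
struct ibv_qp *create_rc_qp_with_inline(struct ibv_pd *pd,
                                        struct ibv_cq *send_cq,
                                        struct ibv_cq *recv_cq)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_type             = IBV_QPT_RC;
    attr.send_cq             = send_cq;
    attr.recv_cq             = recv_cq;
    attr.cap.max_send_wr     = 128;
    attr.cap.max_recv_wr     = 128;
    attr.cap.max_send_sge    = 1;
    attr.cap.max_recv_sge    = 1;
    attr.cap.max_inline_data = 128;   /* request inline support for small messages */
    return ibv_create_qp(pd, &attr);  /* attr.cap now holds the granted values */
}
```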

Presentation Outline: Introduction and Motivation; Problem Statement and Approach Used; Overview of ConnectX Architecture; Micro-benchmark Level Evaluation; Application Level Evaluation; Conclusions and Future Work.

Experimental Platform: four-node Intel Bensley platform with dual Intel Clovertown 2.33 GHz quad-core processors, 4 GB main memory (FBDIMM), and three x8 PCI-Express slots. ConnectX DDR HCAs (MT25408) with firmware 2.0.139 (similar performance expected with current firmware) and InfiniHost III DDR HCAs (MT25218) with firmware version 5.2.0. OpenFabrics Gen2 stack (OFED-1.2), Linux kernel 2.6.20-rc5, and two identical MTS2400 switches.

Multi-Pair RDMA Write Latency [Plots: multi-pair latency on InfiniHost III and on ConnectX] OpenFabrics/Gen2-level RDMA Write test with multiple pairs. InfiniHost III latencies increase linearly with the number of pairs, while ConnectX latencies remain almost the same regardless of the number of pairs. For 256 bytes, the size of the WQE exceeds the PIO limit and DMA is used; our platform does not execute as many concurrent PCI-Express reads as are issued by the ConnectX firmware. For 8 pairs, ConnectX shows a factor of 6 improvement for messages <= 512 bytes.

Multi-Pair RDMA Write Latency Comparison [Plot: multi-pair latency comparison for 8 pairs] For 256 bytes, the size of the WQE exceeds the PIO limit and DMA is used; our platform does not execute as many concurrent PCI-Express reads as are issued by the ConnectX firmware. For 8 pairs, ConnectX shows a factor of 6 improvement for messages <= 128 bytes.

Multi-Pair RDMA Read Latency [Plots: multi-pair read latency on InfiniHost III and on ConnectX] OpenFabrics/Gen2-level RDMA Read test with multiple pairs. InfiniHost III latencies increase linearly with the number of pairs, whereas ConnectX latencies increase by smaller increments. For RDMA Read, PIO cannot be used for the data, so DMA must be used; our chipset does not seem to issue as many concurrent reads. With our settings, ConnectX can issue up to 16 concurrent reads on the bus.
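To make the concurrency point concrete, the sketch below keeps a window of signaled RDMA Reads outstanding on one QP before reaping their completions; how many of the resulting PCI-Express reads actually overlap is then up to the HCA, the QP's max_rd_atomic setting, and the chipset. The window size and buffer layout are assumptions for illustration, not the configuration used in this evaluation.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post `window` RDMA Reads back to back (window must fit within the
 * QP's max_send_wr), then reap all completions. Setup is assumed elsewhere. */
static int post_read_window(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_mr *mr, char *buf, uint32_t len,
                            uint64_t remote_addr, uint32_t rkey, int window)
{
    for (int i = 0; i < window; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)(buf + (size_t)i * len),
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr + (uint64_t)i * len;
        wr.wr.rdma.rkey        = rkey;
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;
    }
    /* Reap all completions for this window. */
    for (int done = 0; done < window; ) {
        struct ibv_wc wc;
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0 || (n == 1 && wc.status != IBV_WC_SUCCESS))
            return -1;
        done += n;
    }
    return 0;
}
```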

Multi-Pair RDMA Read Latency Comparison [Plot: multi-pair read latency comparison for 8 pairs] For 8 pairs, ConnectX shows a factor of 3 improvement for messages <= 512 bytes.

Multi-Pair RDMA Write Bandwidth: OpenFabrics/Gen2-level RDMA Write bandwidth test with multiple pairs. InfiniHost III bandwidth decreases with the number of pairs, while ConnectX bandwidth improves with the number of pairs; for ConnectX, even a single pair can achieve close to the maximum bandwidth. For 8 pairs, ConnectX shows a factor of 10 improvement at 256 bytes.
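For reference, the bandwidth figures in such a test reduce to simple arithmetic: total bytes moved divided by elapsed time, with posting and completion overlapped across a window of outstanding writes. A trivial sketch of that calculation (the names are my own, not from the benchmark) is shown below.

```c
#include <stddef.h>

/* Sketch: bandwidth arithmetic for a window-based RDMA Write test:
 * `iters` messages of `bytes_per_msg` bytes each, completed in
 * `elapsed_usec` microseconds (timed around posting and reaping completions). */
static double bandwidth_mb_per_s(size_t bytes_per_msg, long iters, double elapsed_usec)
{
    double total_bytes = (double)bytes_per_msg * (double)iters;
    return total_bytes / elapsed_usec;   /* bytes per usec equals MB/s (1 MB = 10^6 bytes) */
}
```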

Multi-Pair RDMA Write Bandwidth Comparison [Plot: multi-pair bandwidth comparison for 8 pairs] For 8 pairs, ConnectX shows a factor of 10 improvement at 256 bytes.

Presentation Outline: Introduction and Motivation; Problem Statement and Approach Used; Overview of ConnectX Architecture; Micro-benchmark Level Evaluation; Application Level Evaluation; Conclusions and Future Work.

MVAPICH and MVAPICH2 Software Distributions: high-performance MPI libraries for InfiniBand and iWARP clusters, MVAPICH (MPI-1) and MVAPICH2 (MPI-2). Used by more than 540 organizations in 35 countries and empowering many TOP500 clusters. Available with the software stacks of many InfiniBand, iWARP, and server vendors, including the OpenFabrics Enterprise Distribution (OFED). http://mvapich.cse.ohio-state.edu

HALO Communication Benchmark: simulates the communication characteristics of a layered ocean model. MVAPICH-0.9.9 is used to execute this benchmark, with processes scattered in a cyclic manner. For a small number of tiles {2, 64}, ConnectX performance is better by a factor of 2 to 5.

LAMMPS Benchmark: a molecular dynamics simulator from Sandia National Labs. MVAPICH-0.9.9 is used to execute this benchmark, with processes scattered in a cyclic manner. Loop time is reported as per the benchmark specifications; the in.rhodo benchmark is used. For 16 processes, ConnectX outperforms InfiniHost III by 10%; this is explained by the higher bandwidths of ConnectX when all 8 pairs of processes are communicating.

Presentation Outline: Introduction and Motivation; Problem Statement and Approach Used; Overview of ConnectX Architecture; Micro-benchmark Level Evaluation; Application Level Evaluation; Conclusions and Future Work.

Conclusions and Future Work: In this work we presented network- and application-level performance characteristics of the ConnectX architecture and outlined some of its major architectural improvements; the performance improvements for multi-core architectures are impressive. In the future, we want to study performance on larger-scale clusters and leverage novel features of ConnectX, such as Scalable Reliable Connection (SRC), QoS, and Reliable Multicast, in future MVAPICH designs.

Acknowledgements: Our research is supported by the following organizations. [Slide shows logos of current funding and equipment sponsors.]

Web Pointers: http://nowlab.cse.ohio-state.edu/ MVAPICH web page: http://mvapich.cse.ohio-state.edu