iWARP Learnings and Best Practices


iWARP Learnings and Best Practices
Author: Michael Fenn, Penn State
Date: March 28, 2012

Introduction
- Last year, the Research Computing and Cyberinfrastructure group at Penn State collaborated with Intel to produce data on how iWARP performs in a real environment with real codes.
- Results of that study are available at: http://rcc.its.psu.edu/education/white_papers/pennstate_iwarp_wp_final.pdf
- Today I'll be sharing what we learned along the way about running an iWARP network.

Roadmap
- What is iWARP?
- Our Test Environment
- Networking Considerations
- iWARP Software Setup
- MPI Considerations
- Performance Observations
- Multi-Fabric Hosts
- Conclusions
- Acknowledgements

What is iWARP?
- Internet Wide Area RDMA Protocol
- Essentially, RDMA over TCP, allowing RDMA applications to work over an arbitrary TCP connection
- Ways to lower latency:
  - Kernel-bypass drivers
  - Acceleration of the transport protocols
  - A low-latency, well-provisioned fabric
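
As a quick sanity check that RDMA over TCP is actually working between two hosts, the rping utility from librdmacm can be used; the address below is only a placeholder for the server's iWARP interface.

    # On the server (its iWARP interface assumed to be 192.168.10.1)
    rping -s -a 192.168.10.1 -v -C 10

    # On the client, pointing at the same address
    rping -c -a 192.168.10.1 -v -C 10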

Our Test Environment
- Dell PowerEdge R710 servers
- Two Xeon X5560 processors (2.80 GHz)
- 48 GB DDR3-1333 memory
- Intel NetEffect 10Gb Ethernet Adapter
- Red Hat Enterprise Linux 5.6
- OFED 1.5.2
- Arista 7148SX 10Gb Ethernet Switch

The Fabric
- For good performance in RDMA applications, you need a low-latency Ethernet switch
- Our 7148SX has 1.2 µs latency; it's a couple of years old, and newer switches are well into the 100s of ns
- Dropped packets and retransmissions kill performance
- Use flow control: 802.3x if you must, but 802.1Qbb if you can (see the pause-frame example below)
- Better yet, design a fully non-blocking network
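
For plain 802.3x link-level flow control, pause frames can usually be toggled from the host with ethtool, assuming the iWARP NIC is eth2; 802.1Qbb priority flow control is normally configured on the switch or through DCB tooling instead.

    # Show current pause-frame settings
    ethtool -a eth2

    # Enable link-level (802.3x) flow control in both directions
    ethtool -A eth2 rx on tx on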

Non-blocking Ethernet Fabrics
- Is the dream yet a reality?
- With a small number of hosts, this is easy:
  - < 48 hosts: use a fixed-port switch
  - < ~384 hosts: use a chassis switch with non-oversubscribed line cards
- Beyond that? That darn spanning tree gets in the way of designing a true fat-tree fabric
- Transparent Interconnection of Lots of Links (TRILL) seems to be the solution, but so far implementations are proprietary

Jumbo Frames
- If your network doesn't support jumbo frames yet, don't worry
- VASP results:
  - iWARP without jumbo frames: 52.30 minutes
  - iWARP with jumbo frames: 51.26 minutes
- Surprisingly (at least to me), 1500-byte frames are only about 2% slower than 9000-byte frames in message-passing applications
- My theory is that these classes of application are more latency-sensitive than bandwidth-sensitive
- This was largely borne out in our larger comparison between IB and iWARP
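
For reference, jumbo frames are enabled per interface (the switch ports must allow 9000-byte frames as well); assuming the NetEffect NIC is eth2 on a RHEL 5-era system:

    # Temporary, until the next restart of the interface
    ifconfig eth2 mtu 9000

    # Persistent: add to /etc/sysconfig/network-scripts/ifcfg-eth2
    MTU=9000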

iWARP Software Setup
- BIOS setup is similar to other high-performance RDMA networks (IB, RoCE):
  - Disable C-states
  - Disable PCIe link power management
- Increase memlock ulimits (see the sketch below)
- Need to use a recent OFED; RHEL 5's bundled OFED is too old
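
A minimal sketch of the memlock piece, assuming limits are managed through /etc/security/limits.conf: RDMA applications register (pin) large amounts of memory, so the default locked-memory limit is far too small.

    # /etc/security/limits.conf
    * soft memlock unlimited
    * hard memlock unlimited

    # Verify from a fresh login shell
    ulimit -l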

More Software Setup
- The iw_nes driver needs some extra parameters in /etc/modprobe.conf:

    options iw_nes nes_drv_opt=0x110
    options rdma_cm unify_tcp_port_space=1
    alias eth2 iw_nes
    install iw_nes /sbin/sysctl -w net.ipv4.tcp_sack=0 > /dev/null 2>&1; /sbin/modprobe --ignore-install iw_nes

- Needing to keep track of which eth* device the NetEffect card is can somewhat complicate deployments on diverse hardware
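
Once the module options are in place, a quick check that the adapter is visible to the verbs stack, assuming it shows up as nes0 (the device name used with OpenMPI later), might look like:

    # List RDMA devices known to libibverbs; the NetEffect adapter should appear as nes*
    ibv_devices

    # Show port state and capabilities for the NetEffect device
    ibv_devinfo -d nes0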

More Software Setup
- The file /etc/dat.conf lets you pass parameters to uDAPL providers
- The OpenIB-iwarp and ofa-v2-iwarp entries need to be modified to refer to the NetEffect eth* device
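
As an illustration only (the exact fields depend on your OFED build), an OFED-1.5-style entry pointing the ofa-v2-iwarp provider at the NetEffect interface, assumed here to be eth2, looks roughly like this:

    # /etc/dat.conf
    ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""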

Optional TCP Tuning
- These sysctl parameters control the behavior of the Linux TCP stack, but don't affect the hardware TCP engine in the NetEffect:

    net.ipv4.tcp_timestamps=1
    net.ipv4.tcp_sack=0
    net.ipv4.tcp_rmem=4096 87380 4194304
    net.ipv4.tcp_wmem=4096 16384 4194304
    net.core.rmem_max=131071
    net.core.wmem_max=131071
    net.core.netdev_max_backlog=1000
    net.ipv4.tcp_max_syn_backlog=1024
    net.ipv4.tcp_window_scaling=1
    net.core.rmem_default=126976
    net.core.wmem_default=126976
    net.core.optmem_max=20480
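
To make these settings persistent, they would typically be appended to /etc/sysctl.conf and then loaded, for example:

    # Apply everything currently in /etc/sysctl.conf
    sysctl -p

    # Or set a single parameter immediately
    sysctl -w net.ipv4.tcp_sack=0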

MPI Implementations
- iWARP is well supported by popular MPI implementations:
  - OpenMPI
  - MVAPICH2
  - Intel MPI
  - Platform MPI (née HP-MPI)
- We used OpenMPI and HP-MPI in our testing

OpenMPI
- Example arguments to mpiexec:

    mpiexec -np 16 -machinefile mymachines \
        --mca btl_openib_flags 2 \
        --mca btl_mpi_leave_pinned 0 \
        --mca btl_openib_max_inline_data 64 \
        --mca btl openib,self,sm

- Notice we are using the verbs interface here
- We found that OpenMPI over the verbs interface works well with the NetEffect adapter
- Setting btl_openib_flags to 2 allows the sender to use RDMA writes on the NetEffect adapter
- We are also leaving memory pinned and adjusting some buffer sizes
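
Before tuning openib parameters, it can be worth confirming that the OpenMPI build in use actually contains the openib BTL (which handles both iWARP and InfiniBand); a quick check might be:

    # Verify the openib BTL component is present in this OpenMPI installation
    ompi_info | grep "btl: openib"

    # List the tunable parameters of the openib BTL
    ompi_info --param btl openib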

HP-MPI
- HP-MPI (now Platform MPI) is a very popular MPI implementation for ISVs to use in their software products due to its stable ABI
- HP-MPI uses the uDAPL interface to communicate with the NetEffect card:

    mpirun -UDAPL -prot -intra=shm -e MPI_HASIC_UDAPL=ofa-v2-iwarp -1sided

- The MPI_HASIC_UDAPL environment variable controls which uDAPL provider to use (as set back in /etc/dat.conf)
- However...

More HP-MPI
- /etc/dat.conf parsing is completely broken!
- This was a real head-scratcher until we realized that HP-MPI doesn't actually parse the providers in /etc/dat.conf; it just picks the first one!
- The solution is to make sure that the provider you want to use is the first (or only) line in the file
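
A trivial way to confirm the reordering took effect, assuming comment lines start with '#', is to look at the first provider entry HP-MPI will see:

    # The first provider entry is the one HP-MPI will actually use
    grep -v '^#' /etc/dat.conf | head -1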

Performance Observations
- With our 7148SX: 7.5 µs IMB PingPong latency, of which 2.4 µs is attributable to going through the 7148SX twice
- Application codes scaled well (within the limits of our environment and benchmarks):
  - VASP (OpenMPI)
  - Quantum Espresso Plane Wave (OpenMPI)
  - WRF (OpenMPI)
  - Abaqus (HP-MPI)
  - LAMMPS (OpenMPI)
  - MPPdyna (OpenMPI)
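
To reproduce the latency measurement, a two-rank IMB PingPong run forced onto the verbs path might look like the sketch below; the machinefile and the path to the Intel MPI Benchmarks binary are placeholders.

    # Two ranks on two different hosts, restricted to the openib BTL
    mpiexec -np 2 -machinefile twohosts --mca btl openib,self ./IMB-MPI1 PingPong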

More Performance Observations
- Sometimes we did notice performance degradation:
  - A high amount of time spent in system calls
  - No apparent extra load on the network
- Some more driver tuning would be useful (probably addressed in newer OFED)
- Read the paper for the full results

Multi-Fabric Hosts
- What if you want to have iWARP and InfiniBand interconnects on the same machine?
- You could imagine a situation where RDMA over TCP is used for (say) storage, but InfiniBand is used for the MPI interconnect
- Another case is a benchmarking environment, where it is very useful to be able to run tests back to back with no hardware reconfiguration required
- This is possible, at least for some subset of cases

More Multi-Fabric Hosts
- OpenMPI is easy; it is an MCA parameter:
  - NetEffect: --mca btl_openib_if_include nes0
  - Mellanox: --mca btl_openib_if_include mlx4_0
  - Others are possible
- HP-MPI should be easy: just change MPI_HASIC_UDAPL
- However, since /etc/dat.conf parsing is broken, changing fabrics ends up requiring a system-wide config file change
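
Put together, back-to-back runs over the two fabrics from the same hosts might look like the following sketch; the application name and machinefile are placeholders.

    # Run the job over the NetEffect iWARP adapter
    mpiexec -np 16 -machinefile mymachines --mca btl openib,self,sm --mca btl_openib_if_include nes0 ./my_app

    # Same job over the Mellanox InfiniBand HCA, with no hardware changes
    mpiexec -np 16 -machinefile mymachines --mca btl openib,self,sm --mca btl_openib_if_include mlx4_0 ./my_app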

Conclusions
- iWARP, and RDMA-over-Ethernet networks in general, require a change in mindset
- A 48-port 10GbE switch with a few uplinks is not sufficient
- You need a fully non-blocking network, either in a chassis switch or with TRILL
- NetEffect hardware looks similar enough to IB to be supported by MPI implementations with minor alterations (MPI applications themselves don't care)

More Conclusions
- However, the rest of the ecosystem is still catching up
- ISV codes bundled with old MPI versions are the biggest offenders
- The impact depends on your environment:
  - It could be a non-issue for environments with heavy open-source or community code usage
  - If you rely heavily on ISV codes, it could be a big impediment to an iWARP deployment

Acknowledgements
- Julie Cummings of Intel, for providing expert technical assistance
- Tom Stachura and William Meigs of Intel, for coordinating everything

Questions?