Comparative Performance Analysis of RDMA-Enhanced Ethernet

1 Comparative Performance Analysis of RDMA-Enhanced Ethernet
Casey B. Reardon and Alan D. George, HCS Research Laboratory, University of Florida, Gainesville, FL
Clement T. Cole, Ammasso Inc., Boston, MA
July 24, 2005

2 Outline
I. Paper Overview
II. Background
   A. RDMA and iWARP
   B. AMSO 1100
III. Benchmarks and Configuration
   A. Experimental Testbed Overview
   B. MPI Benchmarks Overview
IV. Experiments and Results
   A. Pallas Results
   B. Gromacs Results
   C. NPB Results
   D. Analysis
V. Conclusions

3 Paper Overview
- RDMA-enhanced Ethernet provides a new alternative for HPC networking
- This work compares the performance of a first-generation RDMA-Ethernet implementation with two popular existing network technologies:
  - Ethernet, the most common interconnect in distributed computing
  - InfiniBand, an existing high-performance RDMA network technology
- Performance is analyzed using a range of parallel benchmarks, from low-level tests to full applications
  - All benchmarks are based on MPI [1]

4 Background: RDMA and iWARP
- RDMA permits data to be moved from the memory subsystem of one system on the network to another with limited involvement from either host CPU
- New standards (e.g. iWARP) are enabling RDMA to be implemented using Ethernet as the physical/datalink layers and TCP/IP as the transport
  - Combines performance and latency advantages of RDMA while maintaining cost, maturity, and standardization benefits of Ethernet and TCP/IP
[Figure: RDMA Concept. Applications (MPI, uDAPL) in user space, OS, memory, and RDMA adapters, with control and data paths through an Ethernet switch]
[Figure: iWARP Layers (RNIC: hardware, software, and firmware). Verbs, RDMAP, DDP, MPA, TCP/IP, Physical]

5 Background: AMSO 1100
- RDMA-capable Ethernet NIC from Ammasso, Inc. [6]
- Implements iWARP, user and kernel DAPL, and MPI specifications [5, 2, 1]
- Hardware interface extensions developed by UC Berkeley and Ohio State to call an IB-based RDMA implementation were modified to call Ammasso iWARP RDMA Verbs
[Figure: Block Diagram of AMSO 1100 RNIC. Application memory, virtual address translation, direct data placement, payload, Ethernet MAC, network, RNIC chip, RDMA & TCP protocol, control, IP]

6 Experimental Testbed Overview
- Cluster of 16 server nodes, each with:
  - Dual AMD Opteron 240 processors (only one used)
  - Tyan Thunder K8S motherboard w/ PCI-X bus slots
  - 1 GB memory
  - SUSE Linux 9
- Conventional Gigabit Ethernet network
  - On-board Broadcom PCI-X BCM5704C controller
  - Nortel Baystack-5510 GigE switch
  - LAM MPI implementation [10]
- RDMA-based Gigabit Ethernet network
  - AMSO 1100 PCI-X RNIC
  - Nortel Baystack-5510 GigE switch
  - Ammasso-distributed implementation of MPICH-1.2.5
- 4X InfiniBand network
  - Voltaire 400LP Host Channel Adapter
  - Voltaire 9024 InfiniBand Switch Router
  - MVAPICH MPI from Ohio State University [11]

7 MPI Benchmarks Overview
- All experiments conducted at University of Florida's HCS Laboratory
- Selected tests from three MPI benchmark suites
  - PMB, Gromacs, NPB
- Pallas MPI Benchmarks (PMB) [7]
  - Includes a variety of raw, low-level MPI communication tests
  - Two tests included from this suite: PingPong and SendRecv
  - PingPong
    - Reports one-way latency as half of the measured round-trip time
    - Throughput derived by dividing message size by latency
  - SendRecv
    - Forms a periodic chain of participating nodes
    - Nodes simultaneously send messages to the next node in the chain while receiving from the previous node
    - Reported latency is the time for all nodes to complete a send
    - Throughput derived by dividing message size by latency
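The PingPong derivations listed above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from PMB: the function names and the sample timing are hypothetical, and treating 1 MB as 10^6 bytes is an assumption (the benchmark suite may use a different convention).

```python
def one_way_latency_us(round_trip_us: float) -> float:
    """PingPong: one-way latency is half of the measured round-trip time."""
    return round_trip_us / 2.0


def throughput_mb_per_s(message_bytes: int, latency_us: float,
                        bytes_per_mb: float = 1e6) -> float:
    """Throughput = message size / latency, converted to MB/s.

    bytes_per_mb = 1e6 is an assumption; pass 2**20 for binary megabytes.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    bytes_per_second = message_bytes / (latency_us * 1e-6)
    return bytes_per_second / bytes_per_mb


# Hypothetical sample (not a measured value): 1024-byte message, 40 us round trip.
latency = one_way_latency_us(40.0)
rate = throughput_mb_per_s(1024, latency)
print(f"latency = {latency:.1f} us, throughput = {rate:.1f} MB/s")
```

The same size-over-latency division applies to SendRecv, with the reported latency being the time for all nodes in the chain to complete a send.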

8 MPI Benchmarks Overview
- Gromacs [8]
  - Developed at Groningen University's Department of Biophysical Chemistry
  - Simulates molecular dynamics systems
  - Very latency-sensitive application
    - Simulations include numerous incremental time-steps
    - Synchronization before each time-step stresses network latencies
  - Three systems considered: Villin, DPPC, and LZM
    - These benchmarks are distributed by the developers of Gromacs
  - Results reported represent the simulation time that can be completed by the system in one day
- NAS Parallel Benchmarks (NPB) [9]
  - Wide array of scientific applications
  - Two tests selected here: integer sort (IS) and conjugate gradient (CG)
  - Class B data sizes used in all tests
[Figure: image courtesy [8]]
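The Gromacs metric described above (simulated time completed per day of wall-clock time) can be computed as follows. This is a minimal sketch under the assumption that a run reports its simulated nanoseconds and elapsed wall-clock seconds; the function name and the sample numbers are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ns_per_day(simulated_ns: float, wall_clock_s: float) -> float:
    """Scale the simulated time of one run to a full day of wall-clock time."""
    if wall_clock_s <= 0:
        raise ValueError("wall-clock time must be positive")
    return simulated_ns * SECONDS_PER_DAY / wall_clock_s


# Hypothetical run (not a measured value): 0.01 ns simulated in 432 s.
example = ns_per_day(0.01, 432.0)
print(f"{example:.2f} ns/day")
```

Higher values are better, which is why the Gromacs results use "Nanoseconds / Day" while the NPB results use execution time.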

9 Experimental Results: PMB - PingPong
[Figures: throughput (MB/s) and latency (µs) vs. frame size (bytes) for GigE, RDMA GigE, and IB (4x)]
- Latencies as low as 6 µs for IB using small messages
- RDMA Ethernet showed latencies over 15 µs lower than conventional Ethernet for small messages

10 Experimental Results: PMB - SendRecv
[Figures: throughput (MB/s) and latency (µs) vs. frame size (bytes) for GigE, RDMA GigE, and IB (4x)]
- Results similar to those seen in PingPong tests
- IB offers an even larger advantage in latency and throughput over the two Ethernet networks
- RDMA Ethernet latencies are 20 µs less than conventional Ethernet

11 Experimental Results: PMB
- IB also offers the highest throughput, as expected
  - 4X IB is capable of approx. ten times the throughput of both Ethernets
  - Graphs become more skewed as message size grows, thus only messages up to 1 KB are included here
- 20 µs latencies seen with RDMA are extremely low for Ethernet
  - This leads directly to the higher throughputs observed with RDMA
- Larger performance disparities seen in SendRecv results
  - SendRecv stresses network performance more than PingPong
- Note: all numbers measured here are MPI latencies
  - Can be highly dependent upon the MPI implementation

12 Experimental Results: Gromacs - Villin
[Figure: nanoseconds/day vs. number of nodes for GigE, RDMA GigE, and IB (4x)]
- IB scales better than either Ethernet in the Villin system
  - Villin is the least computationally intensive of the three Gromacs systems, leading to more frequent synchronizations
  - As system size grows, timesteps (and thus messaging) become more frequent
  - IB is best equipped to handle the bandwidth required by this traffic, but RDMA Ethernet surpasses conventional Ethernet
- For system sizes up to 8 nodes, RDMA performs midway between IB and conventional Ethernet
- Performance gap between conventional and RDMA Ethernet increases for system sizes less than 16 nodes

13 Experimental Results: Gromacs - LZM
[Figure: nanoseconds/day vs. number of nodes for GigE, RDMA GigE, and IB (4x)]
- RDMA Ethernet is able to scale fairly closely with IB for all system sizes
  - Performance almost identical until system size reaches or exceeds 8 nodes
- With a 12-node system, RDMA offers a 25% enhancement over conventional Ethernet, while only 10% less than IB
- Note: LZM benchmark is not capable of scaling beyond 12 nodes
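The percentage comparisons quoted for the 12-node LZM case can be read as relative differences in nanoseconds/day. The sketch below shows one way to compute them; the slide does not state the underlying values, so the three numbers used here are hypothetical placeholders chosen only to reproduce the quoted 25% and 10% figures, and the function names are illustrative.

```python
def percent_gain(value: float, baseline: float) -> float:
    """Improvement of `value` over `baseline`, as a percentage of the baseline."""
    return (value - baseline) / baseline * 100.0


def percent_shortfall(value: float, reference: float) -> float:
    """How far `value` falls below `reference`, as a percentage of the reference."""
    return (reference - value) / reference * 100.0


# Hypothetical ns/day placeholders (NOT values from the study).
conventional_gige = 7.2
rdma_gige = 9.0
infiniband = 10.0

gain_over_gige = percent_gain(rdma_gige, conventional_gige)
below_ib = percent_shortfall(rdma_gige, infiniband)
print(f"RDMA vs GigE: +{gain_over_gige:.0f}%, RDMA vs IB: -{below_ib:.0f}%")
```

Note that the two percentages use different baselines, so they cannot be added or subtracted directly.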

14 Experimental Results: Gromacs - DPPC
[Figure: nanoseconds/day vs. number of nodes for GigE, RDMA GigE, and IB (4x)]
- DPPC exhibits the least variance in performance between interconnects among the Gromacs systems
- Overall performance trends similar to those from the LZM system
- For all cases here, RDMA Ethernet performs more closely to IB than to conventional Ethernet
- Performance of IB and RDMA Ethernet is almost identical for systems with fewer than 8 nodes
  - Less than 10% performance difference at 16 nodes

15 Experimental Results: NPB - IS
[Figure: execution time (s) vs. number of nodes for GigE, RDMA GigE, and IB (4x)]
- Integer sort (IS) appears to be more network-sensitive than other tests in the NPB suite
  - Uses a high proportion of large messages
- RDMA Ethernet offers significant improvement over conventional Ethernet in all cases in this study
  - RDMA Ethernet scales more closely with IB than conventional Ethernet

16 Experimental Results: NPB - CG
[Figure: execution time (s) vs. number of nodes for GigE, RDMA GigE, and IB (4x)]
- Memory accesses are highly prevalent in CG, causing superlinear speedups to be observed
- Conventional and RDMA Ethernet provide almost identical performance for all system sizes
- IB provides ~30% increase in performance over both Ethernet networks for intermediate sizes, but nearly identical at the largest size considered here
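"Superlinear" here means the speedup exceeds the number of nodes, i.e. parallel efficiency above 1. In memory-bound codes this is commonly attributed to each node's share of the data fitting better in cache as nodes are added. The sketch below shows the check; it assumes a single-node baseline, which the slide does not state, and the timings are hypothetical.

```python
def speedup(baseline_time_s: float, parallel_time_s: float) -> float:
    """Ratio of baseline execution time to parallel execution time."""
    return baseline_time_s / parallel_time_s


def efficiency(speedup_value: float, nodes: int, baseline_nodes: int = 1) -> float:
    """Speedup divided by the increase in node count relative to the baseline."""
    return speedup_value / (nodes / baseline_nodes)


def is_superlinear(speedup_value: float, nodes: int, baseline_nodes: int = 1) -> bool:
    """True when speedup grows faster than the node count."""
    return efficiency(speedup_value, nodes, baseline_nodes) > 1.0


# Hypothetical timings (NOT from the study): 100 s on 1 node, 20 s on 4 nodes.
s = speedup(100.0, 20.0)
e = efficiency(s, nodes=4)
print(f"speedup = {s:.2f}, efficiency = {e:.2f}, superlinear = {is_superlinear(s, 4)}")
```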

17 Analysis
- As expected, IB provided the highest performance in all tests
  - IB offers higher bandwidths and lower latencies than both Gigabit Ethernet networks considered here
- RDMA Ethernet showed varying performance gains over conventional Ethernet in every scenario
  - Gains seen in both low-level and application-level experiments
  - For latency-sensitive applications, this step-up in performance is significant
- Offloading of network processing provides additional advantages for RDMA networks
  - Results from tests such as LZM and DPPC in Gromacs, where IB and RDMA Ethernet scale comparably and significantly better than conventional Ethernet, suggest that the RNIC provides processor offloading on par with IB
  - Detailed effects of RDMA processor offloading are left for future research
- Despite lower latencies and processor offloading, RDMA cannot offer significant speedup to all applications, as expected
  - e.g. results from CG in NPB

18 Analysis
- RDMA Ethernet can be a cost-effective alternative to expensive high-performance networks such as InfiniBand for applications sensitive to latency
  - An RNIC such as the AMSO 1100 featured here is expected to sell for far less than typical IB HCAs
  - In addition, per-port costs for Ethernet switches are a fraction of those for IB switches
  - RDMA Ethernet can use existing Ethernet switching infrastructures, further cutting possible implementation costs
- Our study focused only on performance with a modest-sized cluster
  - Superior bandwidth offered by IB may provide better scalability to larger systems
  - As system size grows, so too will the cost disparity between IB and RDMA Ethernet implementations
    - The high number of additional switch ports needed to accommodate large systems may make an IB implementation infeasible for systems with limited budgets
  - Performance analyses with larger clusters are left for future research

19 Conclusions
- As results in this study show, network interconnects can have a major impact on performance in distributed and parallel systems
- IB still provides the best overall performance in all applications among the networks considered, as expected
- RDMA Ethernet showed significant improvement over conventional Ethernet in many applications
  - In some cases, performance of RDMA Ethernet approaches performance of IB
- RDMA and iWARP offer an attractive technology with significant potential for achieving increasingly high performance at low cost
  - Savings become even greater when a user can leverage existing Ethernet infrastructure
- One can foresee RDMA-capable Ethernet provided by default on all servers and operating with iWARP at 10 Gb/s data rates and more

20 References
[1] MPI: A Message Passing Interface Standard v1.1, The MPI Forum.
[2] User-Level Direct Access Transport APIs (uDAPL v1.2), uDAPL Homepage, DAT Collaborative.
[3] T. Talpey, NFS/RDMA, IETF NFSv4 Interim WG meeting, June 4, 2003.
[4] S. Bailey and T. Talpey, The Architecture of Direct Data Placement (DDP) and Remote Direct Memory Access (RDMA) on Internet Protocols, The IETF Internet Report, Feb. 2, 2005.
[6] Ammasso AMSO 1100, Ammasso Inc.
[7] Pallas MPI Benchmarks PMB, Part 1, Pallas GmbH.

21 References
[8] Gromacs: The World's Fastest Molecular Dynamics, Dept. of Biophysical Chemistry, Groningen University.
[9] NAS Parallel Benchmarks, NASA Advanced Supercomputing Division.
[10] LAM/MPI: Enabling Efficient and Productive MPI Development, University of Indiana at Bloomington.
[11] MVAPICH: MPI for InfiniBand over VAPI Layer, Networked-Based Computing Lab, Ohio State University, June 2003.
[12] J. Boisseau, L. Carter, K. Gatlin, A. Majumdar, and A. Snavely, NAS Benchmarks on the Tera MTA, Proc. of Workshop on Multi-Threaded Execution, Architecture, and Compilers, Las Vegas, NV, February 1-4.
