Performance Analysis and Evaluation of LANL's PaScalBB I/O Nodes Using Quad-Data-Rate InfiniBand and Multiple 10-Gigabit Ethernet Bonding
Hsing-bung Chen, Alfred Torrez, Parks Fields
HPC-5, Los Alamos National Lab, Los Alamos, New Mexico 87111, USA
{hbchen, atorrez, parks}@lanl.gov

Juan C. Franco, Daniel Illescas, Rocio Perez-Medina, Jharrod LaFon, Ben Haynes, John Herrera
INST-OFF, HPC Summer School, Los Alamos National Lab

Abstract - In LANL's PaScalBB network, I/O nodes carry data traffic between backend compute nodes and global scratch-based file systems. An I/O node is normally equipped with one InfiniBand NIC for backend traffic and one or more 10-Gigabit Ethernet NICs for parallel file system data traffic. With the growing deployment of multiple multi-core processors in server and storage systems, overall platform efficiency and CPU and memory utilization depend increasingly on interconnect bandwidth and latency. PCI-Express (PCIe) generation 2.0 has recently become available and has doubled the transfer rates available. This additional I/O bandwidth balances the system and makes higher data rates for external interconnects such as InfiniBand feasible. As a result, InfiniBand Quad-Data-Rate (QDR) mode has become available on the InfiniBand Host Channel Adapter (HCA) with a 40 Gb/sec signaling rate. Combining HCA QDR data rates with multiple 10-Gigabit Ethernet links in an I/O node has created the potential to solve some of the I/O traffic bottlenecks that currently exist. We set up a small-scale PaScalBB testbed and conducted a sequence of I/O node performance tests. The goal of this I/O node performance testing is to identify an enhanced network configuration that we can apply to LANL's Cielo machine and to future LANL HPC machines using the PaScalBB architecture.

Keywords - Server I/O networking, High Performance Networking, InfiniBand, 10 Gigabit Ethernet, Link aggregation, Load balancing

1 INTRODUCTION

Commercial off-the-shelf based cluster computing systems have delivered reasonable performance to technical and commercial areas for years. High-speed computing, global storage, and networking (IPC and I/O) are the three most critical elements for building a large-scale HPC cluster system. Unless these three elements are well balanced, we cannot fully utilize an HPC cluster. High-bandwidth I/O networking provides a data super-highway to meet the needs of constantly increasing computation power and storage capacity.

LANL's PaScalBB server I/O architecture is designed to support data-intensive scientific applications running on very large-scale clusters. The main goal of PaScalBB is to provide high-performance, efficient, reliable, parallel, and scalable I/O capabilities for such applications. Data-intensive scientific simulation-based analysis normally requires efficient transfer of a huge volume of complex data among simulation, visualization, and data manipulation functions. To date, PaScalBB has been implemented on most of the HPC production machines at LANL: Roadrunner (the first Petaflops machine), RedTail, LOBO, Turing, TLCC, and others.

I/O nodes are used in LANL's PaScalBB network to carry data traffic between backend compute nodes and global scratch-based file systems. An I/O node is normally equipped with one InfiniBand NIC for backend IPC traffic and one or more 10-Gigabit Ethernet NICs for parallel file system data traffic. With the growing deployment of multiple multi-core processors in server and storage systems, overall platform efficiency and CPU and memory utilization depend increasingly on interconnect bandwidth and latency.
PCI-Express (PCIe) generation 2.0 has recently become available and has doubled the transfer rates available. This additional I/O bandwidth balances the system and makes higher data rates for external interconnects such as InfiniBand feasible. As a result, InfiniBand Quad-Data-Rate (QDR) mode has become available on the InfiniBand Host Channel Adapter (HCA) with a 40 Gb/sec signaling rate. Combining HCA QDR rates with multiple 10-Gigabit Ethernet links has the potential to solve some of the I/O traffic bottlenecks that currently exist. We set up a small-scale PaScalBB testbed and conducted a sequence of I/O node performance tests. The goal of this testing is to identify an enhanced network configuration that we can apply to LANL's Cielo machine and to future LANL HPC machines using the PaScalBB architecture.

The rest of this paper is organized as follows. In Section 2 we describe LANL's PaScalBB server I/O infrastructure. Section 3 introduces InfiniBand/QDR and 10-Gigabit Ethernet technologies. We then illustrate our experimental setup and discuss testing results and performance data in Section 4. Finally, we present our conclusions and future work in Section 5.

2 PASCALBB SERVER I/O BACKBONE ARCHITECTURE

LANL's PaScalBB [10] adopts several hardware and software components to provide a unique and scalable server I/O networking architecture.
Figure 1 illustrates the system components used in PaScalBB.

2.1 Hardware Components used in PaScalBB

2.1.1 Level-1 High-Speed Interconnection Network

The Level-1 interconnect uses (a) high-speed interconnect systems such as Quadrics, Myrinet, or InfiniBand to fulfill the requirements of low-latency, high-speed, high-bandwidth cluster IPC communication and (b) aggregated I/O-aware multi-path routes for load balancing and failover.

2.1.2 Level-2 IP-based Interconnection Network

The Level-2 interconnect uses multiple Gigabit Ethernet switches/routers with layer-3 network routing support to provide latency-tolerant I/O communication and global IP-based storage systems. Without using a federated network solution, we can linearly expand the Level-2 IP-based network by employing a global host domain multicasting feature in the metadata servers of a global file system. With this support we can maintain a single-name-space global storage system and provide a linearly growing cost path for I/O networking.

2.1.3 Compute node

A compute node is equipped with at least one high-speed interface card connected to a high-speed interconnect fabric in Level-1. The node is set up with Linux multi-path equalized routing to multiple available I/O nodes for load balancing and failover (high availability). A compute node is used for computing only and is not involved in any routing activities.

2.1.4 I/O node

An I/O routing node has two kinds of network interfaces. One high-speed interface card is connected to the Level-1 network for communication with compute nodes, and one or more Gigabit Ethernet interface cards (bondable) are connected to the Level-2 linearly scaling Gigabit switches. I/O nodes serve as the routing gateways between the Level-1 and Level-2 networks. Every I/O node has the same networking capability.

2.2 System Software Components used in PaScalBB

2.2.1 Equal-Cost Multi-path routing for load balancing

Multi-path routing is used to provide balanced outbound traffic to the multiple I/O gateways. It also supports failover and dead-gateway detection for choosing good routes through active I/O gateways. Linux multi-path routing is a destination-address-based load-balancing algorithm. Multi-path routing should improve system performance through load balancing and reduce end-to-end delay; it overcomes the capacity constraint of single-path routing and routes through less congested paths. Each compute node is set up with N-way multi-path routes through N I/O nodes. Multi-path routing also balances the bandwidth gap between the Level-1 and Level-2 interconnects. We use the Equal-Cost Multi-path (ECMP) routing strategy on compute nodes so that compute nodes can evenly distribute traffic workloads over all I/O nodes. With this bi-directional multi-path routing, we can sustain parallel data paths for both write (outbound) and read (inbound) data transfers. This is especially useful when applied to concurrent socket I/O sessions on IP-based storage systems: PaScalBB can evenly allocate socket I/O sessions to the available I/O routing nodes.
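To make the N-way multi-path setup concrete, the following minimal sketch (not the production PaScalBB configuration) shows how a compute node's equal-cost default route across several I/O-node gateways could be installed with the iproute2 tools. The gateway addresses and the IPoIB interface name ib0 are illustrative placeholders.

```python
#!/usr/bin/env python3
"""Illustrative sketch: install an N-way equal-cost multipath (ECMP) default
route on a compute node so outbound I/O traffic is spread over N I/O-node
gateways. The gateway addresses and the IPoIB interface name 'ib0' are
placeholders, not the production PaScalBB values."""

import subprocess
import sys

def ecmp_route_cmd(gateways, dev="ib0"):
    """Build one iproute2 command that replaces the default route with a
    multipath route carrying one equal-weight nexthop per I/O gateway."""
    cmd = ["ip", "route", "replace", "default"]
    for gw in gateways:
        cmd += ["nexthop", "via", gw, "dev", dev, "weight", "1"]
    return cmd

if __name__ == "__main__":
    # Example: four I/O-node gateways reachable over the Level-1 (IPoIB) fabric.
    io_gateways = ["10.10.0.1", "10.10.0.2", "10.10.0.3", "10.10.0.4"]
    cmd = ecmp_route_cmd(io_gateways)
    print(" ".join(cmd))              # inspect the generated command
    if "--apply" in sys.argv:         # actually install the route (requires root)
        subprocess.run(cmd, check=True)
```

Because every nexthop carries the same weight, outbound flows are spread evenly over the I/O gateways, matching the destination-address-based load balancing described above.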
I/O nodes are used heavily in LANL's PaScalBB network to carry data traffic between backend compute nodes and global scratch-based file systems. An I/O node is normally equipped with one InfiniBand NIC for backend IPC traffic and one or more 10-Gigabit Ethernet NICs for parallel file system data traffic [6][7][8].

3 INFINIBAND AND 10 GIGABIT ETHERNET

InfiniBand [3] is a standard switched-fabric communication link used in high-performance computing and enterprise data centers. The InfiniBand Architecture (IBA) is designed to provide high-bandwidth, low-latency computing; the scalability to support thousands of nodes and multiple processor cores per server; and efficient utilization of compute processing resources. The TOP-500 list published in November 2010 shows that more than 42% of the listed computing systems use InfiniBand as their primary high-speed interconnect, and the growth rate of InfiniBand in the TOP-500 systems is about 30%. This indicates strong momentum in the adoption of InfiniBand technology in the HPC and enterprise communities.

Ethernet has long been the dominant LAN technology, and the availability of 10-Gigabit Ethernet has now enabled new applications in the data center and in IP-based storage systems. Because 10-Gigabit Ethernet is based on the core Ethernet technology, it takes advantage of the wealth of improvements developed over the years and simplifies the migration to this higher-speed technology.

With the growing deployment of multiple multi-core processors in server and storage systems, overall platform efficiency and CPU and memory utilization depend increasingly on interconnect bandwidth and latency. PCI-Express (PCIe) generation 2.0 has recently become available and has doubled the transfer rates available. This additional I/O bandwidth balances the system and makes higher data rates for external interconnects such as InfiniBand feasible. As a result, InfiniBand Quad-Data-Rate (QDR) mode has become available on the InfiniBand Host Channel Adapter (HCA) with a 40 Gb/sec signaling rate. Combining InfiniBand HCA QDR data rates with multiple 10-Gigabit Ethernet links in I/O nodes has created the potential to solve some of the I/O traffic bottlenecks that currently exist in HPC machines.
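The 40 Gb/sec figure quoted above is a signaling rate. With the 8b/10b line encoding used by SDR, DDR, and QDR links, the theoretical peak data rate of a 4X port is lower, as the short calculation below illustrates; these are textbook peak figures, not measurements from our testbed.

```python
#!/usr/bin/env python3
"""Illustrative arithmetic: theoretical peak data rates of 4X InfiniBand
SDR/DDR/QDR links, assuming the standard 8b/10b line encoding used by these
link speeds. These are textbook peak figures, not testbed measurements."""

LANES = 4                      # a 4X HCA port aggregates four lanes
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b: 8 data bits per 10 signalled bits

signaling_gbps_per_lane = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

for mode, lane_rate in signaling_gbps_per_lane.items():
    signaling_gbps = lane_rate * LANES                  # QDR: 40 Gb/s signaling
    data_gbps = signaling_gbps * ENCODING_EFFICIENCY    # QDR: 32 Gb/s of data
    data_gbytes = data_gbps / 8                         # QDR: 4 GB/s per direction
    print(f"{mode}: {signaling_gbps:.0f} Gb/s signaling -> "
          f"{data_gbps:.0f} Gb/s data ({data_gbytes:.0f} GB/s per direction)")
```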
4 EXPERIMENTAL TESTING SETUP AND PERFORMANCE EVALUATION

We set up a small-scale PaScalBB test bed and conducted a sequence of I/O node performance tests.

4.1 Testing setup and configuration

Hardware equipment includes (a) twelve Linux server machines, each with dual quad-core Intel Nehalem 5600-series processors and 16 GB of DDR3 memory: seven compute nodes, each with one Mellanox ConnectX InfiniBand QDR HCA; one I/O node with a Mellanox ConnectX InfiniBand QDR HCA [10] and multiple Mellanox ConnectX 10-Gigabit Ethernet NICs; and four data nodes, each with one 10-Gigabit Ethernet connection; (b) one Mellanox 36-port InfiniBand QDR switch; and (c) one Arista 24-port 10-Gigabit Ethernet switch [11]. Software components include (a) Fedora 12 64-bit Linux, (b) the OFED (OpenFabrics Enterprise Distribution) [9] InfiniBand/10-Gigabit Ethernet system software, (c) the Linux Ethernet bonding driver, and (d) netperf [12], a network performance benchmark.

4.2 Performance testing and evaluation

4.2.1 InfiniBand SDR/DDR/QDR performance testing

Figure 2 shows the one-way communication bandwidth for IB/SDR (single data rate), IB/DDR (double data rate), and IB/QDR (quad data rate). It illustrates an improvement of 75% in bi-directional bandwidth when moving from DDR to QDR. Figure 3 shows the latency testing results for IB/SDR, IB/DDR, and IB/QDR; these results demonstrate the lower latency of QDR. Figure 4 shows MPI I/O testing using message sizes from 1 MB to 200 MB; the results show that IB/QDR provides consistent bandwidth across the range of message sizes used in MPI applications. Figure 5 shows the results of (a) QDR/UC (unreliable connection) one-way communication bandwidth, (b) QDR/RC (reliable connection) one-way communication bandwidth, and (c) QDR/SRQ (shared receive queue) bi-directional communication bandwidth. IB/QDR can reach a peak of more than 5600 MB/sec of bi-directional bandwidth with multiple concurrent netperf streams.
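The aggregate-bandwidth figures reported here and in Section 4.2.3 were obtained by running several netperf streams concurrently. The following sketch shows one way such a multi-stream driver could be scripted; the host names, stream count, and test duration are illustrative placeholders and do not reproduce our exact test harness.

```python
#!/usr/bin/env python3
"""Illustrative sketch: run several concurrent netperf TCP_STREAM tests from
one node to a set of data nodes (each running netserver) and report the
aggregate throughput. Host names, stream count, and duration are placeholders
and do not reproduce the exact harness used to produce the figures."""

import subprocess
from concurrent.futures import ThreadPoolExecutor

DATA_NODES = ["data1", "data2", "data3", "data4"]  # hypothetical netserver hosts
STREAMS_PER_NODE = 2
DURATION_S = 30

def run_stream(host):
    """Run one netperf TCP_STREAM test; with '-P 0' netperf prints a single
    result line whose last field is the throughput in 10^6 bits/sec."""
    out = subprocess.run(
        ["netperf", "-H", host, "-t", "TCP_STREAM",
         "-l", str(DURATION_S), "-P", "0"],
        capture_output=True, text=True, check=True).stdout
    return float(out.split()[-1])

if __name__ == "__main__":
    targets = DATA_NODES * STREAMS_PER_NODE
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        mbps = list(pool.map(run_stream, targets))
    total = sum(mbps)
    print(f"aggregate throughput: {total:.0f} * 10^6 bits/sec (~{total / 8:.0f} MB/sec)")
```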
4.2.2 10-Gigabit Ethernet performance testing

Figure 6 shows the performance results for a back-to-back connection using a single 10-Gigabit Ethernet link between two server nodes; we reach 95% of the bandwidth of the physical 10-Gigabit link. Figure 7 shows the performance of triple 10-Gigabit Ethernet bonding over a back-to-back connection; we reach a peak of 2300 MB/sec from the three-link bond. Figure 8 shows the performance of quad 10-Gigabit Ethernet bonding over a back-to-back connection. It improves bandwidth by only 5%-10% compared with the three-link bond, which may be due to the Ethernet chipset processing capability or the Linux TCP/IP software stack.

4.2.3 I/O node performance testing and justification

Figure 9 shows the results of using four compute nodes to send concurrent multiple streams of netperf traffic through one I/O node to four different data nodes. The data include the bandwidth of the four individual links and the accumulated bandwidth, which reaches about 2950 MB/sec. Figure 10 shows the result of using seven compute nodes; here we can push the bandwidth to 4100 MB/sec. Figures 9 and 10 show that we gain more bandwidth as more compute nodes send network traffic, which also demonstrates the scaling capability of LANL's PaScalBB server I/O infrastructure.

In Figure 11, we verify the advantage of the Linux Ethernet bonding capability. We tried two Ethernet bonding algorithms implemented in the Linux kernel: mode-0 and mode-5. Mode-0, named balance-rr (round-robin policy), transmits data packets in sequential order from the first available slave through the last; this mode provides load balancing and fault tolerance. Mode-5, named balance-tlb (adaptive transmit load balancing), supports channel/port bonding that does not require any special switch support. Outgoing data traffic is distributed according to the current load on each slave link, while incoming data traffic is received by the current slave link; if the receiving slave fails, another slave takes over the MAC address of the failed slave. The purpose of this testing is to identify a traffic load-balancing algorithm that best accommodates the parallel file systems used in HPC machines. Our results show that mode-5 (adaptive transmit load balancing) obtains 10%-15% more bandwidth than mode-0 (a simple round-robin policy); a configuration sketch for the two modes is given at the end of this section.

From the above results, we conclude that there is a clear advantage to using multiple bonded 10-Gigabit Ethernet links in an I/O node when transferring data through an IB/QDR link. We also learned how to tune the 10-Gigabit Ethernet bonding algorithms to best fit an HPC parallel file system such as the Panasas PanFS ActiveScale parallel file storage system.
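For reference, the sketch below shows one way the I/O node's 10-Gigabit Ethernet ports could be bonded through the Linux bonding driver's sysfs interface and switched between mode-0 (balance-rr) and mode-5 (balance-tlb). The interface names are placeholders, and the exact configuration used on our testbed is not reproduced here.

```python
#!/usr/bin/env python3
"""Illustrative sketch: bond the I/O node's 10-Gigabit Ethernet ports through
the Linux bonding driver's sysfs interface and select the bonding mode
(balance-rr is mode-0, balance-tlb is mode-5). It assumes the bonding module
is already loaded (e.g. 'modprobe bonding') so that bond0 exists; interface
names are placeholders, and the script must run as root on a quiesced node."""

BOND = "bond0"
SLAVES = ["eth2", "eth3", "eth4"]   # hypothetical 10GigE ports on the I/O node
MODE = "balance-tlb"                # use "balance-rr" to test mode-0 instead

def sysfs_write(path, value):
    with open(path, "w") as f:
        f.write(value)

def configure_bond():
    # The mode can only be changed while the bond is down and has no slaves.
    sysfs_write(f"/sys/class/net/{BOND}/bonding/mode", MODE)
    # MII link monitoring interval in milliseconds, used for failover detection.
    sysfs_write(f"/sys/class/net/{BOND}/bonding/miimon", "100")
    # Enslave each 10GigE port (ports should be down before being enslaved).
    for slave in SLAVES:
        sysfs_write(f"/sys/class/net/{BOND}/bonding/slaves", f"+{slave}")

if __name__ == "__main__":
    configure_bond()
```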
5 CONCLUSIONS AND FUTURE WORK

We evaluate the bandwidth performance of IB/SDR, IB/DDR, and IB/QDR, and we evaluate various bonding algorithms for multiple 10-Gigabit Ethernet links. We verify the capability of an I/O node equipped with one IB/QDR HCA and multiple 10-Gigabit Ethernet links, study the Linux Ethernet bonding algorithms, and observe the scaling behavior of an I/O node as it handles more network traffic. From this work we identify a better network setup and configuration for LANL's PaScalBB network, and we have applied our testing results to LANL's production machines.

As part of future work, we intend to conduct evaluations on larger test beds, possibly using available production HPC machines, and to study the impact of new PaScalBB network setups and configurations. We also intend to carry out more in-depth studies applying different network benchmarking tests, MPI-IO tests, and parallel file system tests.

This work was carried out under the auspices of the National Nuclear Security Administration of the U.S. Department of Energy at Los Alamos National Laboratory.

REFERENCES

[1] Hari Subramoni, Matthew Koop, and Dhabaleswar K. Panda, Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms, HOTI, IEEE Annual Symposium on High-Performance Interconnects.
[2] Matthew J. Koop, Wei Huang, Karthik Gopalakrishnan, and Dhabaleswar K. Panda, Performance Analysis and Evaluation of PCIe 2.0 Quad-Data Rate InfiniBand, HOTI, IEEE Annual Symposium on High-Performance Interconnects.
[3] InfiniBand Roadmap, InfiniBand Trade Association.
[4] HPC Advisory Council Network of Expertise, Interconnect Analysis: 10GigE and InfiniBand in High Performance Computing, 2009.
[5] Munira Hussain, Gilad Shainer, Tong Liu, and Onur Celebioglu, Comparing DDR and QDR InfiniBand in 11th-Generation Dell PowerEdge Clusters, Dell Power Solutions, 2010 Issue 1.
[6] Gary Grider, Hsing-bung Chen, James Nunez, Steve Poole, Rosie Wacha, Parks Fields, Robert Martinez, Paul Martinez, and Satsangat Khalsa, PaScal - A New Parallel and Scalable Server I/O Networking Infrastructure for Supporting Global Storage/File Systems in Large-size Linux Clusters, Proceedings of the 25th IEEE International Performance, Computing, and Communications Conference (IPCCC 2006), April 2006.
[7] Hsing-bung Chen, Gary Grider, and Parks Fields, A Cost-Effective, High Bandwidth Server I/O Network Architecture for Cluster Systems, 2007 IEEE IPDPS Conference.
[8] Hsing-bung Chen, Parks Fields, and Alfred Torrez, An Intelligent Parallel and Scalable Server I/O Networking Environment for High Performance Cluster Computing Systems, PAPTA 2008 Conference.
[9] OFED, OpenFabrics Enterprise Distribution.
[10] Mellanox.
[11] Arista Networks.
[12] Netperf.

Figure 1: System diagram of LANL's PaScalBB server I/O architecture. Compute nodes perform outbound N-way load balancing via multi-path routing over the Level-1 interconnect; inbound traffic arrives as M-way multiple streams through the Level-2 interconnect and the global file system; I/O nodes/VLANs use OSPF to route inbound and outbound traffic between the Level-1 and Level-2 networks.
Figure 2: IB/SDR, IB/DDR, and IB/QDR performance testing.
Figure 3: IB/SDR, IB/DDR, and IB/QDR latency testing.
Figure 4: Multithreaded MPI testing using IB/QDR.
Figure 5: IB/QDR bi-directional bandwidth testing.
Figure 6: Back-to-back single 10-Gigabit Ethernet link testing.
Figure 7: Three 10-Gigabit Ethernet bonding performance testing.
Figure 8: Four 10-Gigabit Ethernet bonding performance testing.
Figure 9: Scaling testing using four compute nodes.
Figure 10: Scaling testing using seven compute nodes.
Figure 11: Linux bonding mode-0 vs. mode-5 testing.