TIBCO, HP and Mellanox High Performance Extreme Low Latency Messaging


Executive Summary:

With the recent release of TIBCO FTL™, TIBCO is once again changing the game in high performance messaging middleware. Many solutions have emerged that try to provide next-generation systems with extreme low latency, but they do so by sacrificing the traditional features and functions that mission-critical middleware solutions require. TIBCO's approach is to offer a middleware solution that delivers extreme low latency without sacrifice, scaling not only to meet the demands of low-latency data distribution but also to meet demand as the application grows from a few instances to thousands of instances.

In this report, produced with the assistance of HP and Mellanox/Voltaire, TIBCO provides benchmarks using TIBCO FTL 1.0 across a number of physical transports to show the average one-way latency for each transport. In addition, TIBCO shows how varying the message size has minimal impact on the latency metrics, depending on which transport is used. The goal of this report is not to cover every use case; other benchmarks provide those types of reports. Instead, TIBCO, HP and Mellanox/Voltaire wanted to show the performance benefits of the end-to-end solution and give a general overview of how infrastructure and data-distribution decisions can impact overall latency. The final results of these tests show that in all categories, using TIBCO FTL, HP DL380 G7 systems and network equipment from Mellanox/Voltaire, customers can get the lowest-latency data distribution. For more information about testing methodology and system configuration, please contact TIBCO Software.

Test Setup:

The purpose of these tests is to show the relative latency performance of a given architectural setup and to compare the latency of a given distribution transport over a number of message sizes. For these tests, TIBCO, HP and Voltaire used a simple setup of two DL380 G7 server machines, each with two Intel Xeon X5687 processors (4 cores each) and at least 48 GB of memory per machine operating at 1333 MHz. The DL380s ran RHEL 5.5 with OFED 1.5.2, a BIOS release dated 01/30/2011, and HP iLO.

[Figure 1: System Setup]

Providing the network-layer connectivity, the two DL380 G7 systems were connected by a Voltaire 10 Gigabit Ethernet switch (Vantage 6024) and by a Voltaire QDR InfiniBand switch (Voltaire 4036). The NIC interfaces were Mellanox ConnectX-2 InfiniBand/10GbE PCIe adapters: Mellanox Technologies MT26428 (InfiniBand) and MT26448 (10 GbE).

BIOS Settings:

    BIOS parameter                             Value
    Hyperthreading                             Disabled
    HP_Power_Regulator                         HP_Static_High_Performance_Mode
    CPU_Virtualization                         Disabled
    Intel_Processor_Turbo_Mode                 Disabled
    Intel_VT-d2                                Disabled
    HP_Power_Profile                           Maximum_Performance
    Intel_QPI_Link_Power_Management            Disabled
    Intel_Minimum_Processor_Idle_Power_State   No_C-States
    Intel_Hyperthreading                       Disabled
    Collaborative_Power_Control                Disabled
    Intel_Turbo_Boost_Optimization             Optimized_for_Performance
    PowerMonitoring                            Disabled
    DisableMemoryPrefailureNotification        Yes

All tests were conducted using the sample latency tools provided with TIBCO FTL 1.0; the two sample programs used were the C implementations of tibping and tibpong. The shared memory tests were all run on the DL380 G7 system that had 96 GB of memory. All tests using TCP and reliable multicast were run over the 10 Gigabit Ethernet switch. RDMA tests were conducted using both the 10 Gigabit Ethernet switch and the QDR InfiniBand switch. The two DL380 G7 systems used Mellanox ConnectX-2 10 Gigabit Ethernet/InfiniBand adapters for interconnectivity.
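The tibping/tibpong sources are not reproduced here, but the measurement method they implement is the standard ping-pong: one side timestamps a send, the other echoes the message back, and one-way latency is taken as half the round-trip time averaged over many iterations. Below is a minimal C sketch of that method (hypothetical code, not TIBCO's; a local socket pair stands in for the transport under test):

```c
/* pingpong.c -- minimal sketch of the ping-pong latency method used by
 * latency tools such as tibping/tibpong (illustrative, not TIBCO code).
 * One-way latency is estimated as half the round-trip time, averaged
 * over many iterations.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>

#define ITERATIONS 100000
#define MSG_SIZE   16            /* bytes per message, as in the 16-byte test */

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    int sv[2];
    char buf[MSG_SIZE] = {0};

    /* A local socket pair stands in for the real transport (shared memory,
     * RDMA, TCP, ...); only the timing methodology is being illustrated.
     * Short reads are ignored for brevity. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { perror("socketpair"); return 1; }

    if (fork() == 0) {                    /* child: the "pong" echo side */
        while (read(sv[1], buf, MSG_SIZE) == MSG_SIZE)
            write(sv[1], buf, MSG_SIZE);
        _exit(0);
    }

    double start = now_ns();              /* parent: the "ping" side */
    for (int i = 0; i < ITERATIONS; i++) {
        write(sv[0], buf, MSG_SIZE);
        read(sv[0], buf, MSG_SIZE);       /* wait for the echo */
    }
    double total = now_ns() - start;

    /* one-way latency = round-trip / 2, averaged over all iterations */
    printf("avg one-way latency: %.0f ns\n", total / ITERATIONS / 2.0);
    close(sv[0]);
    return 0;
}
```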

Test Results:

Below are the individual results for each unique transport. The transports tested were shared memory, RDMA over InfiniBand, RDMA over 10 Gigabit Ethernet, TCP over 10 Gigabit Ethernet and reliable multicast over 10 Gigabit Ethernet. These tests did not use technology such as Voltaire VMA kernel bypass for the TCP and reliable multicast transports, as the intent was to show the raw performance of TIBCO FTL operating on the native transport without such assistance.

Shared Memory Transport:

[Table: Variable Message Size Latency for Shared Memory Transport. Columns: Message Size (Bytes), Test No., Total Time (Seconds), Average Total Time (Seconds), One-Way Latency (Nanoseconds), Avg. One-Way Latency (Nanoseconds)]

[Chart: Latency for Shared Memory Transport with Variable Message Size. Y-axis: Latency in Nanoseconds; X-axis: Message Size in Bytes]

The shared memory transport for TIBCO FTL allows extremely high performance, ultra-low-latency message distribution for components operating on a single host. With TIBCO FTL's multi-transport send functionality, these components can send a message once and have it delivered to local components via shared memory and to distributed components via a network transport such as RDMA, TCP or reliable UDP.
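What makes a shared-memory transport nanosecond-scale is that, once the region is mapped, the data path involves no system calls: a send is a store into shared memory and a receive is a load. The following sketch (illustrative only, using a fork-shared anonymous mapping and C11 atomics; not FTL's actual implementation) shows a ping-pong over such a region:

```c
/* shm_pingpong.c -- sketch of why a shared-memory transport reaches
 * nanosecond latencies: after setup the data path is just stores and
 * loads on a mapped region, with no system calls or context switches.
 * Not TIBCO FTL's implementation.
 */
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITER 1000000

struct channel {
    _Atomic unsigned seq;    /* sequence number doubles as the "message" */
    char payload[16];        /* 16-byte payload slot                     */
};

int main(void) {
    /* Anonymous shared mapping, visible to both processes after fork(). */
    struct channel *ch = mmap(NULL, sizeof *ch, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ch == MAP_FAILED) { perror("mmap"); return 1; }
    atomic_store(&ch->seq, 0);

    if (fork() == 0) {                     /* responder: echo each sequence */
        for (unsigned i = 1; i <= ITER; i++) {
            while (atomic_load_explicit(&ch->seq, memory_order_acquire) != 2*i - 1)
                ;                          /* spin: no syscall in the data path */
            atomic_store_explicit(&ch->seq, 2*i, memory_order_release);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 1; i <= ITER; i++) { /* initiator */
        atomic_store_explicit(&ch->seq, 2*i - 1, memory_order_release);
        while (atomic_load_explicit(&ch->seq, memory_order_acquire) != 2*i)
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg one-way: %.0f ns\n", ns / ITER / 2.0);
    return 0;
}
```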

RDMA Transport over InfiniBand:

[Table: Variable Message Size Latency for RDMA Transport over InfiniBand. Columns: Message Size (Bytes), Test No., Total Time (Seconds), Average Total Time (Seconds), One-Way Latency (Microseconds), Avg. One-Way Latency (Microseconds)]

[Chart: Latency for RDMA (InfiniBand) Transport with Variable Message Size. Y-axis: Latency in Nanoseconds; X-axis: Message Size in Bytes]

For network infrastructures that support it, RDMA provides the lowest-latency data distribution of any of the network transports available. Some latency gains can be had by using InfiniBand rather than 10 Gigabit Ethernet; comparisons between these two physical distribution layers appear later in this document.

RDMA Transport over 10 Gigabit Ethernet (RoCE):

[Table: Variable Message Size Latency for RDMA Transport over 10 Gigabit Ethernet. Columns: Message Size (Bytes), Test No., Total Time (Seconds), Average Total Time (Seconds), One-Way Latency (Microseconds), Avg. One-Way Latency (Microseconds)]

[Chart: Latency for RDMA (10 GigE) Transport with Variable Message Size. Y-axis: Latency in Nanoseconds; X-axis: Message Size in Bytes]

While InfiniBand provides some minor (~1 microsecond) latency gains over 10 Gigabit Ethernet, InfiniBand is not as pervasive as Ethernet-based deployments. Because of this, Mellanox's 10 Gigabit Ethernet support using RoCE allows for all the benefits of RDMA over an existing 10 Gigabit infrastructure.
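One reason RoCE preserves RDMA's benefits is that applications drive InfiniBand and RoCE through the same user-space verbs API; only the underlying fabric differs. The fragment below is a sketch using standard libibverbs calls (queue-pair setup and the actual transfers are omitted, and it is not FTL source code); it shows the setup steps that take the kernel out of the data path: one-time memory registration and user-space completion polling. It requires an RDMA-capable adapter to run.

```c
/* verbs_setup.c -- sketch of the user-space setup common to RDMA over
 * InfiniBand and RoCE (same libibverbs API either way).
 * Build: cc verbs_setup.c -libverbs
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Buffers are registered once, up front; the NIC then reads and writes
     * them directly, which is what removes the kernel from the data path. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* Completion handling is a user-space busy-poll, not an interrupt. */
    struct ibv_wc wc;
    int got = ibv_poll_cq(cq, 1, &wc);    /* returns 0 here: nothing posted */
    printf("device %s ready, %d completions\n",
           ibv_get_device_name(devs[0]), got);

    ibv_destroy_cq(cq); ibv_dereg_mr(mr); ibv_dealloc_pd(pd);
    ibv_close_device(ctx); ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```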

TCP Transport over 10 Gigabit Ethernet:

[Table: Variable Message Size Latency for TCP Transport over 10 Gigabit Ethernet. Columns: Message Size (Bytes), Test No., Total Time (Seconds), Average Total Time (Seconds), One-Way Latency (Microseconds), Avg. One-Way Latency (Microseconds)]

[Chart: Latency for TCP (10 GigE) Transport with Variable Message Size. Y-axis: Latency in Nanoseconds; X-axis: Message Size in Bytes]

If latency is a significant priority, RDMA over either 10 Gigabit Ethernet or InfiniBand is clearly the superior choice for data distribution. However, many applications still need to distribute data to endpoints that either lack RDMA support or do not require the extreme low latency that RDMA can provide. For these applications, TIBCO FTL's TCP transport can provide low-latency distribution without requiring new networking paradigms.
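For small-message latency over plain TCP, the single most important socket-level setting is disabling Nagle's algorithm, so each message goes on the wire immediately rather than being coalesced. A minimal sketch with generic POSIX sockets (not FTL configuration):

```c
/* tcp_lowlat.c -- sketch of the socket option typically applied before a
 * latency-sensitive TCP session (illustrative only, not FTL configuration). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* Disable Nagle's algorithm so small messages are transmitted
     * immediately instead of being batched with later writes. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);

    printf("socket %d configured with TCP_NODELAY\n", fd);
    return 0;
}
```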

Reliable UDP Transport over 10 Gigabit Ethernet:

[Table: Variable Message Size Latency for Reliable Multicast Transport over 10 Gigabit Ethernet. Columns: Message Size (Bytes), Test No., Total Time (Seconds), Average Total Time (Seconds), One-Way Latency (Microseconds), Avg. One-Way Latency (Microseconds)]

[Chart: Latency for Reliable Multicast (10 GigE) Transport with Variable Message Size. Y-axis: Latency in Nanoseconds; X-axis: Message Size in Bytes]

Even though the race to extremely low latency is encouraging the adoption of new network distribution technology, there is still a requirement for an extremely scalable, low-latency distribution pattern for high-fanout situations. TIBCO FTL's reliable UDP transport serves applications that require high-speed message distribution to multiple nodes within the infrastructure.
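The fan-out advantage comes from IP multicast: the publisher sends each message once, and the network replicates it to every subscriber that has joined the group, so the sender's cost stays constant as receivers are added. The sketch below shows the receiver-side group join with standard POSIX sockets (the group address and port are hypothetical, and FTL's reliability layer on top of UDP is not shown):

```c
/* mcast_join.c -- sketch of the receiver side of UDP multicast fan-out.
 * Illustrative only; FTL's reliable (NAK/retransmit) layer is not shown. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(7000);                     /* hypothetical port */
    bind(fd, (struct sockaddr *)&addr, sizeof addr);

    /* Join the multicast group: the switch/NIC replicate traffic, so the
     * publisher transmits each message exactly once regardless of the
     * number of receivers. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");  /* hypothetical */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof buf, 0);  /* blocks for first datagram */
    printf("received %zd bytes\n", n);
    return 0;
}
```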

Transport Comparisons:

TIBCO FTL provides the flexibility to dynamically change which transport an application uses without requiring any code changes. This gives the application administrator a simplified model for adopting new data-distribution paradigms as they are introduced into the application environment. Because of this flexibility, it becomes necessary to evaluate what benefit a given transport has over another. In addition to the individual transport results reported above, a number of comparisons can be made regarding the performance and benefits of a given solution. Below are comparisons of RDMA over InfiniBand versus 10 Gigabit Ethernet, TCP versus reliable multicast, and finally a chart showing the latency of every transport.

[Table: RDMA over InfiniBand and 10 Gigabit Ethernet with Variable Message Size. Columns: Message Size, InfiniBand Latency (Nanoseconds), 10 Gig Latency (Nanoseconds)]

[Chart: RDMA Transport, InfiniBand versus 10 GigE. Y-axis: Latency in Nanoseconds; X-axis: Message Size in Bytes. Series: RDMA over InfiniBand, RDMA over 10 GigE]

[Table: TCP versus Reliable Multicast with Variable Message Size. Columns: Message Size, TCP Latency (Nanoseconds), Multicast Latency (Nanoseconds)]

[Chart: TCP Transport versus Reliable Multicast over 10 GigE. Y-axis: Latency in Nanoseconds; X-axis: Message Size. Series: TCP over 10 GigE, Reliable Multicast over 10 GigE]

Latency Comparison between All Transports with Variable Message Size:

[Table: Columns: Message Size, Shared Memory Latency (Nanoseconds), InfiniBand Latency (Nanoseconds), 10 Gig Latency (Nanoseconds), TCP Latency (Nanoseconds), Multicast Latency (Nanoseconds)]

[Chart: Y-axis: Latency (Nanoseconds); X-axis: Message Size. Series: Shared Memory, RDMA over InfiniBand, RDMA over 10 Gig, TCP]

Conclusions:

The shared memory transport for TIBCO FTL allows extremely high performance, ultra-low-latency message distribution for components operating on a single host; the average one-way latency of the shared memory transport is measured in nanoseconds for a 16-byte message. For network infrastructures that support it, RDMA provides the lowest-latency data distribution of any of the network transports available, with InfiniBand offering some minor (~1 microsecond) latency gains over 10 Gigabit Ethernet. If latency is a significant priority, RDMA over either 10 Gigabit Ethernet or InfiniBand is clearly the superior choice for data distribution; however, many applications still need to distribute data to endpoints that either lack RDMA support or do not require the extreme low latency that RDMA can provide, and for these TIBCO FTL's TCP transport provides low-latency distribution without requiring new networking paradigms. Even though the race to extremely low latency is encouraging the adoption of new network distribution technology, there is still a requirement for an extremely scalable, low-latency distribution pattern for high-fanout situations, which TIBCO FTL's reliable UDP transport serves. Finally, with TIBCO FTL's multi-transport send functionality, components can send a message once and have it delivered to local components via shared memory and to distributed components via a network transport such as RDMA, TCP or reliable UDP.

Another set of tests was performed to determine how FTL 1.0 performance changes with processor clock speed. Using the same test environment, testing was repeated with Intel X5680 (3.33 GHz) processors.

[Table: Average One-Way Latency comparison for X5680 (3.33 GHz) and X5687 (3.60 GHz). Columns: Message Size (Bytes), Shared Memory (Nanoseconds), RDMA IB (Microseconds), RDMA 10GbE (Microseconds), TCP (Microseconds), MCAST (Microseconds)]

As indicated in the table above, the average shared-memory performance improvement for the 3.60 GHz processor is about 7.5% compared to the 3.33 GHz processor. Similar improvements were observed for the remaining transports: RDMA over IB (~5.5%), RDMA over 10GbE (~4.5%) and TCP (~6%). As newer processors with higher clock speeds reach the market, FTL performance is expected to improve further.
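The percentages quoted above follow from the usual relative-improvement calculation on the measured average one-way latencies, shown here for clarity (the symbols L_{3.33} and L_{3.60} are our notation, not the paper's):

```latex
% Relative improvement of the 3.60 GHz system over the 3.33 GHz system,
% computed per transport from the measured average one-way latencies:
\[
\text{improvement} = \frac{L_{3.33} - L_{3.60}}{L_{3.33}} \times 100\%
\]
% e.g. the ~7.5% shared-memory gain means L_{3.60} is about 0.925 L_{3.33}.
```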
