Fabric Bandwidth Comparisons on OpenVPX Backplanes

OpenVPX Backplane Fabric Choice Calls for Careful Analyses

There's a host of factors to consider when evaluating the system bandwidth of various OpenVPX fabrics. A detailed comparison of 10 Gbit Ethernet, Serial RapidIO and InfiniBand sheds some light.

Peter Thompson, Director of Applications, Military and Aerospace, GE Intelligent Platforms

When evaluating choices between interconnect fabrics and topologies as part of a systems engineering exercise, there are many factors to be considered. Papers have been published that purport to illustrate the advantages of some schemes over others. However, some of these analyses adopt a simplistic model of the architectures that can be misleading when it comes to mapping a real-world problem onto such systems. A more rigorous approach to the analysis is needed in order to derive metrics that are more meaningful, and which differ significantly from those produced by a simplistic analytical approach.

This article compares the bandwidth available to two common types of dataflow for systems based on the VITA 65 CEN16 central switched topology, using three different fabrics: Serial RapidIO (SRIO), 10 Gigabit Ethernet (10GbE) and Double Data Rate InfiniBand (DDR IB). The analysis will show that the difference in routing for the three fabrics on the CEN16 backplane is minimal, that for the use cases presented 10GbE is closer in performance to SRIO than is claimed elsewhere, and that DDR IB matches SRIO.

The System Architecture

For the purposes of this analysis, consider an OpenVPX CEN16 backplane that allows for 14 payload (processor) slots and two switch slots (Figure 1). This is a non-uniform topology that routes one connection from each payload slot to each of the two switch slots, plus one connection to the adjacent slot on each side.

First consider a system built from boards that each have two processing nodes (which may be multicore), where each node is connected via Gen2 Serial RapidIO (SRIO) to an onboard switch, which in turn has four connections to the backplane. Each processor has two connections to that switch. That is then compared with a similar system based around dual-processor boards with two 10GbE links per node and no onboard switch. If, as in the case of the GE DSP280 multiprocessor board, a Mellanox network interface chip is used, the same board can be software-reconfigured to support InfiniBand: in fact, the only change required to migrate from 10GbE to DDR InfiniBand is to change the system central switch from 10GbE (for example, GE's GBX460) to InfiniBand (such as GE's IBX400). The backplane can remain unchanged, as can the payload boards. The interfaces can be software-selected as 10GbE or IB. By using appropriate middleware such as AXIS or MPI, the application code remains agnostic to the fabric.
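To make the slot-level wiring concrete, here is a minimal sketch that enumerates the backplane links just described: one link from every payload slot to each central switch, plus one link between each pair of adjacent payload slots. The slot labels and the Python modeling are illustrative assumptions, not part of any OpenVPX specification.

```python
# Illustrative model of the CEN16 slot-level connectivity described above.
# Slot labels are hypothetical; only the link pattern follows the text.

PAYLOAD_SLOTS = list(range(1, 15))   # 14 payload (processor) slots
SWITCH_SLOTS = ["SW_A", "SW_B"]      # the two central switch slots

links = []

# One connection from each payload slot to each of the two switch slots.
for p in PAYLOAD_SLOTS:
    for sw in SWITCH_SLOTS:
        links.append((p, sw))

# One connection between each pair of adjacent payload slots.
for p in PAYLOAD_SLOTS[:-1]:
    links.append((p, p + 1))

# In this model, interior slots end up with four backplane connections (two
# to the central switches, one to each neighbor); the end slots have three.
for p in PAYLOAD_SLOTS:
    degree = sum(1 for a, b in links if p in (a, b))
    print(f"payload slot {p:2d}: {degree} backplane links")
```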

Figure 1: An OpenVPX CEN16 backplane that allows for 14 payload (processor) slots and two switch slots. This is a non-uniform topology that routes one connection from each payload slot to each of the two switch slots, plus one connection to the adjacent slot on each side. (The original diagram also shows the utility plane, power, and the IPMB management connections from each slot's IPMC to the two chassis managers; slot numbers are logical, and physical slot numbers may differ.)

Assumptions and Rate Arithmetic

It is assumed that the switches for all three fabrics are non-blocking for the number of ports that each switch chip supports. However, as will be seen, the number of chips and the hierarchy used to construct a 20-port central switch board can have a significant impact on the true network topology, and therefore on the bandwidth available to an application.

One factor that can be overlooked is that in addition to the primary data fabric connections, there can be an alternate path between nodes on the same board that can be seamlessly leveraged by the application. For example, GE's DSP280 multiprocessor board has eight lanes of PCIe Gen2 between the two processors via a switch with non-transparent bridging capability. This adds a path with up to 32 Gbit/s available. It's important that the inter-processor communication software is able to leverage mixed data paths within a system. The AXIS tools from GE can do that: they can be used to build a dataflow model that represents the algorithm's needs, and the user has complete control over which interconnect mechanism is used for each data link.

SRIO Gen2 (which is only just starting to emerge on products as of early 2011) runs at 5 GHz with the chipsets commonly in use. A 4-lane connection, with the overhead of 8b/10b encoding, yields a raw rate of 4 x 5 Gbit/s x 0.8 = 16 Gbit/s. 10GbE clocks at 3.125 GHz on 4 lanes with the same 8b/10b encoding, so it has a raw rate of 4 x 3.125 Gbit/s x 0.8 = 10 Gbit/s. DDR InfiniBand clocks at 5 GHz, with a raw rate of 4 x 5 Gbit/s x 0.8 = 16 Gbit/s. Mellanox interface chips that support both 10GbE and IB have been available and deployed for some time now, and are considered a mature technology with widespread adoption in mainstream high-performance computing.
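The rate arithmetic above is simple enough to check mechanically. The following sketch (plain Python, nothing fabric-specific) reproduces the three raw link rates from the lane counts, the lane clocks and the 0.8 payload efficiency of 8b/10b encoding:

```python
# Effective per-link rates from the article's arithmetic: 4-lane links with
# 8b/10b encoding, which delivers 8 payload bits per 10 line bits (x 0.8).

def raw_rate_gbps(lanes: int, lane_ghz: float, encoding: float = 0.8) -> float:
    """Raw data rate in Gbit/s for a multi-lane serial link."""
    return lanes * lane_ghz * encoding

srio_gen2 = raw_rate_gbps(4, 5.0)     # 4 x 5.0   x 0.8 = 16 Gbit/s
ten_gbe   = raw_rate_gbps(4, 3.125)   # 4 x 3.125 x 0.8 = 10 Gbit/s
ddr_ib    = raw_rate_gbps(4, 5.0)     # 4 x 5.0   x 0.8 = 16 Gbit/s

print(f"SRIO Gen2: {srio_gen2:g} Gbit/s, 10GbE: {ten_gbe:g} Gbit/s, "
      f"DDR IB: {ddr_ib:g} Gbit/s")
```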

Embedded System Architecture

Now consider a system built from fourteen such boards in an OpenVPX chassis with a backplane that conforms to the BKP6-CEN16-11.2.2.n profile. This supports fourteen payload boards and two central switch boards, and yields a nominal interconnect diagram as shown in Figure 2 for the SRIO case. For 10GbE or InfiniBand, the same backplane results in an interconnect mapping that is represented in Figure 3.

Figure 2: This interconnect diagram for the Serial RapidIO use case has an OpenVPX backplane that conforms to the BKP6-CEN16-11.2.2.n profile. This supports 14 payload boards and two central switch boards.

Those diagrams do not tell the whole story, however. They would be correct if the central switches shown were constructed from a single, non-blocking, 18- to 20-port switch device. However, this is not the case for all the fabrics. In the 10GbE case, a GBX460 switch card can be used, which employs a single 24-port switch chip. For an InfiniBand system, the IBX400 can be used, which has a single 36-port switch chip where each port is x4 lanes wide. In the case of SRIO Gen2, the switch chip commonly selected is a device that supports 48 lanes, in other words 12 ports of x4 links. In order to construct a switch of higher order, it is necessary to use several chips in some kind of a tree structure. Here a tradeoff must be made between the number of chips used and the overall performance of the aggregated switch.

All-to-All Bandwidth Measurement

When evaluating network architectures, a common approach is to look at an all-to-all exchange of data, as it represents a common problem encountered in embedded processing systems: a distributed corner turn of matrix data. This is a core function in synthetic aperture radars, for instance. It is commonly seen when the processing algorithm calls for a two- (or higher) dimensional array to be subjected to a two- (or higher) dimensional Fast Fourier Transform. In order to meet system time constraints, the transform is often distributed across many processor nodes. Between the row FFTs and the column FFTs, the data must be exchanged between nodes. This requires an all-to-all exchange of data that can tax the available bandwidth of a system; the sketch below shows where that exchange arises.
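The following single-process numpy sketch performs a 2D FFT the distributed way: row FFTs, a corner turn (transpose), then row FFTs again. The striping into `n_nodes` slices is an illustrative stand-in for data that would really live on separate processor nodes:

```python
import numpy as np

n_nodes, n = 4, 256
data = np.random.rand(n, n) + 1j * np.random.rand(n, n)

# Stage 1: each "node" FFTs its stripe of rows.
stripes = np.array_split(data, n_nodes, axis=0)
rows = np.vstack([np.fft.fft(s, axis=1) for s in stripes])

# Corner turn: globally a transpose. Distributed across nodes, each node must
# ship one block to every other node, the all-to-all exchange discussed above.
turned = rows.T.copy()

# Stage 2: FFT the former columns, now contiguous rows on each node.
result = np.fft.fft(turned, axis=1).T

assert np.allclose(result, np.fft.fft2(data))  # matches a direct 2D FFT
```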
A simple analysis of this topology might make the following assumptions: there are links between nodes on each board via the onboard switch, there are links to nodes on adjacent cards via the links between the onboard switches, and there are 22 connections made via the central switches. In this approach, the overall performance for an all-to-all exchange might be assumed to be determined by the lowest aggregate bandwidth of these three connection types, in other words that of a single link divided by the number of connections. This equates to 4 lanes x 5 Gbit/s x 0.8 encoding / 22 nodes = 0.73 Gbit/s.

If we apply the same simplistic analysis to the 10GbE system, it suggests that the bandwidth available for all-to-all transfers is 4 lanes x 3.125 Gbit/s x 0.8 encoding across the x8 connections between switches / 368 paths, or roughly 0.22 Gbit/s per path when using 10GbE. That means SRIO has an apparent speed advantage of 3.4 to 1.

However, this is a flawed analysis and gives a misleading impression as to the relative performance that might be expected from the two systems when doing a corner turn. The two architectures are evaluated with different methods: one by dividing the worst-case link bandwidth by the number of processors sharing it, and the other by dividing the worst-case link bandwidth by the number of links that share it. The arithmetic is reproduced below.
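For reference, this snippet reproduces the simplistic arithmetic as quoted, mixing a per-node division for SRIO with a per-path division for 10GbE; the point is precisely that the two figures are not comparable:

```python
# The simplistic method from the text: divide a worst-case link by how many
# nodes (SRIO) or paths (10GbE) share it. The divisions are not like for like.

srio_link = 4 * 5.0 * 0.8             # 16 Gbit/s per x4 SRIO link
srio_simple = srio_link / 22          # shared by 22 far-side nodes -> 0.73

tengbe_trunk = 8 * (4 * 3.125 * 0.8)  # x8 trunk of 10GbE links = 80 Gbit/s
tengbe_simple = tengbe_trunk / 368    # shared by 368 paths -> ~0.22

print(f"SRIO:  {srio_simple:.2f} Gbit/s per node")
print(f"10GbE: {tengbe_simple:.2f} Gbit/s per path")
# ~3.3 by this arithmetic; the article rounds it to 3.4 to 1.
print(f"apparent advantage: {srio_simple / tengbe_simple:.1f} : 1")
```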
Figure 3: This interconnect diagram has an OpenVPX backplane with an interconnect mapping for 10GbE or InfiniBand.

Architecture Matters

A second potential error is to ignore the internal architecture of each switch device, as this can have an effect in cases where the switch does not have balanced bandwidth across its ports. However, the biggest flaw is the suggestion that the performance of a non-uniform tree architecture can be modeled by deriving the lowest-bandwidth connection in the system. In network theory, it is widely accepted that the best metric for the expected performance of such a system is its bisection bandwidth. The bisection bandwidth of a system is found by dividing the system into two equal halves along a dividing line, and enumerating the rate at which data can be communicated between the two halves.

Reconsidering the network diagram of the SRIO system, the bisection width is defined by the number of paths that the median line crosses, which adds up to 19. Similarly, the bisection width of the 10GbE or DDR IB system also adds up to 19. Given that the link bandwidth for the SRIO system is 16 Gbit/s and for 10GbE is 10 Gbit/s, the bisection bandwidth of the SRIO system is 19 x 16 = 304 Gbit/s, and for the 10GbE system it is 19 x 10 = 190 Gbit/s. This represents an expected performance ratio for the total exchange scenario of 1.6 to 1 in favor of the SRIO system, not the 3.4 to 1 predicted in the simplistic model. If we now replace the 10GbE switch with an InfiniBand switch, which fits the same slot and backplane profiles, the bisection bandwidth is 19 x 16 = 304 Gbit/s. Therefore the performance of DDR InfiniBand matches that of SRIO.
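The bisection arithmetic is equally mechanical. This sketch takes the bisection width of 19 links from the count above as a given (it depends on where the median line falls on this particular backplane) and multiplies it by the per-link rates:

```python
# Bisection bandwidth as defined above: cut the system into two equal halves
# and total the link rate crossing the cut.

BISECTION_WIDTH = 19  # links crossed by the median cut, per the article

link_rate = {"SRIO": 16.0, "10GbE": 10.0, "DDR IB": 16.0}  # Gbit/s per x4 link

bisection = {fabric: BISECTION_WIDTH * rate for fabric, rate in link_rate.items()}
for fabric, bw in bisection.items():
    print(f"{fabric:7s}: {bw:.0f} Gbit/s")   # 304 / 190 / 304 Gbit/s

print(f"SRIO : 10GbE = {bisection['SRIO'] / bisection['10GbE']:.1f} : 1")  # 1.6
```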

Bandwidth Calculations: Pipeline Case

Another dataflow model commonly considered is a pipeline, where data streams from node to node in a linear manner. When designing such a dataflow, it is normal to map the tasks and flow to the system in an optimal manner. This can include using different fabric connections for different parts of the flow. A good IPC library and infrastructure will allow the designer to do so without requiring any modifications to the application code; AXIS has this characteristic. Here, for simplicity, it is assumed that the input and output data sizes at each processing stage are the same (no data reduction or increase). In this instance the rate of the slowest link in the chain dictates the overall achievable performance.

If Task 1 is mapped to node 1, Task 2 to node 2 and Task 3 to node 3, the available paths are shown in yellow in Figure 4 for the 10GbE system. The path from Task 1 to Task 2 is over x8 PCIe Gen2, with an available bandwidth of 32 Gbit/s. The path from Task 2 to Task 3 has access to two 10GbE links, an aggregate rate of 20 Gbit/s. Therefore the minimum path is 20 Gbit/s. In the DDR IB system, the path from Task 2 to Task 3 has access to two IB links, an aggregate rate of 32 Gbit/s. The PCIe link is unchanged, so the minimum leg here is 32 Gbit/s. Now, for the SRIO system, with paths between nodes 1 and 2 and between nodes 2 and 3, two separate SRIO links are available for each leg, so 32 Gbit/s is available for both legs.

Figure 4: Shown here is a pipeline dataflow scheme mapped to a 10GbE system, with the Task 1 to Task 2 and Task 2 to Task 3 paths highlighted.

Figure 5: The table summarizes the system analyses for the 10GbE, SRIO and DDR InfiniBand systems. The DDR InfiniBand system matches the performance of the SRIO system for both use cases.

Backplane                      Use Case     10GbE        SRIO         DDR IB       SRIO:10GbE   SRIO:IB
CEN16, 14 payload + 2 switch   All-to-all   190 Gbit/s   304 Gbit/s   304 Gbit/s   1.6x         1x
CEN16, 14 payload + 2 switch   Pipeline     20 Gbit/s    32 Gbit/s    32 Gbit/s    1.6x         1x

The result of all this is that the limiting bandwidths for the pipeline use case are 20 Gbit/s for 10GbE, 32 Gbit/s for DDR IB and 32 Gbit/s for SRIO, as the short calculation below confirms.
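A minimal sketch of the pipeline bottleneck calculation, using the leg aggregates worked out above; the leg names are just labels for this example:

```python
# Pipeline case: with equal data sizes at every stage, throughput is set by
# the slowest leg in the chain.

def pipeline_rate_gbps(legs: dict[str, float]) -> float:
    """Achievable pipeline rate: the minimum-bandwidth leg in the chain."""
    return min(legs.values())

systems = {
    # Task1->Task2 runs over x8 PCIe Gen2 (32 Gbit/s) in the 10GbE/IB cases.
    "10GbE":  {"t1->t2 (PCIe x8)": 32.0, "t2->t3 (2 x 10GbE)": 20.0},
    "DDR IB": {"t1->t2 (PCIe x8)": 32.0, "t2->t3 (2 x IB)":    32.0},
    "SRIO":   {"t1->t2 (2 x SRIO)": 32.0, "t2->t3 (2 x SRIO)": 32.0},
}

for name, legs in systems.items():
    print(f"{name:7s}: {pipeline_rate_gbps(legs):.0f} Gbit/s")  # 20 / 32 / 32
```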
Other Factors to Consider

The push to support open software architectures (MOSA, FACE and so on) is leading the military embedded processing industry to support middleware packages such as the OpenFabrics Enterprise Distribution (OFED) and OpenMPI for data movement. Typically OpenMPI is layered over a network stack, and its performance is highly reliant on how efficiently the layers map to the underlying fabric. Some SRIO implementations rely on rionet, a Linux network driver that presents a TCP/IP interface to SRIO. Contrast this with an OpenMPI implementation that maps through OFED to RDMA over 10GbE or InfiniBand, and it can be seen that the potential exists for a large gap in performance at the application level, with RDMA being much more efficient.

Meanwhile, it is sometimes claimed that SRIO is more power efficient than the other fabrics. If we total up the power of the bridge and switch components for each 16-slot system, a truer picture emerges: the power efficiency of SRIO and DDR IB is on par, with 10GbE fairly close.

Differences Not Significant

Figure 5 summarizes the system analyses for the 10GbE, SRIO and DDR InfiniBand systems. It shows that for both use cases, the simplistic analysis presented elsewhere overestimates the performance advantage of SRIO over 10GbE by a factor of two, and that the advantage is completely attributable to the difference in clock rates; the CEN16 topology has little to no effect in reality. It also shows that the DDR InfiniBand system matches the performance of the SRIO system for both use cases.

GE Intelligent Platforms, Charlottesville, VA.

Reprinted from June 2011 COTS Journal.