Fabric Bandwidth Comparisons on Backplanes

Open Backplane Fabric Choice Calls for Careful Analyses

There's a host of factors to consider when evaluating the system bandwidth of the various OpenVPX fabrics. A detailed comparison of 10 Gbit Ethernet, Serial RapidIO and InfiniBand sheds some light.

Peter Thompson, Director of Applications, Military and Aerospace, GE Intelligent Platforms

When evaluating choices between interconnect fabrics and topologies as part of a systems engineering exercise, there are many factors to be considered. Papers have been published that purport to illustrate the advantages of some schemes over others. However, some of these analyses adopt a simplistic model of the architectures that can be misleading when it comes to mapping a real-world problem onto such systems. A more rigorous approach to the analysis is needed in order to derive metrics that are more meaningful and which differ significantly from those derived from a simplistic analytical approach.

This article compares the bandwidth available to two common types of dataflow for systems based on the VITA 65 OpenVPX CEN central switched topology, using three different fabrics: Serial RapidIO (SRIO), 10 Gigabit Ethernet (10GigE) and Double Data Rate InfiniBand (DDR IB). The analysis will show that the difference in routing for the three fabrics on the CEN backplane is minimal, that for the use cases presented 10GigE is closer in performance to SRIO than is claimed elsewhere, and that DDR IB matches SRIO.

The System Architecture

For the purposes of this analysis, consider an OpenVPX CEN backplane that allows for fourteen payload (processor) slots and two switch slots (Figure 1). This is a non-uniform topology that routes one connection from each payload slot to each of the two switch slots, plus one connection to the adjacent slot on each side.

First consider a system built from boards that each have two processing nodes (which may be multicore), where each node is connected via Gen2 Serial RapidIO (SRIO) to an onboard switch, which in turn has four connections to the backplane. Each processor has two SRIO connections. That is then compared with a similar system based around dual processor boards with two 10GigE links per node and no onboard switch. If, as in the case of the GE DSP280 multiprocessor board, a Mellanox network interface chip is used, the same board can be software reconfigured to support InfiniBand. In fact, the only change required to migrate from 10GigE to DDR InfiniBand is to change the system central switch from 10GigE (for example, GE's GBX460) to InfiniBand (such as GE's IBX400). The backplane can remain unchanged, as can the payload boards. The interfaces can be software selected as 10GigE or IB. By using appropriate middleware such as AXIS or MPI, the application code remains agnostic to the fabric.

Assumptions and Rate Arithmetic

It is assumed that the switches for all three fabrics are non-blocking for the number of ports that each switch chip supports. However, as will be seen, the number of chips and the hierarchy used to construct a 20-port central switch board can have a significant impact on the true network topology and therefore the bandwidth available to an application.

One factor that can be overlooked is that in addition to the primary data fabric connections, there can be an alternate path between nodes on the same board that can be seamlessly leveraged by the application. For example, GE's DSP280 multiprocessor board has eight lanes of PCIe Gen2 between the two processors via a switch with non-transparent bridging capability. This adds a path with up to 32 Gbit/s available.
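The same lanes-times-line-rate-times-encoding arithmetic recurs throughout the analysis that follows. As a minimal sketch (the helper name and the fixed 8b/10b efficiency factor of 0.8 are illustrative assumptions, not something defined in the article), the per-link rates quoted in this article can be reproduced as follows:

# Sketch: usable link rates assuming 8b/10b encoding (0.8 efficiency) on every link.
def raw_rate_gbit(lanes, line_rate_ghz, encoding=0.8):
    """Return the usable rate in Gbit/s for a multi-lane serial link."""
    return lanes * line_rate_ghz * encoding

links = {
    "x8 PCIe Gen2 (onboard path)": raw_rate_gbit(8, 5.0),    # 32 Gbit/s
    "x4 SRIO Gen2":                raw_rate_gbit(4, 5.0),    # 16 Gbit/s
    "x4 10GigE":                   raw_rate_gbit(4, 3.125),  # 10 Gbit/s
    "x4 DDR InfiniBand":           raw_rate_gbit(4, 5.0),    # 16 Gbit/s
}

for name, rate in links.items():
    print(f"{name}: {rate:g} Gbit/s")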

Figure 1: An OpenVPX CEN backplane that allows for fourteen payload (processor) slots and two switch slots. This is a non-uniform topology that routes one connection from each payload slot to each of the two switch slots, plus one connection to the adjacent slot on each side. (Slot numbers are logical; physical slot numbers may be different.)

It's important that the inter-processor communication software is able to leverage mixed data paths within a system. The AXIS tools from GE can do that and can be used to build a dataflow model that represents the algorithm's needs, and the user has complete control over which interconnect mechanism is used for each data link.

SRIO Gen2 (which is only just starting to emerge on products as of early 2012) runs at 5 GHz with the chipsets commonly in use. A 4-lane connection, with the overhead of 8b/10b encoding, yields a raw rate of 4 x 5 Gbit/s x 0.8 = 16 Gbit/s. 10GigE clocks at 3.125 GHz on 4 lanes with the same 8b/10b encoding, so it has a raw rate of 4 x 3.125 Gbit/s x 0.8 = 10 Gbit/s. DDR InfiniBand clocks at 5 GHz, with a raw rate of 4 x 5 Gbit/s x 0.8 = 16 Gbit/s. Mellanox interface chips that support both 10GigE and IB have been available and deployed for some time now, and are considered a mature technology with widespread adoption in mainstream high-performance computing.

Expanded System Architecture

Now consider a system built from fourteen such boards in an OpenVPX chassis with a backplane that conforms to the BKP6-CEN16-11.2.2-n profile. This supports fourteen payload boards and two central switch boards, and yields a nominal interconnect diagram as shown in Figure 2 for the SRIO case. For 10GigE or InfiniBand, the same backplane results in an interconnect mapping that is represented in Figure 3.

Those diagrams do not tell the whole story, however. They would be correct if the central switches shown were constructed from a single, non-blocking, 18- to 20-port switch device. However, this is not the case for all the fabrics. In the 10GigE case, a GBX460 switch card can be used, which employs a single 24-port switch chip. For an InfiniBand system, the IBX400 can be used, which has a single 36-port switch chip where each port is x4 lanes wide. In the case of SRIO Gen2, the switch chip commonly selected is a device that supports 48 lanes, in other words 12 ports of x4 links. In order to construct a switch of higher order, it is necessary to use several chips in some kind of a tree structure. Here a tradeoff must be made of the number of chips used against the overall performance of the aggregated switch.
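A small sketch makes the point about switch construction, assuming the chip port counts quoted above and an illustrative requirement of 18 backplane-facing ports (the article only says "18- to 20-port"):

# Sketch: can one switch chip provide the ports a CEN16 central switch needs?
PORTS_REQUIRED = 18  # illustrative figure within the 18- to 20-port range cited

switch_chips = {
    "10GigE (single 24-port chip)":         24,
    "DDR InfiniBand (single 36-port chip)": 36,
    "SRIO Gen2 (48 lanes = 12 x4 ports)":   12,
}

for fabric, ports in switch_chips.items():
    if ports >= PORTS_REQUIRED:
        print(f"{fabric}: a single non-blocking chip is sufficient")
    else:
        print(f"{fabric}: needs a multi-chip tree, which constrains internal bandwidth")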

Figure 2: This interconnect diagram for the Serial RapidIO use case has an OpenVPX backplane that conforms to the BKP6-CEN16-11.2.2-n profile. This supports fourteen payload boards and two central switch boards.

All-to-All Measurement

When evaluating network architectures, a common approach is to look at an all-to-all exchange of data. This is of interest as it represents a common problem encountered in embedded processing systems: a distributed transpose of matrix data. This is a core function in synthetic aperture radar, for instance, where it is termed a corner turn. It is commonly seen when the processing algorithm calls for a two (or higher) dimensional array to be subjected to a two (or higher) dimensional Fast Fourier Transform. In order to meet system time constraints, the transform is often distributed across many processor nodes. Between the row FFTs and the column FFTs the data must be exchanged between nodes. This requires an all-to-all exchange of data that can tax the available bandwidth of a system.

A simple analysis of this topology might make the following assumptions: there are links between nodes on each board via the onboard switch, there are links to nodes on adjacent cards via links between the onboard switches, and there are 22 connections made via the central switches. In this approach, the overall performance for an all-to-all exchange might be assumed to be determined by the lowest aggregate bandwidth of these three connection types, in other words that of a single link divided by the number of connections. This equates to 4 lanes x 5 Gbit/s x 0.8 encoding / 22 nodes = 0.73 Gbit/s.

If we apply the same simplistic analysis to the 10GigE system, it suggests that the bandwidth available for all-to-all transfers is 4 lanes x 3.125 Gbit/s x 0.8 encoding x 8 connections between switches / 368 paths = 0.217 Gbit/s per path when using 10GigE. That means SRIO has an apparent speed advantage of 3.4 to 1.

However, this is a flawed analysis and gives a misleading impression as to the relative performance that might be expected from the two systems when doing a corner turn. The two architectures are evaluated with different methods: one by dividing the worst-case bandwidth by the number of processors sharing it, and the other by dividing the worst-case bandwidth by the number of links that share it.
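The simplistic per-path division critiqued above can be reproduced in a few lines. Note the assumptions: the interpretation of the shared 10GigE bottleneck as eight inter-switch connections divided among the 368 paths follows the figures quoted in the text, and the exact rounding of the final ratio depends on intermediate precision.

# Sketch: the "simplistic" per-path bandwidth model the article argues is misleading.
srio_link = 4 * 5.0 * 0.8            # 16 Gbit/s per x4 SRIO Gen2 link
gige_link = 4 * 3.125 * 0.8          # 10 Gbit/s per 10GigE link

srio_per_node = srio_link / 22       # worst-case SRIO link shared by 22 nodes
gige_per_path = gige_link * 8 / 368  # 8 inter-switch connections shared by 368 paths

print(f"SRIO:   {srio_per_node:.2f} Gbit/s per node")    # ~0.73
print(f"10GigE: {gige_per_path:.3f} Gbit/s per path")    # ~0.217
print(f"Apparent advantage: {srio_per_node / gige_per_path:.1f} : 1")  # ~3.3-3.4 : 1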

Figure 3: This interconnect diagram has an OpenVPX backplane with an interconnect mapping for 10GigE or InfiniBand.

Architecture Matters

A second potential error is to ignore the internal architecture of each switch device, as this can have an effect in cases where the switch does not have balanced internal bandwidth. However, the biggest flaw is the suggestion that the performance of a non-uniform tree architecture can be modeled by deriving the lowest bandwidth connection in the system. In network theory, it is widely accepted that the best metric for the expected performance of such a system is the bisection bandwidth of the network. The bisection bandwidth of a system is found by dividing the system into two equal halves along a dividing line, and enumerating the rate at which data can be communicated between the two halves.

Reconsidering the network diagram of the SRIO system, the bisection width is defined by the number of paths that the median line crosses, which adds up to 19. Similarly, the bisection width of the 10GigE or DDR IB system would also add up to 19. Given that the link bandwidth for the SRIO system is 16 Gbit/s and for 10GigE is 10 Gbit/s, the bisection bandwidth of the SRIO system is 19 x 16 = 304 Gbit/s, and for the 10GigE system it is 19 x 10 = 190 Gbit/s. This represents an expected performance ratio for the total exchange scenario of 1.6 to 1 in favor of the SRIO system, not the 3.4 to 1 predicted in the simplistic model. If we now replace the 10GigE switch with an InfiniBand switch, which fits the same slot and backplane profiles, the bisection bandwidth is 19 x 16 = 304 Gbit/s. Therefore the performance of DDR InfiniBand matches that of SRIO.
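A short sketch of the bisection-bandwidth comparison just described; the count of 19 crossing links and the per-link rates come directly from the text above.

# Sketch: bisection bandwidth = links crossing the median line x per-link rate (Gbit/s).
CROSSING_LINKS = 19  # paths crossed by the dividing line, per the text

link_rate = {"SRIO": 16, "10GigE": 10, "DDR IB": 16}  # Gbit/s per x4 link

bisection = {fabric: CROSSING_LINKS * rate for fabric, rate in link_rate.items()}
print(bisection)  # {'SRIO': 304, '10GigE': 190, 'DDR IB': 304}
print(f"SRIO vs 10GigE: {bisection['SRIO'] / bisection['10GigE']:.1f} : 1")  # 1.6 : 1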

Bandwidth Calculations: Pipeline Case

Another dataflow model commonly considered is a pipeline, where data streams from node to node in a linear manner. When designing such a dataflow, it is normal to map the tasks and flow to the system in an optimal manner. This can include using different fabric connections for different parts of the flow. A good IPC library and infrastructure will allow the designer to do so without requiring any modifications to the application code; AXIS has this characteristic. Here, for simplicity, it is assumed that the input and output data sizes at each processing stage are the same (no data reduction or increase). In this instance the rate of the slowest link in the chain dictates the overall achievable performance.

Figure 4: Shown here is a pipeline dataflow scheme mapped to the 10GigE system, with the node 1 to node 2 and node 2 to node 3 paths highlighted.

If Task 1 is mapped to node 1, Task 2 to node 2 and Task 3 to node 3, the available paths are shown in yellow in Figure 4 for the 10GigE system. The path from Task 1 to Task 2 is over x8 PCIe Gen2, with an available bandwidth of 32 Gbit/s. The path from Task 2 to Task 3 has access to two 10GigE links, an aggregate rate of 20 Gbit/s. Therefore the minimum path is 20 Gbit/s. In the DDR IB system, the path from Task 2 to Task 3 has access to two IB links, an aggregate rate of 32 Gbit/s. The PCIe link is unchanged, so the minimum leg here is 32 Gbit/s. Now, for the SRIO system, with SRIO paths between node 1 and node 2 and between node 2 and node 3, two separate links are available, so 32 Gbit/s is available for both legs. The result of all this is that the limiting bandwidths for the pipeline use case are 20 Gbit/s for 10GigE, 32 Gbit/s for DDR IB and 32 Gbit/s for SRIO.

Figure 5: The table summarizes the system analyses for the SRIO, 10GigE and DDR InfiniBand systems. The DDR InfiniBand system matches the performance of the SRIO system for both use cases.

Backplane                    Use Case    10GigE       SRIO         DDR IB       SRIO : 10GigE   SRIO : IB
CEN, 14 payload + 2 switch   All-to-all  190 Gbit/s   304 Gbit/s   304 Gbit/s   1.6x            1x
CEN, 14 payload + 2 switch   Pipeline    20 Gbit/s    32 Gbit/s    32 Gbit/s    1.6x            1x

Other Factors to Consider

The push to support open software architectures (MOSA, FACE and so on) is leading the military embedded processing industry to support middleware packages such as the OpenFabrics Enterprise Distribution (OFED) and OpenMPI for data movement. Typically OpenMPI is layered over a network stack, and its performance is highly reliant on how efficiently the layers map to the underlying fabric. Some implementations rely on rionet, a Linux network driver that presents a TCP/IP interface to SRIO. Contrast this with an OpenMPI implementation that maps through OFED to RDMA over 10GigE or InfiniBand, and it can be seen that the potential exists for a large gap in performance at the application level, with RDMA being much more efficient.

Meanwhile, it is sometimes claimed that SRIO is more power efficient than the other fabrics. If we total up the power of the bridge and switch components for each 16-slot system, a truer picture emerges. If you do the math, the power efficiency of all three fabrics comes out roughly on par.

Differences Not Significant

Figure 5 summarizes the system analyses for the SRIO, 10GigE and DDR InfiniBand systems. These show that for both use cases, the simplistic analysis presented elsewhere overestimates the performance advantage of SRIO over 10GigE by a factor of two, and that the advantage is completely attributable to the difference in clock rates. The CEN topology has little to no effect in reality. It also shows that the DDR InfiniBand system matches the performance of the SRIO system for both use cases.

GE Intelligent Platforms, Charlottesville, VA. (800) 368-2738. [www.ge-ip.com].
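As a closing cross-check, here is a brief sketch that reproduces the pipeline-case figures summarized in Figure 5; the node numbering and per-leg rates are those described in the pipeline discussion above, and the dictionary layout is purely illustrative.

# Sketch: pipeline case - the slowest leg dictates end-to-end throughput (Gbit/s).
legs = {
    # leg 1->2 is the onboard x8 PCIe Gen2 path for 10GigE and IB; SRIO uses fabric links for both legs
    "10GigE": {"1->2": 32, "2->3": 2 * 10},      # two 10GigE links on the second leg
    "DDR IB": {"1->2": 32, "2->3": 2 * 16},      # two DDR IB links on the second leg
    "SRIO":   {"1->2": 2 * 16, "2->3": 2 * 16},  # two x4 SRIO Gen2 links per leg
}

for fabric, rates in legs.items():
    print(f"{fabric}: limiting bandwidth = {min(rates.values())} Gbit/s")
# Prints 20 for 10GigE and 32 for both DDR IB and SRIO, i.e. the 1.6:1 ratio in Figure 5.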