Avoid Bottlenecks Using PCI Express-Based Embedded Systems


Implementing efficient data movement is a critical element in high-performance embedded systems, and the advent of PCI Express has presented us with some very effective approaches to optimizing data flow.

BY JIM HENDERSON, INNOVATIVE INTEGRATION

A good friend will help you move, but a trusted friend will help you move the bodies. Fortunately, I haven't any direct experience to substantiate this old quip, but I chuckle at the premise. Embedded system architects occasionally need help moving as well, and for this we turn to our trusted friends: DMA, bus mastering, shared memory and coprocessing.

Embedded systems are ubiquitous, used within communications installations, automobiles, cell phones and even table saws. Most systems interact with their environment through transducers connected to analog-to-digital and digital-to-analog converters (ADCs/DACs), communications ports such as USB and Ethernet, and myriad other specialized devices. It is commonplace for such devices to generate or consume gigabytes per second, well beyond the capacity of a typical embedded processor to manipulate directly. Usually, this problem is addressed by including some form of coprocessing or direct memory access (DMA) within the embedded system.

For example, consider the narrowband receiver inside a software radio depicted in Figure 1. A wideband IF signal is digitized at 250 MHz with 16-bit resolution, then down-converted to produce 12.5 MHz of narrowband data. This operation might be implemented using a COTS device such as the TI GC6016 DDC, or using custom VHDL firmware within an FPGA such as a Xilinx Virtex-6. Regardless, the output data rate is substantially reduced: the complex, 16-bit output samples are produced at 12.5 MHz, equivalent to just 50 Mbyte/s. However, even this mitigated rate may represent a substantial burden to an embedded processor. It might be necessary to implement a decoder to bring the signal to baseband, reducing the data rate by another order of magnitude before it is suitable for the embedded controller. In this scenario, the efficiency of data flow is improved by delegating the per-sample manipulations to the coprocessor, reducing the bandwidth from 500 Mbyte/s to just 5 Mbyte/s.
The bandwidth of the embedded computer's bus is preserved by reducing its load.

Figure 1: Digital down converter implementation.

But in some situations, coprocessing is not feasible or desirable. Implementing the necessary coprocessing functionality might exceed available FPGA resources. Or, such computations may be deferred or implemented on another system in non-real time. For example, it is commonplace to capture wideband recordings of signals to disk, subsequently analyzing the waveforms in environments such as MATLAB. Multipath, Doppler fading and other impairments on a cellular transmission might be captured by a portable 250 MHz IF recorder mounted in the trunk of a car driving throughout a city. These real-world signals could subsequently be used to stimulate prototype receiver equipment in a lab. What architecture is required to implement such a system at these rates?

Direct memory access (DMA) is a feature of many embedded systems that allows I/O subsystems to access memory-mapped peripherals independently of the embedded CPU, microcontroller or digital signal processor. Once initialized, a DMA controller autonomously reads from source memory addresses and writes to destination addresses. Typically, DMA controllers initiate data movement when signaled by an external stimulus, such as a FIFO level signal indicating that data can be read from or written to a device. Many modern DMA controllers provide substantial addressing flexibility, allowing transactions to and from fixed addresses (memory-mapped FIFOs), address ranges (shared memory) and even non-contiguous ranges (image buffers). Since DMA responds to the stimulus signal independently of the CPU and is granted higher priority than the CPU in accessing the bus, DMA transactions are a mainstay for reliable, deterministic data flow in real-time applications.

Nearly all desktop PCs and many embedded systems now incorporate PCI Express (PCIe) as the communications bus between I/O cards and host system memory. PCIe is a standard, point-to-point serial implementation that has displaced the earlier (parallel) PCI standard, the dominant expansion card interface until circa 2005. PCIe gained market acceptance very rapidly, primarily due to the signal routing density intrinsic to its serial implementation and its correspondingly scalable performance. PCIe traffic flows in lanes, each implemented as a matched pair of differential signal pairs forming a bidirectional data path. Data flow is 8b/10b encoded at a bit rate of 2.5 Gbit/s in the PCIe v1 standard, providing a rated 200 Mbyte/s for each lane. PCIe v2-compliant hardware doubles that bit rate to 5.0 Gbit/s, doubling throughput. To support high-bandwidth devices, lanes may be bonded to improve throughput.
For example, graphics adapters in desktop PCs often bond sixteen v2-compliant lanes to establish an effective 6400 Mbyte/s bidirectional link between system memory and the GPU (graphics processing unit).

Use of PCIe is not restricted to the desktop form factor. VITA Standard 42 combines the popular PCI Mezzanine Card (PMC) format with the PCIe serial fabric to support standardized PCIe peripheral cards in embedded applications, known as XMC modules (Figure 2). Likewise, VITA Standard 46 supports use of PCIe within VPX platforms, the popular rugged successor to VME (Figure 3).

Figure 2: XMC module with FPGA and analog I/O.
Figure 3: Embedded VPX system.

To effect a transfer, a PCIe device first arbitrates with the host processor(s) for memory bus ownership. Once granted, the master writes or reads small packets of 32- or 64-bit words to another memory-mapped peripheral, the slave. Packet sizes are governed by cache memories within the PCIe chipset; these are typically less than 2K words, and frequently much smaller on the chipsets found in embedded systems. The slave device accepts or rejects each packet by emitting an acknowledgement or negative acknowledgement, respectively. This protocol provides flow control when the master sources data more rapidly than the slave can accept it.

PCIe cards typically implement embedded data movement engines that perform bus mastering, a specialized form of DMA transfer. Bus mastering offers several advantages over a legacy DMA controller. While both DMA and bus mastering arbitrate with the host processor for access to the host memory bus (exhibiting similar transaction latency), each legacy DMA transaction involves both a bus read and a bus write. A bus-mastering peripheral, in contrast, can access its onboard I/O devices via a local bus, so a transfer to system memory or to another memory-mapped PCIe peripheral requires only a bus write operation, effectively doubling the efficiency of DMA.

Bus-mastering transfers occur between a PCIe device and another memory-mapped peripheral, but the latter need not be memory per se. For instance, one PCIe device may transfer directly to another, bypassing host system memory. PCIe systems can incorporate switch devices that allow multiple attached devices to communicate locally, mitigating traffic on the primary system bus. Traffic between such devices behaves as if they were interconnected directly. Switch devices are typically configured either statically from flash ROM during system initialization or dynamically by the host processor during OS initialization. Once configured, even persistent PCIe traffic between local devices consumes no system bus bandwidth.

The high-speed cellular data storage application referenced earlier seems a suitable candidate to exploit inter-device PCIe communications. One would assume that a high-speed PCIe digitizer could bus master directly to a RAID disk controller to implement wideband recording, without loading the host system bus or processor. While local, inter-board communication is exploited frequently in VPX platforms to share data between two FPGA cards, it is rarely feasible when mixing PCIe devices from multiple vendors.
Most RAID controllers, for instance, are closed designs provided with drivers only for popular operating systems such as Windows or Linux. These drivers perform all low-level access to the array and are the exclusive vehicle for interacting with it. The manufacturers do not document or support use of the RAID controller as a slave device.

The zero-work solution is to accept the implicit inefficiency of a single bus-mastering transfer from the source PCIe device to host system memory. If acquiring at a relatively modest rate like 250 MHz from 16-bit ADC devices, this represents a load of only 1000 Mbyte/s on the system bus (500 Mbyte/s for the bus-master transfer from the acquisition device to system memory, and another 500 Mbyte/s when the RAID controller bus masters from system memory to the RAID array), which is well within the available bandwidth of modern motherboards and RAID subsystems.

But if the data rate escalates, this approach becomes increasingly unattractive. A typical i7-based embedded system provides approximately 6-8 Gbyte/s of aggregate system memory bandwidth. If the acquisition card samples at 2 Gsamples/s at 12-bit resolution with no packing, the system bus utilization becomes an unmanageable 8000 Mbyte/s for real-time storage to the array. Interestingly, CPU utilization is nearly zero: the CPU is involved only occasionally, to initiate a write of data to the array when the acquisition device completes each bus-mastering transfer. But the minuscule CPU usage is irrelevant in this application; we're bus-bound.

To address this problem, it is necessary to redesign the interface on the acquisition device. Instead of using a bus-mastering interface, in which the module autonomously delivers data into host system memory and consumes host memory bus bandwidth, available memory on the acquisition card must be mapped as shared memory. Acquired samples from the ADC on the card are stored into consecutive locations in onboard memory, similar to a FIFO.
But these locations are also mapped into the memory space of the host, where they appear as a memory bank addressable by the host CPU and the RAID controller. The key advantage of this approach is that during acquisition, samples appear in the host-addressable memory region without requiring host memory bus arbitration or write cycles; effectively, the data transfer is free. Of course, when the RAID controller reads from this memory area, the bus is utilized normally and a fraction of the available bus bandwidth is consumed. However, the utilization is halved compared to the bus-master scenario described earlier. On modern acquisition cards that employ reprogrammable FPGAs, this sort of repurposing is relatively easy to implement. What a difference compared to the cut-and-jumper world in which we used to live!