Performance Evaluation of a Full Speed PCI Initiator and Target Subsystem using FPGAs

David Robinson, Patrick Lysaght, Gordon McGregor and Hugh Dick*

Dept. of Electrical and Electronic Engineering, University of Strathclyde, 204 George Street, Glasgow G1 1XW, United Kingdom

* Dynamic Imaging Ltd, 9 Cochrane Square, Brucefield Industrial Park, Livingston, Scotland, United Kingdom

Abstract

State-of-the-art FPGAs are just capable of implementing PCI bus initiator and target functions at the original bus speed of 33 MHz. This paper reports on the use of a Xilinx 4000 series FPGA and LogiCore macros to implement a fully compliant PCI card for a specialist data acquisition application. The design required careful performance analysis and manual intervention during the design process to ensure successful operation.

1. Introduction

The Peripheral Component Interconnect (PCI) bus is an important component in high performance, data intensive computer systems [1]. Its maximum bandwidth of 132 Mbytes/s and its support for automatic configuration of peripheral cards have been major forces behind the wide acceptance of PCI in PC and workstation environments. To achieve 100% PCI compliance, a PCI card must adhere to the very strict electrical, timing and protocol specifications imposed by the bus standard. Although mask programmed ASIC devices can easily meet these specifications, FPGA implementations push current technology to the limit [2]. FPGAs that meet the strict specifications are now available, as are third-party PCI interface macros [3]. However, implementing a PCI interface that achieves optimum performance remains difficult because state-of-the-art FPGAs and careful design are needed to meet the timing requirements of the bus.

This paper describes the performance evaluation of a PCI card for a specialist data acquisition application. The aim of the design was to transfer image data quickly over the PCI bus between a proprietary interface to an ultrasound scanner and a graphics controller card. The PCI interface is implemented in a Xilinx XC4013E FPGA using the LogiCore PCI macros. The design is distinguished from earlier reported cards [2],[4] in that it includes both 32-bit PCI target and initiator capabilities operating at the maximum bus speed of 33 MHz. The card is designed to operate under the Windows 95 PC operating system via a custom virtual device driver (VxD). The card also uses a Xilinx XC6216 FPGA to emulate the characteristics of the data source prior to its implementation. PCI bus performance analysis was conducted before the design phase to guarantee the data transfer rate under worst case conditions.

2. System Overview

PCI is a high performance, processor independent, local bus standard. Reflected wave signalling is used to allow CMOS ASICs to interface directly to the bus. Each signal wave propagates to the end of the unterminated bus and is reflected back to the point of origin, doubling the voltage on the bus trace. For 33 MHz operation, the round trip delay can last only 10 ns [1]. This places severe restrictions on the length and capacitive loading of the bus and also on the electrical characteristics of any device connected to it.

The original PCI bus specification stipulated 32-bit data transfer at 33 MHz, offering a maximum bandwidth of 132 Mbytes/s. (A 64-bit, 66 MHz version of the bus has more recently been defined.) Data and address lines are time division multiplexed, reducing the required number of pins to 49 in the 32-bit version. All data is transferred in burst mode, starting with a single address phase followed by one or more data phases.

PCI devices come in two types, Masters and Targets (also referred to as Initiators and Slaves respectively). An Initiator can request access to the bus for data transfer, whilst a Target has to wait until it is accessed by an Initiator. A single PCI card can perform both Target and Initiator functions. Access to the PCI bus is request-based rather than time-slot based and is controlled by a PCI arbiter chip. Each initiator device on the bus has its own request and grant lines connected to the arbiter chip. This allows arbitration to occur while the bus is in use by other PCI devices.

Fig. 1. Image Data Source System Configuration
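For reference, the headline bandwidth quoted above follows directly from the product of the bus width and the clock frequency. A minimal sketch of that arithmetic, using only the figures stated in the text:

```python
# Headline PCI bandwidth: bus width x clock frequency.
BUS_WIDTH_BYTES = 4        # 32-bit data path
CLOCK_HZ = 33_000_000      # 33 MHz PCI clock

peak_bw = BUS_WIDTH_BYTES * CLOCK_HZ
print(f"Peak PCI bandwidth: {peak_bw / 1e6:.0f} Mbytes/s")  # 132 Mbytes/s
```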

The image data source (IDS) card was developed for use in an Intel PC motherboard equipped with a 166 MHz Intel Pentium processor, 16 MB RAM and the Intel 430VX PCISET chipset. Fig. 1 shows the system configuration. Typical PCI systems can support up to 10 PCI loads. An add-in card counts as two loads and a silicon interface counts as one [5]. The motherboard's Host to PCI bridge and the PCI to ISA (Industry Standard Architecture) bridge, shown in Fig. 1, constitute two loads. Up to four add-in cards can be accommodated. Two of these are used for the IDS card and the graphics controller. Fig. 2 shows the block diagram of the IDS card.

Fig. 2. Block Diagram of IDS Card

Back-end functionality is emulated by the XC6216, prior to the availability of the data source circuitry. The slow data port handles command and status information to regulate the operation of the image source. Image transfer to video memory is via the fast data port.

3. PCI Bus Performance Evaluation

The quoted bus bandwidth of 132 Mbytes/s is an optimum parameter derived from the product of the bus width and clock frequency. Determination of the actual bus bandwidth requires examination of the performance of the individual devices and the role of the bus arbiter.

The performance of an individual PCI device can greatly influence the performance of the bus. Data transfers can only be sustained at the speed of the slowest device involved in a transaction. A target device that inserts a single wait state after every data transfer effectively halves the initiator's bandwidth. Devices that only transfer data in small bursts before disconnection incur additional address phase overheads.
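These per-device penalties can be illustrated with a simple throughput model. The sketch below is illustrative only: it assumes one address phase per burst and a fixed number of wait states per data phase, and neither figure is taken from a measurement of the actual system.

```python
def effective_bandwidth(bus_width_bytes=4, clock_hz=33_000_000,
                        burst_len=16, wait_states_per_data=0):
    """Illustrative PCI throughput model: one address phase per burst,
    each data phase stretched by the given number of wait states."""
    clocks_per_burst = 1 + burst_len * (1 + wait_states_per_data)
    bytes_per_burst = burst_len * bus_width_bytes
    return bytes_per_burst * clock_hz / clocks_per_burst

print(effective_bandwidth(wait_states_per_data=0) / 1e6)  # ~124 Mbytes/s for 16-beat bursts
print(effective_bandwidth(wait_states_per_data=1) / 1e6)  # ~64 Mbytes/s: one wait state roughly halves it
print(effective_bandwidth(burst_len=2) / 1e6)             # ~88 Mbytes/s: short bursts pay the address phase often
```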

The PCI specifications are surprisingly vague when describing the algorithm used by the PCI arbiter chip to award bus access: "An arbiter can implement any scheme as long as it is fair and only a single GNT# is asserted on any rising clock" [1]. GNT# refers to the set of grant lines through which all initiators have a unique connection to the arbiter. The important implication of this statement is that the effective PCI bus bandwidth cannot be calculated directly from the bus specifications. Instead it depends on the PCI chipset used in the target system.

The following calculations determine the worst case per-card bandwidth of a PCI system with four initiator cards, based on the 430VX PCI chipset in mode 0 [6]. Fig. 3 shows the bus access priority sequence.

Fig. 3. Bus Access Priority Queue for PCI Arbiter

This scheme grants access priority to one device for every slot in the queue. Each slot lasts the maximum of 256 clock cycles defined by the chipset. Under worst case conditions each device will always request the bus and will perform maximum length transfers. Under these conditions bus access effectively becomes time-slot based. From Fig. 3 one can calculate that the maximum bandwidth for a single initiator card has dropped from 132 Mbytes/s to 16.5 Mbytes/s. Half the available bandwidth is reserved by the arbiter for use by the CPU and ISA bridges. The remaining bandwidth (66 Mbytes/s) is shared equally between the four PCI masters.

4. Software Device Drivers

Almost all hardware devices added to the PCI bus depend on software to function correctly. This can result in a substantial decrease in device performance. Software overheads are typically large and non-deterministic. Windows 95 operates a complex virtual machine (VM) environment [7]. Software running in a VM operates as if it has exclusive access to a particular hardware device. Communication between hardware and software is routed through system VxDs to allow resource arbitration between VMs. The non-deterministic delays incurred are one of the disadvantages of this environment.

All PCI device drivers must support interrupt chaining [1]. Multiple VxDs may respond to an interrupt and a subset of these may invoke further interrupt handler routines. A VxD responding to an interrupt may notify several VMs. These in turn may pass an interrupt on to several software applications. This situation is shown in Fig. 4. The delays experienced depend heavily on the system configuration and current activity.

Fig. 4. Possible Interrupt Flow in Windows 95

5. System Design

The image acquisition card is required to transmit a picture of 640 by 440 pixels at 30 frames per second. Each pixel is described by one byte of data. This necessitates a bandwidth of 8.448 Mbytes/s. During burst writes, the LogiCore initiator interface automatically inserts one wait state per data transfer. This doubles the required bandwidth to 16.896 Mbytes/s. Worst case analysis of the PCI environment established that a single card could only achieve an effective bandwidth of 16.5 Mbytes/s, which is less than the required amount. Under fully loaded conditions, the much quoted maximum PCI bus bandwidth of 132 Mbytes/s and the maximum achievable data transfer rate with a fully compliant card differ by almost an order of magnitude.

Two actions are taken to guarantee the card the required bandwidth (see the arithmetic sketch after this list):

1. The system is limited to a maximum of three PCI master devices. This allows each card a maximum bandwidth of 22 Mbytes/s.

2. Interaction with software is kept to a minimum to avoid poor performance. The VxD configures the card with the information necessary to start a transfer to the graphics controller. Once a transfer has begun, software intervention is limited to infrequent use of control commands only. The IDS card has no support for interrupts.
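Under the assumptions stated above (half the bus reserved for the CPU and ISA bridges, the remainder shared equally between the PCI masters), the worst-case figures and the card's requirement can be reproduced with a few lines of arithmetic. This is only a restatement of the calculations in Sections 3 and 5, not a general model of the 430VX arbiter:

```python
# Worst-case figures from Sections 3 and 5, reproduced as arithmetic.
PEAK_BW = 132e6                      # headline 32-bit / 33 MHz bandwidth

def per_card_bandwidth(num_masters):
    # Half the bus is reserved for the CPU and ISA bridges;
    # the remainder is shared equally by the PCI masters.
    return (PEAK_BW / 2) / num_masters

print(per_card_bandwidth(4) / 1e6)   # 16.5 Mbytes/s with four initiator cards
print(per_card_bandwidth(3) / 1e6)   # 22.0 Mbytes/s when limited to three masters

# Bandwidth required by the image data source card.
width, height, fps, bytes_per_pixel = 640, 440, 30, 1
required = width * height * fps * bytes_per_pixel   # 8.448 Mbytes/s of image data
required_on_bus = required * 2                      # LogiCore inserts one wait state per transfer
print(required_on_bus / 1e6)                        # 16.896 Mbytes/s: above 16.5, below 22
```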

6. FPGA Implementation

Due to increasingly complex designs and faster time to market requirements, individual companies will be unlikely to design all the blocks necessary for their systems in the future. Instead, designers may use system components designed externally [8]. The PCI LogiCore macro from Xilinx is one of the first examples of an intellectual property (IP) building block commercially available for an FPGA to realise the system-on-a-chip concept.

Fig. 5 shows a floorplan of the LogiCore PCI interface after placement and routing.

Fig. 5. Floorplan of the LogiCore PCI Interface

The PCI bus is connected to the pins on the left hand side of the chip in Fig. 5. Free space is plentiful on the right of the chip for user logic. Problems arise when connecting user logic on the right to PCI signals on the left. Timing specifications limit the distance over which a heavily loaded net can be routed. User logic must be carefully placed next to the PCI logic, as near as possible to the source of the required signals.

Default timing constraints in the LogiCore macro limit the delay between any two flip-flops in the user design to a maximum of 30 ns. Combinatorial block delays can easily exceed 30 ns, causing the design to fail the timing specifications, and routing delays also quickly exceed this limit. Path analysis of the user logic can identify paths that should be removed from this default timing group; giving them more realistic constraints relaxes the overall constraints on the partition, place and route (PPR) software. In cases where the timing constraints cannot be adjusted, careful floorplanning must be conducted to meet the maximum path delays.

Detailed knowledge of the device architecture allows the designer to create a more efficient design. For example, each CLB in the XC4013E contains two flip-flops. Every flip-flop has an associated three-state buffer (TBUF) which is connected to a horizontal longline. Columns of TBUFs can be driven by a single vertical longline. These resources can be used to make highly efficient registers that drive a shared bus (Fig. 6).

Fig. 6. Efficient Register in XC4013E

The PPR software does not automatically exploit this feature. Flip-flops are commonly connected to TBUFs in other columns, registers are not aligned with the bus and vertical longlines are not used to control the TBUFs. Floorplanning and using attributes at the schematic level force the PPR software to create a more efficient layout.

Logic duplication was also used in an effort to meet the timing specifications. Duplicating the generation of certain control signals increased the probability of the PPR software satisfying the timing constraints. Register control signals benefited most from this technique. The XC4013E is segmented into quadrants to make efficient use of longline resources. 32-bit registers were split across two quadrants, 16 bits in each quadrant. Providing separate, yet identical, control lines to each half of the register allowed the PPR software to place and route the register within the timing constraints.
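The flip-flop-to-flip-flop budget described above can be thought of as a simple sum of path segments checked against the 30 ns constraint. The sketch below is purely illustrative; the per-segment delays are placeholders chosen to show how quickly the budget is consumed, not XC4013E datasheet figures.

```python
# Illustrative slack check against the 30 ns flip-flop-to-flip-flop constraint
# quoted for the LogiCore macro. Segment delays are placeholder values only.
CONSTRAINT_NS = 30.0

path_segments_ns = {
    "source CLB clock-to-out": 3.0,
    "CLB logic level 1":       4.5,
    "routing hop 1":           6.0,
    "CLB logic level 2":       4.5,
    "routing hop 2":           6.0,
    "destination setup":       3.0,
}

total = sum(path_segments_ns.values())
slack = CONSTRAINT_NS - total
print(f"path delay {total:.1f} ns, slack {slack:+.1f} ns")
# Two logic levels with long routes already leave only ~3 ns of margin;
# a third level or a heavily loaded net pushes the path past 30 ns.
```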

A critical point is quickly reached where no additional logic can be added to the circuit without consistently failing the timing specifications. This effect removes one advantage of using FPGAs, namely the ability to add and remove extra circuitry for debugging purposes. Simple tasks, such as routing signals to I/O pins for monitoring with an external logic analyser, become impossible.

To achieve a working 33 MHz PCI initiator and target design, the LogiCore macro itself must be modified. The macro generates a data valid control signal, called DATA_VLD. This signal is derived from circuitry in both the Target and Initiator sections of the LogiCore macro. For its operation, the IDS design needs to decompose DATA_VLD back into two separate signals: one representing DATA_VLD for the Target and the other representing DATA_VLD for the Initiator. The effect of splitting DATA_VLD outside the macro is to heavily load nets that are already critical with respect to the timing specifications. The same goal can be achieved more simply by altering the macro slightly to provide external access to the two signals that are combined inside the macro to create DATA_VLD. The same technique was repeated several times with more complex signals.

7. Practical Results

The chip resource utilisation of the final design can be seen in Table 1 and the floorplan in Fig. 7.

Table 1. Resource Utilisation

  Resource                     Available   Actual Used   % Used
  CLBs                              576          403       69%
  Bonded I/O Pins                   160          123       76%
  F & G Function Generators        1152          458       39%
  H Function Generators             576          118       20%
  CLB Flip-flops                   1152          395       34%
  IOB Input Flip-flops              192           45       23%
  IOB Output Flip-flops             192           44       22%
  3-State Buffers                  1248          353       28%
  3-State Half Longlines             96           32       33%

It can be seen from Table 1 that the design uses a large number of three-state buffers. These are used by both the LogiCore interface and the user application to drive a shared internal bus. To avoid bus contention, which could easily destroy the device, extra care must be taken during the design and simulation phases. Treating the LogiCore macro as a complete black box may lead to designs that fail in the long term.

Fig. 7. Floorplan of Complete PCI Design

The frame rate of the final design was measured using a Hewlett Packard 1671D logic analyser and a FuturePlus PCI Preprocessor FS16P64 card. When the IDS was the only initiator device on the PCI bus, and the CPU was performing no background processing, a frame rate of 213 frames per second (fps) was measured. This corresponds to a bandwidth of 119.96 Mbytes/s, the product of the image width, the image height, the frame rate and the wait states inserted by the LogiCore macro.
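The quoted figure can be checked with the same arithmetic used in Section 5. A minimal sketch, taking the factor of two to be the single wait state the LogiCore macro adds per data transfer, as noted earlier:

```python
# Reproducing the measured bus bandwidth from the observed frame rate.
width, height, fps = 640, 440, 213
wait_state_factor = 2          # one wait state per data phase doubles bus occupancy

bus_bytes_per_second = width * height * fps * wait_state_factor
print(f"{bus_bytes_per_second / 1e6:.2f} Mbytes/s")   # 119.96 Mbytes/s, matching the quoted value
```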

8. Conclusions

Performance evaluation of the PCI bus revealed important issues concerning the bus's bandwidth. The omission of the arbiter algorithm from the PCI specification prevents a complete analysis of the bus's performance. Worst case bandwidth can only be calculated relative to a specific PCI chipset, so bus performance is limited by the quality of an individual vendor's PCI chipset. Analysis of the bus with a specific chipset revealed that the maximum worst case bandwidth for a single card was almost an order of magnitude lower than the headline bus bandwidth.

FPGA implementations of a PCI bus interface remain difficult, even with state-of-the-art devices. As can be seen from Fig. 7, modern, high-capacity FPGAs have ample resources to implement PCI interfaces; the XC4013 is a smaller member of the XC4000 family. PCI design nevertheless pushes FPGAs very close to the limits of the technology with respect to the operating speed of the logic. Designs must be hand crafted to ensure correct logic operation.

Using the LogiCore interface greatly accelerated the design process. Creating a highly efficient IP building block that is both flexible and easy to use is difficult, and trade-offs have to be expected. While the LogiCore macro is not quite a black box solution, modifying aspects of it was far easier than designing a complete PCI interface from scratch would have been.

9. Acknowledgements

The design and development of the PCI system was a joint project between the University of Strathclyde and Dynamic Imaging Ltd. The authors wish to gratefully acknowledge the support of all the staff who helped to make it successful.

10. References

[1] PCI Local Bus Specification, Revision 2.1, PCI Special Interest Group, USA, 1995.
[2] Fawcett, B. K., "Designing PCI Bus Interfaces with Programmable Logic", Eighth Annual IEEE International ASIC Conference, Austin, USA, 1995.
[3] LogiCore PCI Master and Slave Interface User's Guide, Version 1.1, Section 8.1, Xilinx, USA, 1996.
[4] Luk, W. and Shirazi, N., "Modelling and Optimising Run-Time Reconfigurable Systems", Proc. IEEE Symposium on FPGAs for Custom Computing Machines, USA, 1996.
[5] "The Peripheral Component Interconnect Bus", X-Note Number 5A, Xilinx, USA, January 1995.
[6] Intel 430VX PCISET, Section 4.6.1, Intel Corporation, USA, 1996.
[7] VtoolsD, Version 2.01.001, Vireo Software, USA, 1995.
[8] Holmberg, P., "Core-Based FPGA Designs", Special Report: "Intellectual Property: Reusable Cores & Macros", http://www.pldsite.com/, 1996.