Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck


Volker Lindenstruth

The continued increase in Internet throughput and the emergence of broadband access networks drive the development of communication processors. Other developing arenas for the application of intelligent I/O are system area networks and storage area networks, used to cluster computers and mass storage systems, respectively. However, for all the advantages of such devices, they share a common memory bottleneck originating from the internal bus that connects the I/O ports with the internal processor core. This paper presents a novel intelligent I/O architecture that eliminates this bottleneck by implementing a novel data transfer controller that grants a four-fold improvement over conventional designs.

I. INTRODUCTION

The amount of processing related to any kind of input/output is increasing, as is the throughput demand. In addition, latency and processor overhead are becoming increasingly important factors in the performance of input/output systems. One consequence is the intelligent I/O paradigm [4]. Communication processors are being used in various applications, ranging from intelligent port cards for scalable multi-protocol routers [2] to intelligent network interfaces that use local intelligence to relieve the host processor of low-level network transactions and to reduce the interrupt rate, in order to gain scalability of parallel computers and clusters. Other applications include intelligent network ports that perform digital video compression for the particular network link, and data encryption/decryption in the network interface, which allows the transparent use of public networks for protected data transfers.

Various processor architectures are being studied or have been implemented [1, 3]. The data paths in a communication processor are typically connected by an internal bus.
The advantage is that all devices connected to this bus can exchange data with one another without requiring additional hardware. The disadvantage is that buses are broadcast-type devices: any data sent is visible to all devices and blocks the bus entirely. In the case of unified transaction buses, any response latency of a read transaction stalls the bus. Pipelined arbitration and split transaction buses help to utilize a bus optimally, but the principal limitation of supporting only one master at any point in time remains.

As one example, Figure 1 sketches the architecture of the Intel 960Rx Intelligent I/O Processor as published in [3]. It implements two ports, which allow direct, mutual data exchange using the internal PCI bridge, but also direct data access by the internal microcontroller to either of the two ports. If, however, data has to be handled by the processor, it cannot pass through the PCI bridge but has to be received by the internal processor, requiring the use of the internal data bus.

Figure 1: The Intel IOP architecture (block diagram labels: Incoming Data, Outgoing Data, Instruction Fetches, Bus MIU, DMA Interface 2, Processor, Internal data bus, PCI-PCI Bridge, Local Bus BIU, DMA Interface 1)

Consider one typical scenario: an Ethernet packet is received and has to be inspected by the processor in order to determine whether it is a data or control packet and where to route it inside the host computer, based on the packet header and routing information stored within the IOP. The packet arrives, for example, at port 1 and is routed through DMA interface 1 to the internal data bus. Since the internal memory is typically small, the data may have to be sent to the back-end bus through the memory bus interface unit (MIU). The internal processor then needs to inspect the data, again using the internal data bus and the MIU.
Note that during the data transmission the processor is unlikely to be able to perform any accesses to the internal bus and is thus completely blocked. Any access to the packet has to be done that way. Therefore, any data word in the packet that has to be modified, for example when implementing a data encryption/decryption algorithm, needs to be read and written, involving two accesses to the internal data bus. Any cache miss resulting in additional instruction and data fetches that do not address the internal memory also has to use the internal bus. Then, if the packet is to be forwarded to the second port, it again uses the internal bus in order to be transmitted to port 2 through DMA interface 2.

In this scenario, at least four bus accesses per data word of the packet are required (one to store the incoming word, a read and a write by the processor, and one to forward the word), plus any instruction and data fetches caused by cache misses of the microprocessor. So in order to sustain the full port throughput, the internal data bus would have to run at more than four times the port speed. The example also shows that the given scenario basically implements a store-and-forward scheme rather than a cut-through paradigm. The situation could be improved slightly by adding dual-ported memories or FIFOs to the DMA interface, but the internal data bus bottleneck remains. This is a fundamental design problem of all bus-based architectures.
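The access count in the scenario above can be tallied in a few lines. The following is a minimal sketch, not part of the original design: the function names and the flags `inspect`, `modify` and `forward` are hypothetical, and cache-miss traffic is passed in as an assumed average per word.

```python
def internal_bus_accesses_per_word(inspect=True, modify=True, forward=True,
                                   cache_miss_accesses=0.0):
    """Count internal-bus accesses per packet data word in a bus-based IOP.

    Hypothetical helper for illustration; the flags mirror the steps of the
    scenario in the text (store, processor read, processor write, forward).
    """
    accesses = 1.0                 # DMA interface 1 stores the incoming word
    if inspect:
        accesses += 1.0            # processor reads the word
    if modify:
        accesses += 1.0            # processor writes the modified word back
    if forward:
        accesses += 1.0            # DMA interface 2 fetches the word for port 2
    return accesses + cache_miss_accesses


def required_bus_rate(port_rate, accesses_per_word):
    """Internal bus rate needed to sustain `port_rate` (same unit as the input)."""
    return port_rate * accesses_per_word


if __name__ == "__main__":
    n = internal_bus_accesses_per_word()
    print(n, required_bus_rate(100.0, n))
```

With all steps enabled and no cache misses the count is four, which matches the "more than 4x" bus speed stated in the text once cache-miss fetches are added.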

It is common for any network interface or router port card to require cut-through elasticity buffers at the input and output ports, which decouple the various data streams with respect to access latencies such as arbitration delays, different clock domains, and the like. Typically, such buffers are implemented as FIFOs. FIFOs, in turn, are typically based internally on a dual-ported memory with independent read and write pointers at the respective ports.

Figure 2: Multi-port memory architecture (memory cell schematic with transistors Q1 through Q8, bit lines bit1 and bit2, and select lines Select1 and Select2)

Figure 2 shows the principle of operation of a dual-port memory. Q1 through Q4 form the internal storage cell holding the bit, while Q5, Q6 and Q7, Q8 are used to read out the bit at the two ports, respectively. It becomes obvious that further independent readout paths can be added without affecting the existing data path.

II. ARCHITECTURE

Combining a memory that has at least three ports with a processor and two external ports yields a new class of high-performance IOPs, as sketched in Figure 3. Using the same scenario as above, the packet is received at port 1 and flows directly into the triple-port memory. At any point in time, the internal processor can access any already received portion of the data, using its processor bus and its private port to the triple-port memory, without affecting the data stream at all. Any data conversion can happen basically one clock cycle after the word was received. As soon as the processor is done processing, the packet can be forwarded to the second port. If that port runs at the same speed, no further handshake is required and the transmission can start immediately, even if the packet has not been received completely, thus implementing cut-through routing.

If address translation and access control tables are also placed in the triple-port memory, this routing functionality can even be implemented by simple state machines with very little overhead. This provides PCI bridge functionality, as indicated in the figure (pass-through logic). In that case, the triple-port memory functions as an elasticity buffer for both input and output ports without requiring any movement of data at all, resulting in minimal latency and power consumption. Which packets are to be forwarded directly and which require processor intervention can easily be determined by the DMA interfaces, as they can snoop the packet headers at their respective input ports.

Figure 3: The IOP architecture (block diagram labels: Processor, Processor Bus, Triple Port, DMA Interface 1, DMA Interface 2, Pass-Through-Logic, Local Bus BIU, Bus MIU)

III. IMPLEMENTATION

To prove the presented concept and its operability, a prototype was built from commercial off-the-shelf components. In the long run, the majority of the logic can be integrated, reducing cost. The prototype implements a PCI-SCI [5] system area network adapter. However, the architecture is limited neither to the system area network environment nor to the particular network transport. Rather, it allows very flexible adaptation between various bus and network standards.

Figure 4: First implementation of the IOP

The network interface implements protocol conversion between the unified transaction bus PCI and the split transaction network SCI. Incoming SCI requests are address translated and executed as PCI master transactions, provided proper access rights are configured in the internal access control tables.

PCI requests are translated transparently into split transactions. In the case of PCI write bursts, the data is posted within the NIC and appropriate network write requests are produced. Errors, which would then be reported out of sequence, have to be dispatched by an appropriate error handler. For debugging, writes can be made synchronous. PCI read transactions operate against a local cache. On a cache hit, the data resides inside the multi-port memory and is returned immediately. On a read miss, the host is stalled by issuing PCI retry cycles, and appropriate read request packets are generated at the network port. As soon as the successful read response is received, the internal cache tags are updated and the stalled read transaction is completed.

Figure 5: SCI to PCI DMA transaction while the microcontroller accesses the elasticity buffer

Figure 5 shows a board-level simulation of a burst of incoming 64-byte SCI packets being stored in the multi-port RAM. The data is forwarded immediately (here with 12 clocks delay) to the PCI front-end bus. This latency includes all address translation, access control and buffer management inside the device, which is carried out on the fly. The transmission of one 64-byte packet takes 20 clocks, of which 16 clocks are the data itself (32-bit bus), plus one address phase and three clocks for the fly-by address translation. As shown in the following section, these extra clocks are not required in principle and are only an implementation artifact of the particular architecture of the multi-port memories chosen. Should the device be integrated into an ASIC, these clocks can be saved. Even so, assuming 33 MHz PCI, 20 clocks per 64-byte packet correspond to 107 MB/sec aggregate throughput (64 bytes every 20 clocks at 33.3 MHz). In order to demonstrate the independence of the various memory ports, the PCI and SCI transactions happen simultaneously. In addition, the microcontroller performs read/write transactions to the multi-port memory core, as shown in Figure 5. All these transactions happen completely asynchronously and independently of any of the other bus transactions.

IV. NETWORK TO PCI TRANSACTIONS: FLYBY ADDRESS TRANSLATION

Any incoming network packet needs to be routed according to whether it is a data or control packet. In the case of data packets, address translation and access control have to be performed in order to allow the data to be copied directly to the destination buffer in the host's memory, thus implementing zero-copy network communication. Figure 6 sketches the translation scheme, here assuming a 4 kByte page size. Some bits from both the source network ID and the target address fields are used to identify the entry in the address translation table. Here, nine bits were chosen without restricting the general case.

Figure 6: Address translation scheme (labels: SCI Source ID, SCI Address, SCI->PCI Lut, CTL, PCI Address)

An address translation requires the new address to be read from the ATE and typically a write transaction to generate the updated address. Such transactions typically cause bus cycles and thus cost memory bandwidth and add latency, as the packet cannot be forwarded prior to the completion of the address translation.

Figure 7: Triple-port memory layout for FlyBy address translation (labels: SCI SourceID, SCI Subaddress, Address Translation Index, buffer 1, buffer 2, buffer 3, ..., buffer N, Address Translation Table, address domain A, address domain B, page size)

Figure 7 shows the triple-port memory layout for the FlyBy address translation. The incoming data address in the transport address space is snooped, together with the sender's network source ID, by the DMA controller handling the incoming packet while it is storing this packet in an available buffer (here buffer 2). Since this information is part of the packet header, it is available before the first data word is received. The address translation table index is derived from the snooped packet header according to the scheme depicted in Figure 6. It is forwarded immediately to the PCI state machine, which immediately starts requesting the PCI bus. The first word to be transmitted is the target address, which is composed of the lowest 12 bits of the target subaddress, taken directly from the packet header (4 kByte pages), and the translated address taken from the address translation table.

The PCI port of the triple-port memory (TPM) is implemented as two separate memories with independent data and address buses, with the least significant word (LSW) being 12 bits wide. All other ports of the TPM have their address buses connected. For the PCI bus port, however, it is now possible to select different regions of the multi-port memory for the lower and upper data bus word and thus to compose the translated address without a single additional memory reference. The buffer slot number and address translation index are known a priori. The address of the least significant data bus word is driven such that the target address in the packet header is selected. The address of the most significant word is driven to select the appropriate index of the ATE stored in the TPM. Therefore the correct target address is assembled in FlyBy without any additional memory references. As indicated in Figure 7, additional bits are provided that allow access-controlled and write-protected memory regions to be defined. Upon completion of the PCI burst, an appropriate network acknowledgement is generated in order to complete the split transaction protocol with the peer requestor.
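The address composition described above can be modeled in software as follows. This is a minimal sketch under stated assumptions: the text only says that nine bits from the source ID and target address select the table entry, so the particular split used in `att_index` (three source ID bits, six page-number bits) is hypothetical, as are the function names and the `(pci_page, writable)` entry format. In the hardware the two address halves are merged by wiring on the PCI port, not by a lookup loop.

```python
PAGE_BITS = 12     # 4 kByte pages, as assumed in the text
INDEX_BITS = 9     # nine index bits, i.e. a 512-entry translation table
SOURCE_BITS = 3    # hypothetical: how many index bits come from the source ID


def att_index(source_id, sci_address):
    """Form the table index from source ID bits and page-number bits (assumed split)."""
    addr_bits = INDEX_BITS - SOURCE_BITS
    page_number = sci_address >> PAGE_BITS
    high = (source_id & ((1 << SOURCE_BITS) - 1)) << addr_bits
    low = page_number & ((1 << addr_bits) - 1)
    return high | low


def translate(att, source_id, sci_address, write=True):
    """Compose the PCI target address from the table entry and the page offset.

    `att` is a list of 512 entries, each either None (no access) or a
    (pci_page, writable) tuple; this entry format is an assumption.
    """
    entry = att[att_index(source_id, sci_address)]
    if entry is None:
        raise PermissionError("no translation entry: access denied")
    pci_page, writable = entry
    if write and not writable:
        raise PermissionError("region is write protected")
    offset = sci_address & ((1 << PAGE_BITS) - 1)   # lowest 12 bits pass through
    return (pci_page << PAGE_BITS) | offset


if __name__ == "__main__":
    att = [None] * (1 << INDEX_BITS)
    att[att_index(0x5, 0x12ABC)] = (0xABCDE, True)
    print(hex(translate(att, 0x5, 0x12ABC)))
```

The upper bits of the result come from the table entry and the lower 12 bits from the packet header, which mirrors the split of the PCI port into two independently addressed memories.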
In order to allow a larger subaddress region than the window defined by 512 entries of 4 kByte pages, the address translation table is implemented as an address translation cache with appropriate ATT tags.

V. PCI TO NETWORK TRANSACTIONS

PCI to network transactions are more complex, as they are unified transactions and need to be broken up into request/response transactions. In the case of a PCI write burst, the target address is stored together with the corresponding data in an available buffer in the triple-port memory. After the target address is received (after the first clock), the outbound address translation and access control are performed and an appropriate network header is generated. Once the data is completely received, the write packet is queued for shipment to the network and completion of the posted write is signaled to the PCI host. For debugging, this scheme can be made synchronous, however at a corresponding loss of performance.

Read transactions are more complex still, as they require the remote data to be available before the transaction can be completed. The cycle starts similarly to a write transaction: the PCI target address is stored in an available buffer slot. The address is also snooped by the PCI state machine, which uses address bits 6-11 as the index into a direct-mapped data cache. The cache tag is read from the same port of the triple-port memory. This is possible since, after one bus turnaround cycle, the initiator is waiting to read and thus keeps its data outputs in high impedance. If this is the first request of its kind, the cache tag will be invalid (V bit clear or non-matching address tag), requiring a network read request, which is generated similarly to the PCI write transaction. At this point the PCI requestor stalls.
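The read path can be illustrated with a small software model of the direct-mapped tag lookup. This is a sketch only: the class and method names are hypothetical, the tag fields (V, R, AdrTag, BufId) follow the Figure 8 legend, and using the address bits above bit 11 as the address tag is an assumption. The hardware performs these steps in state machines rather than in code.

```python
from dataclasses import dataclass

BLOCK_BITS = 6    # 64-byte cache blocks
INDEX_BITS = 6    # address bits 6-11 index the direct-mapped cache


@dataclass
class CacheTag:
    valid: bool = False         # V: Data/BufID valid
    request_sent: bool = False  # R: AdrTag valid, Data/BufID invalid
    adr_tag: int = 0
    buf_id: int = 0


class ReadCache:
    """Hypothetical model of the transparent PCI-to-network read cache."""

    def __init__(self):
        self.tags = [CacheTag() for _ in range(1 << INDEX_BITS)]
        self.network_requests = []   # block addresses of issued read requests

    @staticmethod
    def _split(addr):
        index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
        adr_tag = addr >> (BLOCK_BITS + INDEX_BITS)   # assumed tag bits
        return index, adr_tag

    def pci_read(self, addr):
        """Return ('hit', buf_id, offset) or ('retry', None, None)."""
        index, adr_tag = self._split(addr)
        tag = self.tags[index]
        if tag.valid and tag.adr_tag == adr_tag:
            return ("hit", tag.buf_id, addr & ((1 << BLOCK_BITS) - 1))
        if not (tag.request_sent and tag.adr_tag == adr_tag):
            # First request of its kind: mark the tag and issue a network read.
            self.tags[index] = CacheTag(False, True, adr_tag, 0)
            self.network_requests.append(addr & ~((1 << BLOCK_BITS) - 1))
        return ("retry", None, None)   # host is stalled with PCI retry cycles

    def read_response(self, block_addr, buf_id):
        """Update the tag once the read response sits in buffer `buf_id`."""
        index, adr_tag = self._split(block_addr)
        self.tags[index] = CacheTag(True, False, adr_tag, buf_id)


if __name__ == "__main__":
    cache = ReadCache()
    print(cache.pci_read(0x12345))      # miss: request issued, host retries
    cache.read_response(0x12340, buf_id=3)
    print(cache.pci_read(0x12345))      # hit in buffer 3 at offset 5
```

Repeated retries of the same address do not issue further network requests in this model, because the R flag records that a request is already outstanding.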
Figure 8: Transparent PCI to network transaction cache tag (tag fields: V, R, PCIAdr, AdrTag, BufId, S, TagSel)
V (valid): Data/BufID valid
R (request sent): AdrTag valid, Data/BufID invalid
S: this is a single transaction cache tag (include Tag)
X: ignore

At some point in time, a read response is received and stored anywhere in the input buffer space. Upon arrival of the read response packet, the cache tag is updated accordingly by writing the correct address tag and BufID. This can easily be done while the requestor is locking the PCI port of the triple-port memory by using, for example, the microcontroller port of the memory. As soon as the requestor sees the valid read cache tag, it matches the address tag, uses the BufID entry to calculate the correct address of the read data, and completes the transaction. Any further request to this memory block (here 64 bytes) results in an immediate cache hit and is therefore fast. A large variety of caching and prefetching strategies is conceivable.

Given the high cost of triple-port memories, an additional backing store is provided that allows a larger second-level cache to be implemented on the network interface. Data is moved between the triple-port memory and the backing store using FlyBy DMA.

All accesses to the device are intercepted by the packet state machine described above. This functionality can also be used for accesses to the local CSR space of the device, which is not just a memory-mapped region of hardware status bits. The device's control and status region is a specific memory region that is treated like any other transaction by the hardware. However, the firmware interprets the CSR read/write commands and executes them accordingly. In order to reduce latency, a mirror of the internal CSR status can be maintained at a defined location in the host memory. Given this architecture, it is possible to implement any memory map or CSR layout without changing any hardware. Basically any software interface can be accommodated by adapting the firmware accordingly.

VI. OUTLOOK

Given the high flexibility of the device, it can implement many input/output architectures, including I2O and VIA. Furthermore, its applicability is not limited to the particular choice of SCI as network transport. To demonstrate the flexibility of the architecture, a potential application to another network architecture, InfiniBand, is outlined here. The baseline InfiniBand Architecture is modular and flexible, implementing separate functions such as the Host Channel Adapter (HCA) and the IB switch. However, any host will require both functions to be present. This results in unnecessary additional latency, overhead and silicon real estate.

Figure 9: transparent PCI to network transaction cache tag (block diagram labels: CPU, CPU, Mem, Mem Ctl., IOP, Switch Control, MultiPort)

Figure 9 sketches an application of the discussed multi-ported IOP to the InfiniBand architecture. Here, the necessary switch is combined with the host channel adapter, using the multi-port memory. The device is further supplemented by the intelligent I/O processor.

The multi-port memory can supply one port to each IB link, thus allowing completely independent and asynchronous data exchange between any ports. Data being received can be forwarded to any port without the need to move a single bit of the message. Cut-through routing is a simple consequence. The available elasticity buffer space can be assigned dynamically to any channel. Any port of the switch implemented here can be operated at any speed without affecting any other port, including the interface to the IOP and the host memory controller. Any data bit being routed through the HCA can be accessed by the IOP without affecting any other part of the data stream. Packets can be reformatted on the fly while others are being received and/or forwarded through third ports of the device. This architecture merges the switch's input and output buffers into the same memory bits, while avoiding the need for internal buses, crossbars or multiplexers. By placing address translation/access control tables appropriately within the multi-port memory, the discussed fly-by address translation can be performed here as well. The IOP can be tightly integrated into the HCA architecture and can thus very effectively access any part of the data stream without affecting any of the other ports. In fact, any data access by the IOP is completely asynchronous and independent of any of the other ports.

VII. SUMMARY

A novel intelligent network interface architecture is presented as a concept that avoids the throughput bottleneck of conventional IOP architectures. The first prototype implementation, a symmetric PCI-SCI bridge, demonstrates various advantages of this architecture, such as combined input/output buffers, zero copying of any data, and FlyBy address translation. The architecture also supports effective bridging between radically different network or bus standards. All I/O ports implement queuing functionality that supports multiple outstanding transactions. For further reading refer to [6].

VIII. BIBLIOGRAPHY

[1] Architectural Considerations for CPU and Network Interface Integration, Hot Interconnects 1999
[2] Router Architectures and the True Data Transport Infrastructure, Hot Interconnects 1999
[3] i960rd, Garbus et al, US Patent 5,734,847
[4] I2Osig,
[5] Scalable Coherent Interface, IEEE
[6] Method and apparatus for enabling high performance intelligent I/O using multi port memories, US Patent 7,042,961
