PNoC: a flexible circuit-switched NoC for FPGA-based systems

FIELD PROGRAMMABLE LOGIC AND APPLICATIONS

C. Hilton and B. Nelson

Abstract: Increases in chip density due to Moore's law allow for the implementation of ever larger and more complex systems on a single chip (SoCs). The communication mechanisms employed in such SoCs are an important contributor to their overall performance. Networks on chip (NoCs) promise to overcome the scalability problems found in bus-based interconnect. To date, most work has focused on packet-switched NoCs. Circuit-switched networks are an intriguing alternative, promising high communication rates and predictable communication latencies. A new lightweight circuit-switched architecture called programmable NoC (PNoC) is described. PNoC is a flexible architecture that is suitable for use in FPGA-based systems. Implementation results on a Virtex-II Pro device are given using an image binarisation demonstration, which resulted in as much as a 23× speedup compared with a shared-bus implementation.

1 Introduction

Increases in chip density due to Moore's law allow for the implementation of ever larger systems on a single chip. Known as systems on chip (SoCs), these systems usually contain a mixture of CPUs, memories and custom hardware modules. Such SoCs can also be implemented on FPGA substrates, which we will refer to as programmable SoCs (PSoCs) in this paper. The inter-module communication mechanisms employed on SoCs and PSoCs have recently received significant attention for at least two reasons. First, traditional bus-based communication mechanisms do not scale well and become a bottleneck as system complexity increases. Second, design and verification times for complex systems continue to grow; the desire for efficiency in design and verification methodologies argues for standardised communication mechanisms instead of ad hoc direct module interconnections.
Shared buses such as ARM's AMBA bus [1] and IBM's CoreConnect [2] are commonly used communication mechanisms in SoCs and PSoCs. They support a modular design approach that uses standard interfaces and allows for IP re-use [3], but the bus is often the performance bottleneck in a large system. Both Xilinx and Altera support a hybrid bus/direct-interconnect architecture that allows for direct module-to-module connections in addition to the bus interconnect. Hybrid approaches scale better than purely bus-based schemes, but complicate the design process because they reduce the modularity of the system and require custom hardware design for the module-to-module connections. Another alternative would involve the use of multiple buses or bus segments to alleviate the load on the main bus. This would allow for local communication between modules on the same bus segment without causing congestion on the rest of the bus. The disadvantages of this approach are its reduced flexibility and scalability, and its complication of the design process.

IEE Proc.-Comput. Digit. Tech., Vol. 153, No. 3, May 2006
© The Institution of Engineering and Technology 2006
IEE Proceedings online no doi: /ip-cdt:
Paper first received 31st October 2005 and in revised form 8th February 2006
C. Hilton is with Rincon Research Corporation, 101 N. Wilmot, Suite 101, Tucson, AZ 85711, USA
B. Nelson is with Electrical and Computer Engineering, Brigham Young University, Provo, UT 84602, USA
E-mail: clint.hilton@gmail.com

Various network interconnect approaches have been proposed for SoCs and PSoCs [4-7]. Networks scale better and promise higher communication bandwidth than buses. Like buses, they allow the re-use of standard interface modules for connecting circuit nodes to the network. Network architectures can be divided into two categories, packet-switched and circuit-switched. In a packet-switched approach, the data are broken into packets, each of which contains routing information.
These packets are injected into the network, where they are independently routed to the desired destination. Packet-switched networks often allow for high aggregate system bandwidth, as many packets can be in flight at a given instant. However, they generally require congestion control and packet processing, including buffers to queue packets awaiting the availability of routing resources.

In contrast, with circuit switching, a dedicated connection path (a virtual circuit) between two nodes is established before communication takes place. Once the virtual circuit is established, raw data can be freely transferred with very low overhead between the modules until the virtual circuit is no longer needed, at which time it can be closed. Circuit-switched networks require no overhead for packetisation, packet header processing or packet buffering. Once the virtual circuit is established, accessing data across a circuit-switched connection is no more difficult than accessing a synchronous memory (the requester sends an address and receives the corresponding data in return after a delay of a few clock cycles). As a result, the circuitry required for a circuit-switched network is relatively simple and appropriate for use in even small systems. The flexibility of the proposed approach makes it suitable in a variety of topologies, from rings to meshes to irregular structures. Additionally, very close to the peak bandwidth between modules is easily achieved; in the example presented later, a sustained data transfer rate of 96% of the peak rate was achieved.

Two problems associated with circuit switching have been mentioned in the past as shortcomings. First, setup latency, the time required to build a virtual circuit, must be incurred before any communication between nodes can take place. In the system described here, efforts were made to minimise this circuit establishment latency through the use of simple communication protocols. The second problem involves idle time on communication links, which results when connections have been established but no transfers are taking place. This is not a major concern in our system: opening and closing connections are lightweight enough that there is little motivation for nodes to monopolise communication links by leaving them open for long periods of time.

The vast majority of the proposed network approaches have involved packet-switching architectures. One fairly representative example is the CLICHE architecture [5]. It is a fixed 2D mesh with one routing switch for each compute node, as shown in Fig. 1. Although this architecture is highly scalable, the authors concede that it is unsuitable for certain heavy dataflow applications for performance reasons. As another example, the work of Marescaux et al. [7] is targeted specifically to FPGAs. It is a 2D torus architecture that performs packet switching using wormhole routing. Of particular interest, it uses partial reconfiguration of the FPGA to support run-time dynamic module replacement.

Among the few references to circuit switching in NoC design, Liu and coworkers [8, 9] both provide strong arguments for its advantages over packet switching in NoC-based systems. The architecture proposed by Liu et al.
[8] is a time-division-multiplexed central-switched network (crossbar) shared by all communicating nodes. SoCBUS [9] is a circuit-switched NoC organised as a fixed 2D mesh and includes a routing switch for every compute node. Both of these references perform detailed simulations of circuit-switched NoCs to show their throughput and relative advantages over packet-switched NoCs. The reader is directed to these references for further comparisons of circuit-switched and packet-switched networks. Unfortunately, very few of the previous circuit-switched NoC papers provide any kind of implementation or performance data that could be used for relevant comparison against this work.

In this work, we describe the detailed implementation of a new circuit-switched NoC designed specifically for FPGA-based systems. The flexibility and lightweight nature of this architecture are explained as we proceed to quantify its area and performance benefits. Our main motivation for choosing a circuit-switched network over a packet-switched one is its ability to maintain guaranteed throughput between nodes connected via a virtual circuit. This is in direct contrast to packet-switched techniques, where significant variations in communication latency are often possible. This ability to provide reliable high data rates between the nodes in a system that most need them was critical in this design decision. The topological flexibility of this system, described in the following sections, also distinguishes it from previous work, which has been limited to regular mesh or central crossbar architectures.

Our proposed network, PNoC, is designed with three goals in mind. First, we wanted it to be a flexible networking approach that would be applicable to a wide variety of system requirements. Flexibility was desired for both the allowable network topologies and the communication datapath widths.
In this way, our work differs significantly from that reported by Liu and coworkers [8, 9], which focuses solely on crossbars [8] and meshes [9]. Second, we wanted a network that simplified system design by providing simple, standard network interfaces and easily understood network protocols. Third, we wanted our network to be lightweight, requiring few FPGA resources, and thus suitable for both small and large FPGA-based systems.

Fig. 1 CLICHE architecture

2 PNoC: circuit-switched NoC for use in FPGA-based systems

PNoC was designed to be extremely flexible. At design time, it is possible to easily construct a variety of network architectures, each with its own mix of system routing and computational resources. In addition, the network modules are parameterised for communication path widths, flow control and timeout handling. At runtime, PNoC's flexibility supports the dynamic removal and insertion of nodes in the system, if supported by the FPGA fabric. (PNoC provides support for dynamic module replacement via routing table updates. However, the creation of partial reconfiguration bitstreams is outside the scope of this paper.)

The proposed network topology consists of a series of subnets, each of which contains a router and a collection of network nodes, similar to that shown in Fig. 2. This style of topology was chosen because it allows modules that communicate frequently to be placed in the same subnet, making overall system communication even more efficient. The routers perform the circuit switching between the nodes, and each node connects to a single router through a router port interface.

Fig. 2 Example PNoC topology

A lightweight handshaking mechanism is used to establish dedicated connections between nodes, to exchange data and to remove connections. The signals required in this circuit-switched communication are described in Table 1. The naming convention is from the node's perspective. That is, signals with a direction of in are inputs to the node (and thus are outputs from the router).

Table 1: PNoC communication signals

Signal         Direction  Description
request        out        initiates either a router update request or a connection request
release        out        initiates a connection release
grant          in         indicates to the network node that its connection request has been granted
sl_grant       in         indicates that a connection has been established with this node as a slave
pend           in         indicates that another node is requesting access to this node's current destination port
rx_data[x:0]   in         rx data bus of parameterisable width
rx_addr[y:0]   in         rx address bus of parameterisable width
rx_rnw         in         rx read-not-write signal
rx_valid       in         indicates valid rx data, address and rnw signals
rx_cts         in         rx clear-to-send signal
tx_data[x:0]   out        tx data bus of parameterisable width
tx_addr[y:0]   out        tx address bus of parameterisable width
tx_rnw         out        tx read-not-write signal
tx_valid       out        indicates valid tx data, address and rnw signals
tx_cts         out        tx clear-to-send signal

The single-bit control signals in the table deal with router table updates, requests to create or destroy virtual circuits and read/write requests. In addition to these signals, each module has a set of receive (rx) and transmit (tx) signals consisting of at least address and data lines. The transmit address lines are interesting in that they serve multiple purposes. When creating virtual circuits, they specify the ID (or address) of the module to which a virtual connection is desired. Once a virtual connection has been established and read and write transactions begin, they specify the address in the remote module's address space to which the transaction refers, contributing to the flexibility of the system. Finally, the cts signals are used for flow control, as will be described subsequently.

2.1 Router

The router is the core of this network communication architecture. The major components that make up the PNoC router are shown in the block diagram of Fig. 3.
Fig. 3 Router block diagram

The function of each of these components is described as follows.

Table arbiter. The table arbiter receives connection requests and schedules access to the routing table when multiple requests are received on the same clock cycle. This block is also responsible for managing the routing table update requests.

Routing table. The routing table maps network module addresses to ports that may be used to establish connections between modules. The node addresses serve as the index to the table, and the data stored at that index represent the port(s) that may be used to establish the connection path.

Port queue. This queue maintains the connection request order while the requests await availability of the target port(s).

Port arbiter. Once the target port(s) becomes available, the port arbiter establishes the desired connection and issues the appropriate grant signals. This block also monitors the release signals for the removal of connections.

Switchbox. The switchbox forms the actual connections between modules by enabling tri-state buffers that allow the rx signals to drive the appropriate tx signals.

The actual routing of the data is done through the router's switchbox. The switchbox is structured such that any given rx line can be connected to any of the available tx lines. As the work presented here targets a Xilinx Virtex-II Pro device, this switchbox was implemented with the available internal tri-state drivers. A similar mux-based implementation could be used for devices that do not provide the same tri-state capabilities. Switchboxes implemented as crossbars can be an expensive form of communication, growing in complexity as N^2 for an N-port switch. An advantage of PNoC over other architectures is that its flexibility lends itself to the use of multiple smaller routers distributed through the system, rather than a central crossbar as was used in Liu et al. [8]. The result is a smaller and faster implementation.
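As an illustration only, the interplay of the routing table, port queue and port arbiter described above can be sketched as a small behavioural model. This is not the authors' hardware: the class name `Router`, its methods and the policy that a queued request waits only for its own target ports are assumptions made for the sketch.

```python
from collections import deque


class Router:
    """Behavioural sketch of a circuit-switched router (hypothetical model)."""

    def __init__(self, routing_table):
        # Routing table: node address -> port(s) that can reach that address.
        self.table = {addr: list(ports) for addr, ports in routing_table.items()}
        self.busy = set()      # ports currently part of a circuit
        self.queue = deque()   # pending (source port, target address) requests
        self.circuits = {}     # source port -> granted destination port

    def request(self, src_port, dst_addr):
        """Queue a connection request; return the granted port or None."""
        self.queue.append((src_port, dst_addr))
        self._arbitrate()
        return self.circuits.get(src_port)

    def release(self, src_port):
        """Tear down the circuit owned by src_port and serve waiting requests."""
        dst_port = self.circuits.pop(src_port)
        self.busy.discard(src_port)
        self.busy.discard(dst_port)
        self._arbitrate()

    def _arbitrate(self):
        # Serve requests in arrival order; requests whose target ports are
        # all busy stay queued (assumed policy, not taken from the paper).
        waiting = deque()
        while self.queue:
            src, addr = self.queue.popleft()
            free = [p for p in self.table.get(addr, ()) if p not in self.busy]
            if free:
                self.circuits[src] = free[0]
                self.busy.update((src, free[0]))
            else:
                waiting.append((src, addr))
        self.queue = waiting


router = Router({5: [2, 3], 7: [4]})
first = router.request(0, 5)    # granted port 2
second = router.request(1, 5)   # granted port 3
third = router.request(6, 5)    # both candidate ports busy, so queued
router.release(0)               # frees port 2, which goes to the queued request
```

In this model a release immediately re-runs arbitration, which mirrors the description of the port arbiter monitoring the release signals and granting queued requests once a target port becomes free.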
PNoC allows the inclusion of multiple nodes with the same network address in a system. The router assumes that all nodes with address k are interchangeable and will use the first available such node to satisfy a connection request. This makes it possible to easily alter the mix of modules in processor-farm kinds of designs without the individual nodes being aware of the exact mix. This capability is exploited in the demonstration system described later in the paper.

Routing table updates: The network infrastructure has been designed to support dynamic module replacement via partial reconfiguration. If a node is removed from the system during execution, using partial reconfiguration, its local router should be notified via a router update command, which will remove that module from the system's routing tables. When a new module is added to the system, an update command should be sent to its router to add it to the system routing tables. Router update requests are implemented as connection requests addressed to the router itself. All routers are configured with an address of 0x00. A router update request occurs when an updating node raises its request line while its tx_addr is set to 0x00 and its network address is held on the tx_data lines. As in the connection process, the router raises the grant signal to indicate that the table update is complete.

Router flexibility: An important design goal is to produce a flexible NoC. The network topology can be altered by changing how routers are interconnected in the system. Fig. 4 shows four different systems that use various router configurations. The interconnectivity between the modules and routers is flexible and is defined by the system designer so that the network can meet specific system needs. Each router is parameterisable in the number of ports it contains and the width of the data and address lines contained in each port connection. Many routers could be used in a single system with a custom topology that best meets the system's demands. The focus of this work is on the creation of NoC building blocks and a framework to use those building blocks, rather than on enumerating and comparing the possible topologies and their respective trade-offs. As shown in the figure, the mix of compute nodes to router nodes can be varied. In addition, multiple links between routers can be used where high traffic is expected, to increase system performance. Finally, although all the routers shown in the figure have eight ports each, this value is parameterisable at build time.

2.2 PNoC module interface

One of the goals of this work is to facilitate the design of complex systems through modular design using a simple interface to the communication medium. Modules, or nodes, that connect to the network do so via a well-defined port interface that contains multi-bit transmit and receive data and address lines along with handshaking control signals (Table 1). Fig. 5 shows the hardware needed to effectively integrate a module with the network.
On the left is the node circuitry itself and on the right is the network interface circuitry, which consists of optional transmit and receive FIFOs and associated cts signals (not shown), and a simple FSM to communicate with the router.

Fig. 5 Node interface hardware

A CPU can be readily connected to the PNoC like any other node. A special CPU interface module decodes the CPU's memory accesses to identify and initiate accesses to the memory-mapped network infrastructure. Fig. 6 shows how a CPU interfaces to the network with the memory-to-network address translation and a standard PNoC module interface. The implementation used in this work is a Xilinx MicroBlaze CPU combined with a custom memory-mapped network interface, similar to that shown in the figure.

Inter-node data flow control: All routers in the system operate using a common, synchronised clock rate. Each node, however, may operate at its own clock rate. FIFOs are used between nodes and routers to buffer data and to cross between the node's clock domain and the router's clock domain. Status signals from the FIFOs are provided as part of the node connection to serve as end-to-end flow control signals. The inclusion of transmit and/or receive FIFOs in the node interface is a parameterisable feature of the node interface design and thus is optional. However, these FIFOs are strictly necessary in two cases: (1) the node runs at a clock rate different from that of its subnet router or (2) the node, when acting as a slave, is unable to keep up with the data transmission rate of potential masters. In case (1), both transmit and receive FIFOs are needed to cross between the node's clock domain and the router's clock domain. In case (2), the slower consuming slave node must use a receive FIFO. The almost full status flag on the receiving FIFO is used for flow control purposes so that data already in flight can safely reach the target node.
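To see why an almost full flag rather than a full flag is the natural trigger, the following minimal sketch counts how far a receive FIFO keeps filling after the flag is raised. It is a simplified cycle model under stated assumptions, not the PNoC implementation: the function name `peak_occupancy`, the symmetric `link_delay` (cycles for data to reach the receiver and for the clear-to-send level to travel back) and the sender that transmits one word on every permitted cycle are all hypothetical.

```python
from collections import deque


def peak_occupancy(almost_full_at, link_delay, cycles, drain_period=0):
    """Return the highest receive-FIFO occupancy seen during the run.

    almost_full_at: occupancy at which clear-to-send is de-asserted.
    link_delay:     cycles (>= 1) each way between sender and receiver.
    drain_period:   receiver removes one word every drain_period cycles;
                    0 means the receiver never drains (worst case).
    """
    fifo = 0
    peak = 0
    data_wire = deque([0] * link_delay)  # words travelling to the receiver
    cts_wire = deque([1] * link_delay)   # clear-to-send travelling back
    for t in range(cycles):
        fifo += data_wire.popleft()                  # word arrives, if any
        if drain_period and t % drain_period == 0 and fifo > 0:
            fifo -= 1                                # receiver consumes a word
        peak = max(peak, fifo)
        cts_seen = cts_wire.popleft()                # what the sender sees now
        cts_wire.append(1 if fifo < almost_full_at else 0)
        data_wire.append(1 if cts_seen else 0)       # send only when allowed
    return peak


# Worst case (receiver never drains), one cycle of delay each way.
early_flag = peak_occupancy(almost_full_at=12, link_delay=1, cycles=100)
late_flag = peak_occupancy(almost_full_at=16, link_delay=1, cycles=100)
```

In this model the FIFO overshoots the trigger level by `2 * link_delay - 1` words, so a 16-entry FIFO that raises its flag at 12 entries peaks at 13, whereas waiting until all 16 entries are used would require a 17th.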
Flow control is dictated by the use of the cts signals. When the receiving node detects almost full at its receive FIFO, its tx_cts signal is lowered and is propagated to the transmitting node's rx_cts signal. At this point, the transmitting node must stall until its rx_cts is again raised, indicating that the receiving node is capable of accepting more data.

Fig. 4 Four different network topologies

Fig. 6 CPU interface

Connection establishment and data transfer: To establish a connection, the requesting node (the master, Node A in Fig. 2) asserts its request signal to the router while specifying the desired target node address on its tx_addr lines. The router determines which port is associated with the desired target node (the slave, Node B in the figure) by consulting its routing tables. In this example, the connection request is forwarded on to a second router, as Node B resides in its subnet. The second router then processes the request and determines whether the target node is available. Once it becomes available, the second router informs the first router, which informs the master via a grant signal, and the connection is established. The slave node is also informed of the establishment of a connection via the sl_grant signal. The master and slave are then free to transfer data as desired.

Data transfers between nodes are done using a simple interface. A write-followed-by-read sequence is shown in Fig. 7. The signals in the top half of the figure are those seen by the master module and the signals in the lower half are those seen by the slave. In cycle 1, the write operation begins when the master's tx_valid = 1 and tx_rnw = 0. As the data transfer occurs on a dedicated connection path, there is no need for an acknowledge signal, nor is there a need to specify which node the write operation is directed to. In this write transaction, there is only a single cycle of delay through the router (no FIFOs are present). Thus, the write is initiated in cycle 1 and completes at the slave node at the end of cycle 2. The initiation of a read request is shown in cycle 2, where the master's tx_valid = 1 and tx_rnw = 1. The slave node receives these signals on its rx_valid and rx_rnw lines in cycle 3, and responds accordingly by placing the requested data on its tx_data lines and raising its tx_valid (cycle 4). The master captures the returned data on its rx_data lines when it sees its rx_valid = 1 in cycle 5. The figure also shows a second read request in cycle 3, illustrating that write and read requests can be pipelined.
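The cycle numbers in this walk-through follow from two delays, which the short sketch below makes explicit. The constants are read off the description of Fig. 7 (one router register in each direction, and a slave that answers one cycle after seeing the request); the function names are hypothetical and the slave response time would differ for other slave designs.

```python
ROUTER_DELAY = 1      # one pipeline register in the switchbox, per direction
SLAVE_TURNAROUND = 1  # cycles the slave in the walk-through takes to respond


def write_arrival(issue_cycle):
    """Cycle in which a write issued by the master is seen by the slave."""
    return issue_cycle + ROUTER_DELAY


def read_return(issue_cycle):
    """Cycle in which the master captures the data for a read it issued."""
    slave_sees = issue_cycle + ROUTER_DELAY
    slave_responds = slave_sees + SLAVE_TURNAROUND
    return slave_responds + ROUTER_DELAY


def last_read_return(first_issue_cycle, count):
    """Capture cycle of the last of `count` back-to-back pipelined reads."""
    return read_return(first_issue_cycle + count - 1)


write_done = write_arrival(1)    # write issued in cycle 1
first_read = read_return(2)      # read issued in cycle 2
second_read = read_return(3)     # pipelined read issued in cycle 3
```

Because a new request can be issued every cycle, the fixed three-cycle read latency is paid only once per burst: in this model eight back-to-back reads starting in cycle 2 return their last word in cycle 12.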
As long as the slave's cts (not shown) is held high, the master can initiate a new data transaction on every clock cycle. In the event that the slave cannot keep up, it can de-assert its tx_cts, thereby causing the master to temporarily suspend data transfers. This example shows the timing in the absence of FIFOs in the module interfaces. The inclusion of such FIFOs does not change the timing seen in the figure. However, the tx_cts signal is then automatically de-asserted by the node interface logic in response to FIFO almost full conditions rather than being manually de-asserted by the slave node itself.

As can be seen, transactions between master and slave nodes are similar to those for interfacing to pipelined memories: requests are sent and a number of cycles later the data are returned with accompanying valid signals. The router is not involved in read and write transactions except that (a) it provides the signal switching fabric so that master and slave can communicate and (b) it provides one pipeline register in the switching fabric to improve the throughput (clock rate) of the pipelined data transfers.

The master can remove a connection to another block by asserting the release signal, which informs the router that it no longer desires the connection. Additionally, a pend signal is supplied to the master to tell it when another node wants access to the slave node. The master may, at its discretion, close its connection early in response. This behaviour is not mandated by the network, but the router functionality is provided to support it. Once the master node releases a connection, both it and the affected slave node become available for use in other connections.

3 Implementation results

The PNoC building blocks described earlier have been implemented on a Xilinx Virtex-II Pro FPGA (xc2vp30-7). Design entry was done with JHDL [10].
The resulting NoC building block modules (the router, the node interface and the CPU-node interface) are parameterised as described in the previous sections. Table 2 gives the area and speed results for a variety of router instances with differing numbers of ports and differing port data widths. In each case, the routing table is implemented using a single BlockRAM, which is not reflected in the table. In addition to the router, a complete system requires node interface circuitry. This node interface hardware (containing the FIFOs from Fig. 5) requires 155 slices and two BlockRAMs. In cases where the FIFOs are not required, the area is reduced to 62 slices. The MicroBlaze CPU node interface circuitry, including the memory-mapped network interface module, requires 196 slices and two BlockRAMs.

Fig. 7 Master node write followed by read

Table 2: Router implementation results
Columns: Number of ports, Data width, Area (slices), Speed (MHz)

4 Test application

The utility of the PNoC is shown here using a simple image binarisation example. This algorithm uses hierarchical thresholding to quantise greyscale image pixels to binary black and white values. The computation involves computing median values at three different levels of hierarchy to be used as quantisation threshold values. The algorithm involves the following steps:

1. Compute the median value for the entire image and use that to compute the global threshold value, where global_thresh = median + median/4.
2. For each block of data in the image, determine its darkest pixel value and compare that against the global threshold. If it is lower (darker) than the threshold, then that block presumably contains valid data (it is called a valid block), and is processed further in step 3. Otherwise, the entire block of data is set to a white value.
3. Each valid block is divided into smaller windows. Those windows are compared with a block threshold value and those which require additional processing are subjected to window processing in step 4. Otherwise, the window pixels are set to white.
4. Each pixel in a valid window is compared against the computed window threshold and set to black or white accordingly.
5. Steps 2-4 are repeated until every block has been processed and the complete quantised image has been produced and collected back to the CPU.

This application, targeted to a Virtex-II Pro FPGA (xc2vp30-ff860-7), is illustrated in Fig. 8 and consists of four module types. The MicroBlaze processor is the primary control for the system; it computes the global threshold value for the image and manages the distribution of the image blocks to the block modules. The universal asynchronous receiver-transmitter (UART) enables the uploading and downloading of the original and final images between the FPGA and a host computer.
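Steps 1 and 2 of the algorithm are specified fully enough to be sketched in software. The following is a minimal software illustration, not the hardware block module: it assumes 8-bit greyscale values where smaller means darker, square blocks whose size divides the image dimensions, and hypothetical function names. The block-level and window-level thresholds of steps 3 and 4 are not defined in the text above and are therefore left out.

```python
from statistics import median


def global_threshold(image):
    """Step 1: global_thresh = median + median/4 over the whole image."""
    m = median(pixel for row in image for pixel in row)
    return m + m / 4


def valid_blocks(image, block):
    """Step 2: mark each block x block tile as valid if its darkest pixel
    is darker (numerically lower) than the global threshold.

    Returns the threshold and a dict mapping the tile's top-left
    (row, column) to True (valid) or False (to be set to white).
    Assumes the image dimensions are multiples of `block`.
    """
    thresh = global_threshold(image)
    result = {}
    for r in range(0, len(image), block):
        for c in range(0, len(image[0]), block):
            tile = [image[y][x]
                    for y in range(r, r + block)
                    for x in range(c, c + block)]
            result[(r, c)] = min(tile) < thresh
    return thresh, result


image = [
    [100, 100, 200, 200],
    [100,  50, 200, 200],
]
thresh, blocks = valid_blocks(image, block=2)
```

For this small example the median is 150, so the global threshold is 187.5; the left tile (darkest pixel 50) is valid and would be passed on for window processing, while the right tile (darkest pixel 200) would be set to white.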
The block modules compute a block-level threshold value and, if valid data are detected within the block, divide the block into windows and send these to the window modules. The window modules quantise each pixel of a window on the basis of the window's threshold value. This binarisation application was implemented both with PNoC and with two different bus-based approaches.

Fig. 8 Binarisation top-level modules

The main design challenge in this system is in coordinating the transfer of image data between the different nodes. The major communications are between the CPU and block processors, and between the block and window processors. The numbers of block modules and window modules differ because of the projected need for each kind of processing in the overall computation; however, the exact utilisation of the window modules is unpredictable, as it is completely data dependent. Because of the parallel processing, and hence parallel communication, involved in this system, the PNoC implementation should noticeably outperform the bus implementations because of its ability to support multiple simultaneous data transfers.

4.1 Shared-bus implementation

Two different bus-based implementations were completed using Xilinx EDK version 6.3. Each contains a MicroBlaze processor and an on-chip peripheral bus (OPB) running at 100 MHz. The first implementation uses simple reads and writes to transfer data on the bus. This has the advantage of allowing other modules bus access during the computation, but results in a slower implementation. The second implementation allows the block modules to lock the bus and burst the window data transfers. This results in a faster system but prevents other modules from using the bus during those transfers (essentially during the entire computation). In a bus-based system there is no built-in way of arbitrating or scheduling access to the window modules without designing custom arbitration into the modules themselves.
In these implementations, each window module was designed to satisfy requests from two statically chosen block modules. Similarly, there is no built-in way of scheduling the bus other than relying on bus arbitration for concurrent requests. Manually time-multiplexing the bus and manually scheduling access to the window modules to improve performance without locking the bus for extended periods of time would greatly complicate the design task. Therefore, the second bus implementation uses the built-in bus locking mechanism instead of a custom bus scheduling scheme.

4.2 Network implementation

The PNoC is well suited to this type of system. Multiple block-to-window module data transfers can occur simultaneously, as multiple connections can be active at a given instant. Also, because of the unpredictable nature of window module utilisation, the dynamic routing capability of this network plays an important role in this system. When a block module requires the services of a window module, as all window modules are configured with a common network address, the connection can be established with whichever window module becomes available first. No additional hardware is required of the system designer to poll for available window modules; the choice of which window module to use is made by the router. Further, if no window module is available, the router will queue up connection requests in order until a window module is available. This allows for considerable flexibility in the system: additional block and window modules can be added to an existing design and recompiled for execution without any changes being required of the block and window modules.

4.3 System comparisons

Table 3 compares the implementations. About 1150 slices of the network design were for the eight-port router. Both designs were downloaded to the Xilinx XUP Virtex-II Pro Development Board. Times were recorded in such a way as to remove software overhead on the MicroBlaze from the computation time, so as to compare just networking overhead. Blocks of data were first loaded into the four block modules and then the computation/communication time of the block and window modules was measured using a hardware timer. The experiments were set up to show maximum data transfer capability (all four block modules were competing for the services of the two window modules).

Our original shared-bus design used individual bus reads and writes for the data transfer, resulting in a 23× performance advantage for the network version. Using individual transfers has the advantage of not shutting down the system bus while the computation proceeds (other transactions between the CPU and other modules can compete for bus cycles and thus make progress during the computation). Our second bus design improved performance by allowing the block modules to completely lock the shared bus while they perform burst transfers. By doing so, the performance difference was reduced to 2×. In this second version, however, the bus was completely locked for entire window computations (because of the streaming nature of the window computations, the bus had to be locked during the entire computation). This prevented any other activity between the CPU and other system modules from occurring for long periods of time.

To be fair, other methods could be used to make a bus-based approach workable for this problem instance. Bus bridges to isolate the CPU from much of the window module traffic could be employed.
While freeing up the CPU to do other things during the window computations, this would still limit the system to physically performing one window computation at a time, as all block and window modules would share a single bus. Attempts to further partition this shared bus into multiple buses, each servicing a subset of the block and window modules, defeat the design goal of providing a flexible platform that allows a uniform pool of window modules to serve as a processor farm for a uniform collection of block modules. In short, bus-based approaches have limited scalability compared with network-based approaches.

In contrast, the PNoC version allows the CPU to communicate with other system modules (such as the UART) during the computation. It also allows multiple transfers between block modules and window modules to take place concurrently, making it possible to achieve a significant fraction of the maximum available computational power present in the design. For example, the computation required the use of two window modules. At 124 MHz, each window module could conceivably maintain a 124 MB/s transfer rate with its associated block module. Each achieved, on average, 119 MB/s, or 96% of the maximum bandwidth. Similarly, the utilisation of each window module was 96% over the course of the computation.

The PNoC architecture would demonstrate similar performance advantages for any system that requires concurrent data transfers. The ability of the PNoC architecture to let multiple modules communicate simultaneously in a flexible way is its major advantage over a bus-based implementation. Additionally, the clock rate of the network implementation was 27% higher than that of the bus-based implementations. This is consistent with results we have seen for other applications we have completed, and is due to the shorter and less heavily loaded wires in the PNoC architecture compared with a shared-bus architecture.
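The connection handling described in Section 4.2 (one shared address for all window modules, first-available selection, and in-order queueing of requests when none is free) can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the class `Router`, its methods `request` and `release`, and the module names `B0` to `B3` and `W0`, `W1` are hypothetical names chosen for illustration. It models behaviour at the level of whole connections and says nothing about the actual PNoC hardware or its signalling.

```python
from collections import deque


class Router:
    """Toy model of a router serving a pool of modules that share one address.

    Hypothetical illustration only; not the PNoC hardware implementation.
    """

    def __init__(self, window_ids):
        self.free = deque(window_ids)  # idle window modules
        self.pending = deque()         # waiting block modules, in arrival order
        self.active = {}               # block id -> connected window id

    def request(self, block_id):
        """Ask for any window module.

        Returns the window id if a connection is made at once,
        or None if the request had to be queued.
        """
        if self.free:
            window = self.free.popleft()
            self.active[block_id] = window
            return window
        self.pending.append(block_id)
        return None

    def release(self, block_id):
        """Tear down a connection.

        The freed window is handed to the oldest queued request, if any,
        and (block id, window id) for the new connection is returned.
        Otherwise the window becomes idle and None is returned.
        """
        window = self.active.pop(block_id)
        if self.pending:
            next_block = self.pending.popleft()
            self.active[next_block] = window
            return next_block, window
        self.free.append(window)
        return None


if __name__ == "__main__":
    # Four block modules compete for two window modules, as in the experiment.
    router = Router(["W0", "W1"])
    for block in ["B0", "B1", "B2", "B3"]:
        print(block, "->", router.request(block))
    print("release B0:", router.release("B0"))
```

In this model the block modules never name a particular window module, which mirrors the point made above: modules can be added to the pool without changing the requesters.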
4.4 Network architecture comparison

Unfortunately, few authors have published implementation results for their proposed FPGA NoC architectures. As a result, it is difficult to perform a direct comparison with other network approaches. One that has been published, however, and that appears to be representative of other packet-switched approaches, is presented by Bartic et al. [11]. Their system, a 2D mesh similar in topology to that shown in Fig. 1, consisted of eight network modules and was targeted to a Xilinx Virtex-II Pro device. The communication datapaths in their system were 16 bits wide. It was assumed that each of the eight routers required a single BlockRAM for its output buffers. An equivalent PNoC architecture consists of a single eight-port router and associated compute node interfaces. It was also targeted to a Xilinx Virtex-II Pro device. Table 4 compares the two implementations in terms of area and clock rate. The simplicity of the circuit-switched PNoC architecture not only reduces the hardware costs by over 2×, it also increases the clock rate in this example by almost 3×, resulting in an area-time improvement of over 5×.

Table 4: Comparison to packet-switched network of Bartic et al. [11]. Columns: network architecture, slices, BRAMs, clock rate (MHz). Rows: packet-switched, PNoC.

Table 3: Binarisation system comparison. Columns: parameter, shared bus, locked bus, PNoC. Rows: Microblaze, UART, block module, window module, communication, total slices, max clock rate (MHz), cycle count.

5 Summary and future work

In this paper, we have proposed and demonstrated a flexible, lightweight circuit-switched approach to constructing FPGA-based systems. It provides the ease of design (using standard interfaces) of a bus-based approach while providing performance that approaches that of direct interconnect. We believe it is flexible enough for use in general embedded systems and offers high enough performance for many high-throughput data flow applications.
This first experiment has quantified the implementation cost of the basic PNoC modules on an FPGA and demonstrated their utility in a real application, at the same time showing the ease of design using PNoC as well as its potential performance.

A number of directions for continued work remain, as several open questions have surfaced throughout this work. First, we want to explore the use of multiple routers and subnets in a system. Specifically, it is important to understand the kinds of topologies useful for various patterns of computing and communication, to identify the most commonly occurring patterns and to quantify the advantages of PNoC for such patterns. Further, we want to investigate the applicable solution space for circuit-switched systems such as PNoC. Our expectation is that PNoC's topological flexibility will be an important characteristic, allowing for high delivered performance across a wider range of applications than the standard crossbars and meshes used in the previously cited circuit-switched NoC work.

Unfortunately, at present, there are no readily accessible packet-switched systems with which we can make meaningful comparisons. As they become available, the important work will be to perform detailed comparisons to understand the trade-offs between them and PNoC. Where are circuit-switched networks a better choice than packet-switched networks and buses? How do they compare for power consumption, clock rate, system throughput and ease of design and debug?

Finally, an important goal in the creation of PNoC was to support dynamic module replacement via partial reconfiguration. As mentioned, PNoC provides support for adding modules to and deleting modules from a running system, provided the target FPGA fabric supports some form of runtime reconfiguration. We wish to investigate the use of this capability in real applications where requirements change over time and necessitate changing the mix of modules in an embedded computing system.
6 References

1 ARM: AMBA specification, Technical report, ARM, Revision 2.0
2 Coreconnect: Coreconnect bus architecture, Technical report, IBM Corporation
3 Salminen, E., Lahtinen, V., Kuusilinna, K., and Hamalainen, T.: Overview of bus-based system-on-chip interconnections. Proc. IEEE Int. Symp. on Circuits and Systems, May 2002, vol. 2, pp. II372-II375
4 Dally, W.J., and Towles, B.: Route packets, not wires: On-chip interconnection networks. Proc. Design Automation Conf., DAC '01, June 2001
5 Kumar, S., and Jantsch, A.: A network on chip architecture and design methodology. Proc. IEEE Computer Society Annual Symp. on VLSI, ISVLSI '02, April 2002
6 Grecu, C., Pande, P.P., Ivanov, A., and Saleh, R.: A scalable communication-centric SoC interconnect architecture. Proc. 5th Int. Symp. on Quality Electronic Design, 2004
7 Marescaux, T., Bartic, A., Verkest, D., Vernalde, S., and Lauwereins, R.: Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs. Proc. 12th Int. Conf. on Field-Programmable Logic and Applications, FPL '02, September 2002
8 Liu, J., Zheng, L.-R., and Tenhunen, H.: A circuit-switched network architecture for network-on-chip. Proc. Int. Symp. on System-on-Chip, September 2004
9 Wiklund, D., and Liu, D.: SoCBUS: switched network on chip for hard real time embedded systems. Proc. Int. Parallel and Distributed Processing Symp., April
10 Hutchings, B., Bellows, P., Hawkins, J., Hemmert, S., Nelson, B., and Rytting, M.: A CAD suite for high-performance FPGA design, in Pocek, K.L., and Arnold, J.M. (Eds.): Proc. IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, USA, April 1999 (IEEE Computer Society)
11 Bartic, A., Mignolet, J.Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., and Lauwereins, R.: Highly scalable network on chip for reconfigurable systems. Proc. Int. Symp. on System-on-Chip, November 2003

IEE Proc.-Comput. Digit. Tech., Vol. 153, No. 3, May 2006
