Computers and Electrical Engineering

Computers and Electrical Engineering 40 (2014) 1838–1857. Contents lists available at ScienceDirect.


Design and verification of an efficient WISHBONE-based network interface for network on chip

K. Swaminathan a,b, G. Lakshminarayanan a, Seok-Bum Ko b

a Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, India
b Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada

Article history: Received 11 November 2013; received in revised form 13 May 2014; accepted 14 May 2014; available online 7 June 2014.

Abstract

In this paper, a generic asynchronous First In First Out (FIFO) based, WISHBONE compatible, plug and play Network Interface (NI) for Network on Chip (NoC) is designed and verified. Four types of encoded asynchronous FIFOs, namely binary, Gray, one-hot and Johnson, are designed and analyzed. The Gray-code asynchronous FIFO is found to be the best at handling the asynchronous clock domain issues in the NI. The control signals of the WISHBONE bus wrappers to and from the asynchronous FIFOs and the packing/unpacking modules are asserted concurrently at the same rising edge of the respective router and IP clocks to reduce latency. The same NI is used to transfer data between synchronous as well as asynchronous clock domains, irrespective of clock frequency and phase differences. The proposed NI provides seamless data transfer between the routers and IP cores with minimal latency, higher throughput and higher speed, and it uses less area than the existing design.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The increasing number of reusable Intellectual Property (IP) cores, and more than a billion transistors in a Multi-Processor System on Chip (MPSoC) design in the nano-electronic integrated circuit era, bring ever-increasing design and testing challenges [1–3].
The interconnection delays in bus-based communication are increasing rapidly compared to gate delays. As the number of IP cores increases, this results in performance degradation and synchronization problems between the IPs in an MPSoC [2,3]. NoC architectures have been suggested as a promising solution for a highly scalable, reliable and modular on-chip communication infrastructure [2,3]. NoC design represents a new paradigm for designing MPSoCs, one that shifts the design methodology from computation-based to communication-based [3]. The NoC architecture uses layered protocols and packet-switched networks that consist of on-chip routers, links and Network Interfaces (NIs) on a predefined topology. Developing a complete application-specific NoC for an MPSoC is a challenging process: it requires the definition of a suitable network topology, protocols and crossbar switches, and it demands adequate design flows to minimize design time, effort and cost.

Footnote: Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. Saraju Mohanty. Corresponding author: S.-B. Ko, e-mail address seokbum.ko@usask.ca.

Interfacing IP cores of different data widths and frequencies to a NoC is a critical task because of its asynchronous nature. Connecting different IP cores to a NoC router through an NI is complex for the same reasons: the clock domains are asynchronous, the data widths differ, and packets must be assembled and disassembled. Therefore, it is essential to develop a plug and play generic NI that handles Clock Domain Crossing (CDC) issues in order to pass data at a high rate between different clock domains without loss [4]. In this paper, a new NI micro-architecture for NoC is implemented using the OpenCores WISHBONE bus [5]; it enables short design time and offers seamless, high-throughput data flow. The key contributions of the proposed work are:

- The proposed NI works as a dual-purpose NI that can interface synchronous as well as asynchronous IPs with routers, irrespective of clock frequency and phase differences between the modules. The data width and FIFO size can be modified as per the application requirement.
- The micro-architectural-level merging of the bus wrappers with the respective packing/unpacking modules and the asynchronous FIFOs yields a latency-free bus wrapper, which achieves high-speed data transactions on the NI between the various processing IP cores and the NoC router.
- Different encoded asynchronous FIFO schemes, namely binary [6], Gray [6], Johnson [7] and one-hot [8], are designed and analyzed. The proposed NI design uses the best of the four, the Gray-encoded FIFO.
- A low-latency packing and unpacking unit offers efficient assembling, by inserting packet fields such as routing information and payload details, and disassembling, by extracting the packet fields at a fast rate. The optimum latency of the entire NI is two clock cycles, limited by the packing, unpacking and asynchronous FIFO modules.
- The proposed NI is verified using a coverage-driven, constrained-random verification environment [9].

The data transactions from the router-end bus wrapper to an asynchronous FIFO (transmit FIFO) and from the IP-end bus wrapper to an asynchronous FIFO (receive FIFO) are performed without any latency, owing to the micro-architecture-level merging mechanism.
The read operation between the respective FIFOs and the bus wrappers, via the packing and unpacking modules, is performed with a latency of one clock cycle. This is achieved by concurrently sampling the data and control signals of the submodules that belong to the same clock domain at the same edges of the respective clocks. The proposed generic NI has better efficiency, higher throughput and lower latency. It offers a simple and flexible connection mechanism: a single processing core can be connected to the router directly, and multiple processing cores with memories and peripherals can be connected to the routers through other standard commercial System on Chip (SoC) buses. The connection is established irrespective of the different frequencies and phases among them.

This paper is organized as follows. Section 2 presents a literature survey of NIs for NoC. Section 3 gives an overview of the NI for NoC and its essential requirements, and explains the NoC packet format used in the proposed design and the metastability issues that arise when connecting subsystems in different clock domains. Section 4 describes the salient features of asynchronous FIFOs that use different encoding schemes. Section 5 describes the features of the WISHBONE bus, including its read and write operations, and explains the implementation of the WISHBONE compatible, asynchronous FIFO based NI architecture for NoC. Section 6 explains the constrained-random, BFM-based verification environment used to verify the proposed NI. Section 7 compares the performance of WISHBONE compatible NIs that use different encoded asynchronous FIFOs. Finally, Section 8 concludes the paper.

2. Related work

Several studies have explored NI implementations for NoC in order to overcome the asynchronous clock domain problem and to standardize the NI, with the aim of improving speed and throughput [4,10–19].
In general, the proposed implementations are based on Direct Memory Access (DMA) with an asynchronous FIFO [4,10], Globally Asynchronous Locally Synchronous (GALS) design [11–13], the Advanced Microcontroller Bus Architecture (AMBA) [14] and the AMBA Advanced eXtensible Interface (AXI) bus [15], the Open Core Protocol (OCP) [16–18] and asynchronous FIFOs [19].

In [4], the authors proposed a simple, generic, programmable NI architecture that offers rapid plug and play interfacing of IPs to routers with minimal performance overhead. The packet maker (PM) and packet disassembler (PD) units of the NI handle header parsing, payload correction and routing path determination. Apart from the asynchronous FIFO, this implementation uses extra memories, namely the PM memory and the PD memory. In [10], the authors proposed the Network Processor Array (NePA) platform, which uses DMA-based generic master-core and slave-core NIs with buffered and unbuffered modes. When working across different clock domains, simultaneous read and write operations on the same FIFO/buffer cannot take place without sufficient delay; only a read or a write can be performed at a time. The extra memory usage and the complex controller design result in higher latency and area overhead than the proposed design.

In [12], the authors proposed synchronous/asynchronous dual-mode on-chip and off-chip interfaces that use a Gray-encoded-FIFO-based GALS NoC architecture to resynchronize between the synchronous and asynchronous NoC. The off-chip/on-chip NoC interface uses a mixed synchronous/asynchronous dual-mode NoC port composed of two distinct interfaces, asynchronous-to-synchronous (A-to-S) and synchronous-to-asynchronous (S-to-A). Each S-to-A and A-to-S virtual channel (VC0 and VC1) is made up of two Gray-encoded FIFOs and uses a bundled-data handshake protocol, which results in an area overhead of two extra Gray-encoded FIFOs.
Later, the same authors proposed another asynchronous FIFO solution, claiming that Gray code has several limitations: it is complex to implement, it encodes only powers of two, it has problems in pointer increment, and it needs extra logic blocks to convert binary to Gray [13]. As the new solution, a Johnson-encoded FIFO was suggested instead of the Gray-encoded FIFO for an area-efficient design. This GALS-based, delay-insensitive, 4-phase protocol network adaptor achieves higher throughput and provides Dynamic Voltage Frequency Scaling (DVFS) capabilities through a local programmable clock generator; the GALS adapter is used as a hard macro for better timing control and easy top-level integration. These architectures have the advantage of handling synchronous and asynchronous NoC packets at a higher data rate; however, no information is given about a packing and unpacking unit to handle NoC packet transactions in the NI. The design proposed in this paper offers higher speed than the above-mentioned implementation.

In [14], a low-latency AMBA-based Master Network Interface (MNI) and Slave Network Interface (SNI) with 4-phase, 2-phase and credit-based flow control mechanisms is proposed. Because that design relies on flow control mechanisms rather than asynchronous FIFOs, simultaneous read and write operations are not performed on the same memory, which leads to higher latency and lower throughput. In [15], the network architecture exploits the AXI transaction-based protocol to remain compatible with existing IP cores. That NI architecture provides a novel dynamic buffer based on variable packet size to improve the resource utilization and performance of the NI and the NoC. Master and slave NI architectures with a Reorder-Packet Table (RPT) and a Reorder-Buffer (RB) are implemented; they provide high resource efficiency with little hardware overhead and create enough space to store incoming out-of-order packets. An NI based on the OCP IP standard bus is implemented with basic and precise/imprecise burst mode extensions to speed up NI transactions with the NoC router and the integration process [16].
A comparative study was carried out between handshake-based and credit-based flow control on the master and slave network adaptors. That implementation uses separate request and response modules with handshake and credit-based flow control, which results in higher latency than our proposed design. In [17], a low-latency NI using an OCP IP interface with a pausable clock is implemented to reduce the power dissipation of the NI, and a hibernate switching technique is used while no communication is taking place. This offers smooth communication between the OCP IP and the NoC routers by packing OCP transactions into NoC flits and converting NoC flits into OCP transactions, by computing the routing information, and by buffering the flits of packets, which improves performance with a significant power reduction. In [18], the authors designed an OCP compatible NI that uses three burst modes, chosen according to the burst data length: Precision Burst (PB), Imprecise Burst (IB) and Single Request Multiple Data burst (SRMD), with credit-based and handshake flow control mechanisms. This design uses memory sharing techniques for area gain and two-level gated clock techniques for area reduction. These OCP compatible NIs mainly use credit-based and handshake protocols with an MNI and an SNI. Because of the credit-based and 4-phase handshake protocols, these implementations have higher latency and lower throughput than the design proposed in this paper.

In [19], the authors proposed a reliable NI based on a dual-clock asynchronous FIFO, which synchronizes multiple packets from different sources to a single destination and includes a packing and unpacking unit capable of handling wormhole switching with X-Y routing. However, the design does not use any standard bus protocol to connect an entire bus-based SoC IP to the NoC router. In this paper, a generic dual-purpose WISHBONE compatible NI is proposed to handle synchronous as well as asynchronous data transfer between IPs and the NoC router.
The latency-free bus wrappers and the one-clock-cycle latency of the packing and unpacking modules allow the proposed NI to support low-latency, high-throughput data transactions such as multimedia streaming and high-speed peripherals. This is achieved by merging the bus wrapper with the packing/unpacking module and by merging the FIFO input signals into the respective bus wrapper module. To the best of the authors' knowledge, this is the first comprehensive attempt at implementing a WISHBONE compatible, asynchronous FIFO based NI for NoC on both the router-side interface and the IP-side interface with configurable packing and unpacking modules. The proposed asynchronous NI can easily adapt to any range of frequencies, irrespective of packet or flit size and clock phase differences.

3. Network interface requirements of NoC

3.1. Network interface

The NoC is composed of routers, NIs and links. A NoC router/switch transports data from its incoming ports to its outgoing ports, which connect to the adjacent router or NI. The NI module connects the processing element to the NoC router. NIs convert request and response transactions into packets and vice versa [20]. Links provide the physical connections between adjacent routers.

Fig. 1. General block diagram of NI for NoC (blocks: NoC router local port, network interface with packing module, two asynchronous FIFOs and unpacking module, IP core port).

A physical link can support more than one logical link or channel. The success of a NoC design depends on the standardization of the interfaces between the IP cores and the interconnection fabric. The general structure of the NI is shown in Fig. 1. It performs the following functions [20]:

- Writing flits to and reading flits from the processing core and the router.
- Packing the incoming signals from the IP cores by assembling the flits, inserting the header information and flit type with the exact number of flits, and unpacking the signals coming from the router as per the IP core specification.
- Transferring data from one clock domain to another without loss.

3.2. NoC packet format

A message is a contiguous group of bits that is delivered from the source terminal to the destination terminal. A message consists of packets. A packet is the basic unit for routing and sequencing. Packets may be divided into flits. A flit is the basic unit of bandwidth and storage allocation. Flits are divided into header, body and tail flits. The header flit carries the routing information: the source address, the destination address and the sequence information. Body and tail flits carry no routing or sequence information and follow the route established for the whole packet. A flit is further divided into phits (physical transfer digits). A phit is the unit transferred across a channel in a single clock cycle. These resource allocation units are handled in different layers of the network protocol for different purposes. Flits and phits are handled in the physical layer of the NI, where they are used to synchronize data transfer. There are no standard sizes specified for the resource allocation units [1].
The packets in the proposed work consist of flits that are 40 bits wide. The header flit is located at the first position of the packet and contains the source and destination addresses, the size of the packet and any other application-specific commands. The detailed description of the packet is shown in Fig. 2. These packets must traverse from the router clock domain to the processing clock domain, and vice versa, without any data loss.

3.3. Metastability

A fundamental problem in digital systems is synchronization in the absence of a global timing reference [21]. Metastability is a fundamental problem that causes system failure in digital devices when interfacing circuitry in unrelated or asynchronous clock domains [22]. It is caused by registers that do not meet the setup time (T_su) and hold time (T_h) requirements at the active edge of the clock signal. Synchronization methods are used to suppress the effects of metastability: a carefully designed synchronizer reduces the synchronization failure probability to an acceptable range [6,22]. The simplest and safest way to avoid metastability problems when crossing asynchronous clock domains is to use flip-flop synchronizers, such as the double-cascaded synchronizer, the triple synchronizer and multi-stage cascaded flip-flops [6]. The asynchronous FIFOs of the proposed NI use a double synchronizer to avoid metastability.

Fig. 2. Resource allocation unit of proposed NI for NoC.
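The double synchronizer mentioned in the metastability discussion can be illustrated with a minimal behavioral sketch. It is not the authors' RTL: the class name and the stimulus are hypothetical, and metastability itself is not modelled. The sketch only shows the delay that the two cascaded flip-flops add in the destination clock domain.

```python
class TwoFlopSynchronizer:
    """Behavioral model of a double flip-flop synchronizer for one bit.

    Metastability is not modelled; the sketch only shows the delay
    introduced by the two cascaded stages in the destination domain.
    """

    def __init__(self):
        self.ff1 = 0  # first stage, samples the asynchronous input
        self.ff2 = 0  # second stage, the value used by destination logic

    def clock(self, async_in):
        # Both flops update on the same destination-clock edge:
        # ff2 takes the previous ff1, ff1 samples the input.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2


sync = TwoFlopSynchronizer()
stimulus = [0, 1, 1, 1, 0, 0]          # input value seen at each clock edge
outputs = [sync.clock(bit) for bit in stimulus]
print(outputs)  # [0, 0, 1, 1, 1, 0]
```

The input rises at the second edge and the synchronized output rises one edge later, after the value has passed through both stages. In hardware, the first stage is the one that may go metastable, and the second stage gives it a full clock period to settle.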

4. Asynchronous FIFO features

4.1. Asynchronous FIFO

An asynchronous FIFO is a FIFO in which the clock on the read side of the buffer and the clock on the write side differ in speed and phase, that is, the two clocks are asynchronous with respect to each other. Asynchronous FIFOs are used to transfer data from one clock domain to another without any loss [6]. This requires a memory architecture with two ports, one for the input (write or push) operation and one for the output (read or pop) operation. FIFO pointers keep track of the read and write locations and also prevent overflow and underflow of the FIFO buffer. FIFOs have the inherent characteristic of synchronizing themselves to the read and write pointers. Different encoding schemes, namely binary, Gray, Johnson and one-hot, are used to encode the read and write pointers so that they can be passed from one clock domain to the other while avoiding metastability.

1. Binary encoding: In a binary counter, the number of bits that change per transition is not constant, and in half of the transitions more than one bit changes. In a 4-bit binary counter, the transition from 0111 to 1000 changes all four bits, which may result in four metastable conditions [6]. In this scenario it is impossible to predict the sampled value, and the pointer value synchronized into the other clock domain may be entirely different from the intended one. This is the biggest drawback of using binary counters as FIFO pointers [6].
2. Gray encoding: The Gray code counter is the safest counter for multi-clock designs. It allows only one bit to change per clock transition, which eliminates the problem of synchronizing multiple changing CDC bits across a clock domain. One drawback of Gray codes is that they can only be designed for mod 2^n counters [6]. Another is that arbitrary multi-bit values cannot be passed with Gray pointers; the pointer must be either incremented or decremented [6]. Gray encoding gives very low latency compared to the other encoding schemes [7].
3. One-hot encoding: This scheme uses a ring-like structure and requires N flip-flops to generate N states [8]. It allows the FIFO depth to be any number, without the restriction to powers of two that applies to Gray codes [6].
4. Johnson encoding: A Johnson counter for the read and write pointers is similar to Gray in that a single bit changes per transition, but it can represent any even number of states, whereas Gray can represent only 2^n states [7].

4.2. Generation of full and empty signals

The optimum FIFO size for an application may not be a power of two, which is a limitation when Gray or binary encoding is used. Gray coding also uses extra logic, which costs area and lowers performance. In a binary-encoded FIFO, several bits of the read and write pointers can toggle at once, which results in wrong sampling during pointer comparison [6]. Johnson and one-hot encoding of the read and write pointers address these limitations and complexities of Gray and binary encoding [6–8]. Consecutive Johnson states differ in one bit and consecutive one-hot states differ in two bits; both codes allow safe synchronization of the pointers and need minimal combinational logic. The FIFO is empty when the read pointer catches up with the write pointer, and it is full when the write pointer wraps around and catches up with the read pointer [6]. Pointers must be one bit wider than needed to address the FIFO memory. The wrap-around condition is detected using the Most Significant Bit (MSB) of the binary, Gray, one-hot and Johnson pointers.
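The pointer behaviour discussed in this section can be checked with a short sketch. The following Python is illustrative only and makes several assumptions: a FIFO depth of 8, hypothetical function names, and full/empty flags computed on binary pointer values that carry one extra wrap bit. Comparing Gray-coded pointers directly generally needs a different bit test, which is not shown here.

```python
def bin_to_gray(value):
    """Convert a binary count to its reflected Gray code."""
    return value ^ (value >> 1)


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


ADDR_BITS = 3                     # assumed: FIFO depth of 2**3 = 8 entries
ADDR_MASK = (1 << ADDR_BITS) - 1  # pointers are ADDR_BITS + 1 bits wide


def fifo_flags(wr_ptr, rd_ptr):
    """Return (full, empty) from binary pointers that carry a wrap bit."""
    same_addr = ((wr_ptr ^ rd_ptr) & ADDR_MASK) == 0
    same_wrap = (wr_ptr >> ADDR_BITS) == (rd_ptr >> ADDR_BITS)
    full = same_addr and not same_wrap   # write side is one lap ahead
    empty = same_addr and same_wrap      # both sides on the same lap
    return full, empty


# Binary 0111 -> 1000 flips four bits; the Gray equivalents differ in one.
print(bits_changed(0b0111, 0b1000))                              # 4
print(bits_changed(bin_to_gray(0b0111), bin_to_gray(0b1000)))    # 1

# Every step of a 4-bit Gray counter, including the wrap, changes one bit.
steps = [bits_changed(bin_to_gray(i), bin_to_gray((i + 1) % 16))
         for i in range(16)]
print(set(steps))                                                # {1}

print(fifo_flags(wr_ptr=0, rd_ptr=0))        # (False, True): empty
print(fifo_flags(wr_ptr=0b1000, rd_ptr=0))   # (True, False): full
```

The extra pointer bit is what distinguishes the two cases in which the address bits match: the same lap means empty, a one-lap difference means full.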
The comparison unit checks the read and write pointers for equality, excluding the MSB. If the remaining bits are equal, the FIFO must be either empty or full, and the unit then compares the MSBs of the two pointers. If the MSBs are equal, both pointers have wrapped the same number of times, so the FIFO is empty. If the MSBs are not equal, the FIFO is full [6–8].

5. Working of WISHBONE compatible network interface using asynchronous FIFO

5.1. OpenCores WISHBONE bus

The detailed WISHBONE SoC interconnection architecture is shown in Fig. 3. It offers a flexible design methodology for interconnecting semiconductor IP cores. Its main purpose is to foster design reuse by alleviating SoC integration problems, which is accomplished by creating a common interface between IP cores. This improves the portability and reliability of the system and results in faster time-to-market for the end user [1]. Commercially available SoC buses such as the Advanced Microcontroller Bus Architecture (AMBA 2.0), IBM CoreConnect, STMicroelectronics STBus, Sonics SMART Interconnect, Altera Avalon and OCP are not royalty-free, unlike the WISHBONE bus. The end user does not depend on particular implementation tools and methodologies; the specification is available in the public domain [5] and may be freely copied and distributed by any means. The NI consists of two master/slave modules, a packing module, an unpacking module and two asynchronous FIFO modules, as shown in Fig. 3. Two types of transaction take place in the NI, namely router to IP core and IP core to router.

Fig. 3. Detailed diagram of WISHBONE compatible NI for NoC.

5.2. WISHBONE master and slave

The master/slave interface module on the router side generates the signals required to write data from the NoC router into the FIFO and to read data out of the packing module; it translates the packing module signals into WISHBONE compatible signals. Similarly, the master/slave interface module on the processing core side generates the signals required to write data from the processing core into the asynchronous FIFO and to read the unpacked data out of the unpacking module; it translates the unpacking module signals into WISHBONE compatible signals. This interface performs single and block read/write operations on both sides. The WISHBONE reset signal (rst_i) is active LOW and is used to reset the IPs and the bus.

Signal and notation description

i. _i and _o denote input and output signals.
ii. rtr_ denotes a signal related to the router.
iii. rtr_s_wb denotes a signal related to the router with the WISHBONE slave.
iv. rtr_m_wb denotes a signal related to the router with the WISHBONE master.
v. ip_s_wb denotes a signal related to the IP core with the WISHBONE slave.
vi. ip_m_wb denotes a signal related to the IP core with the WISHBONE master.
vii. cyc: the cycle signal, when asserted, indicates that a valid bus cycle is in progress.
viii. stb: the strobe signal indicates a valid data transfer cycle.
ix. we: the write enable signal is LOW during READ cycles and HIGH during WRITE cycles.
x. ack indicates the termination of a normal bus cycle by the slave device.
xi. sel: the select signal indicates where valid data is expected on the data signal array during READ cycles, and where it is placed on the data signal array during WRITE cycles.

The bus configuration is achieved using the sel signal. Single READ/WRITE cycles are the basic mode of data transfer on the WISHBONE bus. The master begins a READ cycle at the rising edge of the clock. At that time it places an address on the address bus adr_o, drives the we_o line LOW to indicate a READ cycle, drives the data select line sel_o HIGH or LOW depending on the location being read, and asserts stb_o and cyc_o to indicate the start of a new bus cycle. The WRITE cycle is similar to the READ cycle, except that the master drives we_o HIGH and presents valid data on dat_o at the beginning of the cycle. In response, the slave asserts ack_o when it is ready to latch the data at the next rising edge of the clock.

WISHBONE slave data write to asynchronous FIFO

During the reset state, the ack_o signal is driven LOW with respect to the router clock. A write operation is performed when we_i is HIGH and the FIFO is not FULL. This drives ack_o HIGH, which indicates that a write transfer from the external master to the internal slave is permissible, and the data bus (dat_i), which is connected directly to the FIFO, writes the data. The write operation is not allowed when we_i and the FIFO's FULL flag are both HIGH; that is, data cannot be written when the FIFO is full. The dat_o and add_i signals are not used during this write operation. The write operation on the asynchronous FIFO by the WISHBONE slave is identical on the NoC router side and on the processing core side. The control signals of the asynchronous FIFO and of the WISHBONE bus wrappers are merged at the micro-architectural level and sampled concurrently on a single clock edge to eliminate bus wrapper latency.

WISHBONE master data read from the packing module

During the reset state, all output signals of the WISHBONE master (cyc_o, dat_o, stb_o and we_o) are driven LOW at the rising edge of the router clock. When ack_i is HIGH and fifo_empty is LOW, fifo_read is asserted HIGH to read data from the packing module and the FIFO. During this time the output signals cyc_o, dat_o, stb_o and we_o are driven HIGH. They are driven LOW when fifo_empty is HIGH. The flit_type signal, which carries the type of the flit, is connected to sel_o of the WISHBONE master. The functions of the control signals are the same as in the write operation. Thus, the latency of the NI is limited by the packing/unpacking modules and the asynchronous transmit/receive FIFOs.

Packing and unpacking module

The packing and unpacking modules perform header parsing and the assembling and disassembling of the incoming and outgoing data as per the router and processing core frequency, the packet size and the data width.

Packing module

The packing unit collects data from the asynchronous FIFO together with the necessary information for an individual packet [4], such as the source address, the destination address, the flit size and the other fields of the packet format shown in Fig. 2, and then transfers the packet to the WISHBONE master.
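The bus wrapper rules described in the two preceding subsections (slave write into the FIFO, master read on the packing side) can be condensed into a small behavioral sketch. This is not the authors' RTL. The function names are hypothetical, qualifying the write with cyc_i and stb_i follows the usual WISHBONE convention rather than an explicit statement in the text, and holding the request while data is pending but ack_i is LOW is an assumption, since the text does not cover that case.

```python
def wb_slave_write(rst_i, cyc_i, stb_i, we_i, fifo_full):
    """One clock edge of the slave wrapper: returns (ack_o, fifo_write)."""
    if rst_i == 0:                      # reset is active LOW in the text
        return 0, 0
    accept = int(bool(cyc_i and stb_i and we_i and not fifo_full))
    # ack_o and the FIFO write strobe come from the same condition, so the
    # wrapper adds no cycle of its own; dat_i feeds the FIFO directly.
    return accept, accept


def wb_master_read(rst_i, ack_i, fifo_empty):
    """One clock edge of the master wrapper on the read side."""
    if rst_i == 0 or fifo_empty:
        return {"fifo_read": 0, "cyc_o": 0, "stb_o": 0, "we_o": 0}
    # Assumption: the request is held while data is pending, and the
    # FIFO/packing read advances only when the far side acknowledges.
    return {"fifo_read": int(bool(ack_i)), "cyc_o": 1, "stb_o": 1, "we_o": 1}


print(wb_slave_write(rst_i=1, cyc_i=1, stb_i=1, we_i=1, fifo_full=0))  # (1, 1)
print(wb_slave_write(rst_i=1, cyc_i=1, stb_i=1, we_i=1, fifo_full=1))  # (0, 0)
print(wb_master_read(rst_i=1, ack_i=1, fifo_empty=0))
```

Deriving ack_o and the FIFO write strobe from one shared condition is one way to picture the merging of wrapper and FIFO control signals that the text credits for the latency-free wrapper.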
The tasks of the packing module are:

- To form the packet header with the routing information required by the routers and the destination IP core.
- To insert the flit type information, that is, whether the flit is a header, body or tail flit.
- To assign the exact sequence number in each flit and the packet size in the packet size field.
- To shape the incoming data from the asynchronous FIFO into the packet, as per the flit size, in the subsequent stages.

The Finite State Machine (FSM) of the packing module is shown in Fig. 4. The FSM has an idle state that brings all output signals to a known state. Every state transition depends on the fifo_empty signal and on the ack_i signal from the WISHBONE master interface. When ack_i is HIGH, the header insertion state is entered. In this state the header of the NoC packet is formed by inserting the flit type, the packet size, the routing information (source and destination addresses) and the sequence number; all fields are assigned as per the NoC flit format. The packet size, type and count are updated in the header insertion state. The proposed design offers the flexibility to configure all fields of the packet. When the FIFO is not empty, the read operation is initiated by asserting the read_out_p signal HIGH. From here on, the state transitions are based mainly on ps_count (flit_count), fifo_empty and ack_i. When ps_count > 1 and the FIFO is not empty, the packing module reads data from the FIFO and sends it to the WISHBONE bus during the body_flit state. The read operation enters the wait_state when fifo_empty is HIGH and ack_i is LOW. The read operation is performed as a single or block read, depending on the packet size in the header. On completion of the single or block read, when ps_count = total flit size - 1, the data flow reaches the tail_flit state. This cycle repeats for every incoming packet.
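The packing sequence described above (header insertion, body flits, tail flit) can be sketched as a plain function. The field layout here is hypothetical, since Fig. 2 is not reproduced in the text: a 2-bit flit type, a 6-bit sequence number and a 32-bit payload inside the 40-bit flit, with 8-bit source and destination addresses in the header. The body and tail type codes follow the values given for the unpacking module; the header code is an assumption. Stalls (wait_state) and the ack_i handshake are not modelled.

```python
HEADER, BODY, TAIL = 0b00, 0b01, 0b11   # BODY/TAIL as in the text; HEADER assumed
TYPE_SHIFT, SEQ_SHIFT = 38, 32          # assumed field positions in a 40-bit flit


def make_flit(flit_type, seq, payload):
    """Assemble one 40-bit flit: [39:38] type, [37:32] sequence, [31:0] payload."""
    if not (0 <= seq < 64 and 0 <= payload < (1 << 32)):
        raise ValueError("field out of range")
    return (flit_type << TYPE_SHIFT) | (seq << SEQ_SHIFT) | payload


def pack(src, dst, words):
    """Header insertion, then body flits, then one tail flit."""
    if not words:
        raise ValueError("a packet needs at least one data word")
    if not (0 <= src < 256 and 0 <= dst < 256):
        raise ValueError("address out of range")
    total = len(words) + 1                       # header plus data flits
    header = (src << 24) | (dst << 16) | total   # assumed header layout
    flits = [make_flit(HEADER, 0, header)]
    for seq, word in enumerate(words, start=1):
        flit_type = TAIL if seq == len(words) else BODY
        flits.append(make_flit(flit_type, seq, word))
    return flits


packet = pack(src=1, dst=2, words=[0xA, 0xB, 0xC])
print([hex(flit) for flit in packet])
```

The last data word is emitted as the tail flit, which mirrors the transition to tail_flit once the flit count reaches the total flit size minus one.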
The data width of the flit is configurable as per the router and IP core requirements.

Unpacking module

The unpacking unit receives data from the asynchronous FIFO and extracts the header and data information. Based on the packet size and control information, the data transfer is carried out by the respective modules in the local IP core through the IP port, via the WISHBONE master interface. The FSM of the unpacking module is shown in Fig. 5. The flit transaction depends on the fifo_empty signal from the asynchronous FIFO, the ack_i signal from the WISHBONE master interface and the flit type value of each flit. When ack_i is HIGH and the FIFO is not empty, the header extraction state is entered. The header data are extracted and the respective flit counter value is updated. The local address generator generates the address for each read from the FIFO, and the address is transferred to the address fields of the WISHBONE master to track the word count. When the flit count is greater than one and flit_type = 2'b01, the data flow reaches the Body_flit state and reads all the data from the FIFO. When flit_type = 2'b11, the FSM moves to the Tail_flit state to read the last flit of the packet. The data transferred to the WISHBONE bus is either a single body flit or a block of body flits, depending on the packet size. A wait state is introduced during the body_flit state when the FIFO is empty and ack_i is LOW. The flit type information is also transferred to the IP core through the sel_o signal of the WISHBONE master. After the tail state, the FSM returns to the idle state or the header extraction state, depending on the fifo_empty and ack_i signals.
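A matching sketch of the unpacking flow is given here, again under a hypothetical flit layout (2-bit type, 6-bit sequence number, 32-bit payload; header payload holding 8-bit source, 8-bit destination and a 16-bit packet size). The type codes 2'b01 and 2'b11 come from the text; the header code 2'b00 and all names are assumptions. The wait state and the ack_i handshake are left out.

```python
HEADER, BODY, TAIL = 0b00, 0b01, 0b11   # BODY/TAIL from the text; HEADER assumed


def split_flit(flit):
    """Return (flit_type, sequence, payload) of a 40-bit flit (assumed layout)."""
    return (flit >> 38) & 0x3, (flit >> 32) & 0x3F, flit & 0xFFFFFFFF


def unpack(flits):
    """Walk one packet through header extraction, body and tail states."""
    state = "header_extraction"
    info, words, remaining = None, [], 0
    for flit in flits:
        flit_type, _seq, payload = split_flit(flit)
        if state == "header_extraction":
            if flit_type != HEADER:
                raise ValueError("packet must start with a header flit")
            info = {"src": (payload >> 24) & 0xFF,
                    "dst": (payload >> 16) & 0xFF,
                    "size": payload & 0xFFFF}
            remaining = info["size"] - 1          # flits still expected
            state = "body_flit"
        elif state == "body_flit":
            if flit_type not in (BODY, TAIL):
                raise ValueError("unexpected flit type inside a packet")
            words.append(payload)
            remaining -= 1
            if flit_type == TAIL:
                if remaining != 0:
                    raise ValueError("tail flit does not match packet size")
                state = "idle"
        else:
            raise ValueError("flit received after the tail flit")
    return info, words


example = [(HEADER << 38) | (1 << 24) | (2 << 16) | 3,   # header: src=1, dst=2, size=3
           (BODY << 38) | (1 << 32) | 0xAA,
           (TAIL << 38) | (2 << 32) | 0xBB]
print(unpack(example))
```

Checking the flit counter against the tail flit is a small consistency test added for the sketch; the text only states that the counter is updated from the header.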

Fig. 4. FSM diagram of packing unit.

6. Verification environment for NI using Verilog

6.1. Functional verification

Functional Verification (FV), within a verification methodology, plays a significant role in verifying an IP module for a reliable RTL design because of the growing complexity of ASIC designs. FV is the process of checking whether a design satisfies the expected functional specification. The two earlier methods of verifying the correct function of a bus interface built from hardware components are creating a test bench, and creating a larger system with other known-to-work components that create or respond to bus transactions. Creating a test bench for the different transactions of a bus-based interface is a large and time-consuming task: it involves describing the connections and test vectors for all the different combinations of bus transactions. Creating a system with another register-based interface component involves describing the connections of the DUT and programming the other component to generate the various bus transactions, performing inward and outward transactions based on the DUT response. Such a system usually involves creating and compiling code and storing the code in memory for the components to read, so that they generate the correct bus transactions. Bus Functional Simulation (BFS) simplifies the verification of hardware components that attach to a bus by providing the ability to generate bus stimulus without going through the previously described approaches [23].

Fig. 5. FSM diagram of unpacking unit.

Bus functional model

A Bus Functional Model (BFM) is a behavioral, non-synthesizable model of an IP core or integrated circuit component having one or more external buses/processors. BFMs mimic the hardware device functions, such as the state machine that executes the bus operations, timing information, interrupt cycles and specific bus- or processor-oriented functions, in order to simulate system bus transactions prior to implementation of the actual hardware modules [23,24]. BFMs are usually written as tasks/functions using Hardware Description Languages (HDLs) or languages such as C, C++, SystemC, SystemVerilog, Synopsys OpenVera, Property Specification Language (PSL) and Cadence Specman e. The BFM is instantiated within the test bench as a driver that drives signals into the DUT according to the bus protocol, and the response signals are sampled by a monitor in the verification environment. Verification engineers need to know only the addresses of the bus/processor registers and the bus operation. Knowledge of the target device architecture, instructions, registers and ports is not required when using BFM components. The use of a BFM allows control over bus transactions and transaction spacing, and the ability to simulate abnormal transactions such as aborts, retries and errors [23]. BFM components of a bus interface can generate stimulus or respond to bus transactions. A master BFM generates bus transactions based on the master bus protocol, with the DUT connected as a slave that responds to the master signals. A slave BFM responds to the bus transactions that a master DUT generates. The monitor BFM reports any errors regarding the bus compliance of the DUT in master mode and slave mode respectively, as shown in Fig. 6. The verification is done for the individual and top modules of the DUT utilizing the WISHBONE BFMs.
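The idea of a master BFM exposing single and block tasks can be sketched in a few lines. This is a transaction-level illustration written in Python rather than the Verilog tasks used in the paper; the classes `SlaveMemory` and `MasterBFM` and their methods are hypothetical, and the slave is assumed to acknowledge every strobe in the same cycle. Only the signal roles (cycle, strobe, write enable, address, data, acknowledge) are taken from the WISHBONE protocol.

```python
class SlaveMemory:
    """Stand-in for a DUT acting as a WISHBONE slave (hypothetical model)."""

    def __init__(self):
        self.mem = {}

    def transfer(self, cyc, stb, we, adr, dat_w=None):
        """One bus cycle; returns (ack, dat_r)."""
        if not (cyc and stb):
            return False, None  # no valid cycle, so no acknowledge
        if we:
            self.mem[adr] = dat_w
            return True, None
        return True, self.mem.get(adr, 0)


class MasterBFM:
    """Drives single and block transactions and logs them like a monitor."""

    def __init__(self, slave):
        self.slave = slave
        self.log = []

    def single_write(self, adr, dat):
        ack, _ = self.slave.transfer(True, True, True, adr, dat)
        self.log.append(("WR", adr, dat, ack))
        return ack

    def single_read(self, adr):
        ack, dat = self.slave.transfer(True, True, False, adr)
        self.log.append(("RD", adr, dat, ack))
        return dat

    def block_write(self, adr, words):
        # Issue every write before checking the acknowledges.
        acks = [self.single_write(adr + i, w) for i, w in enumerate(words)]
        return all(acks)

    def block_read(self, adr, count):
        return [self.single_read(adr + i) for i in range(count)]
```

A test case then only needs addresses and operations, for example `MasterBFM(SlaveMemory()).block_write(0x10, [1, 2, 3])`, which matches the point that the test writer does not need knowledge of the target device internals.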
Some important formal verification processes to ensure the functional verification of the design are listed as follows [24].

i. Identification: Identify the suitability of applying FV and the nature of each subsystem to be verified, i.e. whether it is a sequential, concurrent, control or data path block in the design, by utilizing directed and random test cases. The nature of the clock to the subsystems is also identified, and the subsystems are categorized by whether they operate in synchronous or asynchronous mode.
ii. Formal test planning process: A complete test plan needs to be created stating what is to be verified and how it is to be verified. The formal properties need to be defined in terms of generic behavior, independent of particular input scenarios, and in terms of the minimal correctness criteria. The formal test plan should contain the verification requirements, the possible signal transitions should be expressed as constraints, and the test plan may use formal coverage targets. The language requirement for each subsystem should be planned early. The verification strategies need to be defined and the order of the sub-blocks listed to ease the regression process. A concise hierarchy must be maintained among the subsystems in the top module to avoid dependency issues between the subsystems.
iii. Define interface: The interface signals of each sub-module, the internal signals and the signals of interest to be monitored are listed in a table to determine the completeness of the requirement checklist during the review process. The asynchronous and synchronous interfaces of all subsystems must be listed.

Fig. 6. Verification environment of NI for NoC. (Blocks shown: test code with global packages of tasks and functions, test bench wrapper, WISHBONE master and slave BFMs, design under test (DUT) with WISHBONE master and slave interfaces, stimulus generation and driving, response monitoring and checking.)

The verification environment shown in Fig. 6 is used to verify the WISHBONE-based asynchronous NI. It consists of a test bench top module having the following parts:
1. Master and slave BFMs.
2. Testbench top module having a group of test cases.

WISHBONE master BFM

The master bus function initiates the data transactions to the connected slave-mode device as per the master bus protocol. Bus reset, single read, single write, block read and block write tasks are the important operations in the WISHBONE bus functional model. The master BFM performs only write operations in the proposed verification plan.

WISHBONE slave BFM

The WISHBONE slave BFM performs single and block read operations and responds to the data transactions of the master with the specified data size and range. The slave BFM performs only read operations in the proposed verification environment. The reset task and the delay insertion task are common to the WISHBONE master and slave BFMs.

Verifying individual modules

The reset conditions of all subsystems are verified. The asynchronous FIFO plays a major role in the proposed NI. The full and empty conditions of the individual FIFOs are verified by simultaneous read and write operations with various ranges of clock speed and different sizes of data burst. The individual WISHBONE master and slave interfaces are verified by performing single read/write and multiple read/write operations and monitoring the control signal status against the expected response.
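The full and empty checks on a Gray-pointer FIFO can be illustrated with a small reference model of the kind a test bench might compare against. This is a sketch under assumptions, not the verified RTL: it uses the widely used scheme of pointers with one extra wrap bit, where empty means the Gray pointers are equal and full means they differ only in their two most significant bits. It also ignores the synchronizer delay between clock domains, and the class name `GrayPointerFifo` is hypothetical.

```python
def bin_to_gray(b):
    return b ^ (b >> 1)


def gray_to_bin(g):
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def one_bit_steps(bits):
    """True if consecutive Gray codes (including wrap-around) differ in one bit."""
    n = 1 << bits
    return all(
        bin(bin_to_gray(i) ^ bin_to_gray((i + 1) % n)).count("1") == 1
        for i in range(n)
    )


class GrayPointerFifo:
    """Zero-delay reference model; depth must be a power of two."""

    def __init__(self, depth):
        if depth < 2 or depth & (depth - 1):
            raise ValueError("depth must be a power of two >= 2")
        self.depth = depth
        self.addr_bits = depth.bit_length() - 1
        self.ptr_mask = 2 * depth - 1  # pointers carry one extra wrap bit
        self.mem = [None] * depth
        self.wptr = 0
        self.rptr = 0

    @property
    def empty(self):
        return bin_to_gray(self.wptr) == bin_to_gray(self.rptr)

    @property
    def full(self):
        # Full when the Gray pointers differ only in the top two bits.
        flip = 0b11 << (self.addr_bits - 1)
        return bin_to_gray(self.wptr) == (bin_to_gray(self.rptr) ^ flip)

    def write(self, word):
        if self.full:
            return False
        self.mem[self.wptr & (self.depth - 1)] = word
        self.wptr = (self.wptr + 1) & self.ptr_mask
        return True

    def read(self):
        if self.empty:
            return None
        word = self.mem[self.rptr & (self.depth - 1)]
        self.rptr = (self.rptr + 1) & self.ptr_mask
        return word
```

Filling the model to its depth, attempting one more write, and then draining it exercises both flags in the same way as the burst tests described above.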


The packing module is verified with the possible combinations of header flits, data flits and tail flits. The responses of the unpacking module are checked against the header, payload and tail during the header extraction stage.

Verifying top level module

The following scenarios are verified in the overall integrated verification environment:
i. The data, status and control signals are verified during the reset state of the entire NI.
ii. Single read and block read operations are performed with and without the acknowledgment signal, and the responses are checked.
iii. Single write and block write operations are performed with and without the write enable signal, and the responses are checked.
iv. Write operations are performed with different data widths (8, 16, 32 and 64 bits), and the data read back are checked against the expected values.
v. Write operations are performed with different block sizes (8, 16, 32, 64, 128, 256 and 512 word blocks), and the data read back are checked against the expected values.
vi. All the test cases are exercised with the read frequency greater than the write frequency and vice versa.
vii. All the above operations are performed with different network interface and router frequencies.

Design under test

The complete RTL description or gate-level netlist of a system under verification is called the design under test (DUT). The DUT receives stimulus from the generator via the driver, and the response checking module checks the output of the DUT for correct operation as per the design specification. The stimulus driving the DUT can be generated in many different formats and from different sources under different scenarios, but it is primarily focused on exercising the DUT output with a known response.

Fig. 7. Simulated waveform of all types of asynchronous FIFO.
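The top-level scenarios listed above amount to a cross product of data widths, block sizes and clock relations. A short sketch of how such a test matrix could be enumerated is given here; the data widths and block sizes are those stated in the text, while the names `build_test_matrix` and `expected_words`, the clock-relation labels and the incrementing data pattern are assumptions for illustration.

```python
from itertools import product

DATA_WIDTHS = (8, 16, 32, 64)                    # bits, from the text
BLOCK_SIZES = (8, 16, 32, 64, 128, 256, 512)     # words, from the text
CLOCK_RELATIONS = ("read_faster", "write_faster")  # hypothetical labels


def build_test_matrix():
    """Enumerate every width / block size / clock relation combination."""
    return [
        {"width": w, "block": b, "clocks": c}
        for w, b, c in product(DATA_WIDTHS, BLOCK_SIZES, CLOCK_RELATIONS)
    ]


def expected_words(width, block):
    """Assumed stimulus: an incrementing counter truncated to the bus width."""
    mask = (1 << width) - 1
    return [i & mask for i in range(block)]
```

Each entry of the matrix would be written through the master BFM and read back through the slave BFM, with the read data compared against `expected_words`.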

7. Results and comparisons with existing work

7.1. Simulation results

The simulated waveforms of the read clock, write clock, input data, output data, internal read and write pointers, and empty and full signals of all four types of asynchronous FIFO, obtained using ModelSim, are shown in Fig. 7. The read and write clocks of the differently encoded asynchronous FIFOs are connected to a single read clock and a single write clock in the top-level module of the testbench. The speed ratio is depicted as 2:3 in Fig. 7. Similarly, a range of different read and write clocks was applied and the output responses were checked. The individual reset signals of the read and write clocks of all FIFOs are not shown in Fig. 7 for simplicity; however, the individual FIFO clocks are connected to read_clk and write_clk. The data input signal of the Gray encoded FIFO is driven with the series 1, 2, 3, 4, 5,..., the one-hot encoded FIFO input signal with 2, 4, 6, 8, 12,..., the Johnson encoded FIFO input signal with 3, 6, 9, 12, 16,... and the binary encoded FIFO input signal with 4, 8, 12, 16, 20,... The expected data of the different FIFOs are available at the respective output data signals.

Synthesis results

The WISHBONE compatible NI is synthesized using Synopsys Design Compiler targeted to the STMicroelectronics CORE90GPSVT 90 nm CMOS standard cell library with the nominal corner at 1.0 V and 25 °C. The asynchronous FIFO size/depth has the major impact on the performance of the NI. The speed, area and power analysis of the different types of FIFO has been done with different FIFO sizes. The results are shown in Table 1 and Figs. 8, 9, 11, 13 and 15. The performance of the NI utilizing the Gray encoded FIFO is shown in Table 2 and Figs. 10, 14 and 16.

Table 1. FIFO depth vs. area, power and clock speed. (Columns: FIFO type; FIFO depth; read clock speed of FIFO in MHz; write clock speed of FIFO in MHz; total area in µm²; static power in µW; dynamic power in mW; total power in mW. Rows: Gray encoded FIFO, binary encoded FIFO, Johnson encoded FIFO, one-hot encoded FIFO.)

Speed

Performing static timing analysis (STA) across asynchronous clock domains is crucial for estimating the exact speed of the system, because of the frequency and phase relationship between the clocks. The following two important constraints must be set to achieve correct timing of asynchronous FIFO based designs [25].

Fig. 8. Read clock speed vs. FIFO depth.
Fig. 9. Write clock speed vs. FIFO depth.
Fig. 10. Read, write speed of router and IP vs. FIFO depth of NI.

The two constraints are identifying the false paths and identifying the asynchronous clock groups. A false path is a logic path in the design that should not be analyzed for timing. The following paths should be set as false paths to avoid timing failures in the synchronization registers of the asynchronous FIFO:
i. The paths crossing from the write clock domain into the read clock domain, between the synchronized delayed write pointer registers and the read pointer registers of the asynchronous FIFO.
ii. The paths crossing from the read clock domain into the write clock domain, between the synchronized delayed read pointer registers and the write pointer registers of the asynchronous FIFO.

Unrelated clocks must be grouped together and set as a constraint for an asynchronous group. In an asynchronous FIFO, the read domain and write domain clocks are grouped as an asynchronous group for STA. Similarly, in the integrated NI, the router clock domains and the IP clock domains are grouped together and the asynchronous clock constraint is set. The false path constraint must be set on the unrelated paths of the NI for STA.

Fig. 11. Static power dissipation vs. FIFO depth.
Fig. 12. Dynamic power dissipation vs. FIFO depth.
Fig. 13. Total power dissipation vs. FIFO depth.

Comparative analyses of the four types of encoded asynchronous FIFOs are shown in Table 1. It is observed that the read and write clock speeds gradually decrease as the depth increases for all types of FIFO, as shown in Table 1 and Figs. 8 and 9. An increase in the depth of the FIFO increases the bit width of the binary counter as well as of the Gray, one-hot and Johnson encoders for the read and write pointers. Gray encoding has an edge over the other FIFO encoding techniques in write and read speeds, as it uses the lowest number of bits and has low switching activity between two consecutive address jumps. The read clock speed shown in Fig. 8 is always higher than the write clock speed shown in Fig. 9, because the read logic is less complex for all types of FIFO. When it comes to write speeds, in one-hot encoding the switching activity is always constant, but the number of bits for encoding increases as the FIFO depth increases. Since the Johnson encoding technique uses half the number of bits of one-hot encoding, with varied switching activity, it is observed that at higher FIFO depths Johnson encoding performs well compared to one-hot encoding. Binary encoding has the advantage of using fewer bits, but with higher switching activity. At lower FIFO depths the binary encoding technique may suffer, but as the FIFO depth increases it gains the advantage of using fewer bits. These situations are clearly depicted in Figs. 8 and 9, which are based on the values taken from Table 1.

Fig. 14. Total power dissipation of NI vs. FIFO depth.
Fig. 15. Area of FIFOs vs. FIFO depth.

Table 2. FIFO depth of network interface vs. area, power and clock speed of network interface. (Columns: FIFO depth; IP clock speed in MHz; router clock speed in MHz; total area in µm²; static power in µW; dynamic power in mW; total power in mW. Rows: network interface using Gray encoded FIFO.)

The speed of the whole NI with respect to the FIFO depth is shown in Fig. 10 and Table 2. The figures are drawn based on the Table 1 values. The variation of the graphs depends on the read clock and write clock with respect to the depth of the FIFO. When the FIFO depth increases, the memory of the FIFO occupies more area, consumes more power and reduces the speed (both read and write clock speed) of the NI. The difference in the switching activity of the different encoding schemes and the small variations between write and read clocks do not require a high FIFO depth. When the FIFO depth is 4 or 8, the read and write pointers traverse almost all addresses, causing more switching activity, which results in the frequency fluctuations.
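The trade-off between pointer width and switching activity discussed above can be made concrete by counting bits and bit toggles for textbook versions of the four encodings. This is an illustrative sketch only: it counts address bits without any wrap bit, uses idealized counting sequences, and does not model the binary-to-code conversion logic of the synthesized pointers, so its numbers are not a substitute for the measured results in Table 1. The function names are hypothetical.

```python
def toggles(a, b):
    """Number of bit positions that differ between two code words."""
    return bin(a ^ b).count("1")


def sequences(depth):
    """Return {name: (width_in_bits, code sequence)} for a power-of-two depth."""
    bits = depth.bit_length() - 1
    half = depth // 2
    johnson, state = [], 0
    for _ in range(depth):
        johnson.append(state)
        msb = (state >> (half - 1)) & 1
        # Johnson (twisted ring) counter: shift left, feed back inverted MSB.
        state = ((state << 1) | (msb ^ 1)) & ((1 << half) - 1)
    return {
        "binary": (bits, list(range(depth))),
        "gray": (bits, [i ^ (i >> 1) for i in range(depth)]),
        "johnson": (half, johnson),
        "one_hot": (depth, [1 << i for i in range(depth)]),
    }


def summarize(depth):
    """Map each encoding to (width, average toggles per increment)."""
    out = {}
    for name, (width, seq) in sequences(depth).items():
        total = sum(toggles(seq[i], seq[(i + 1) % depth]) for i in range(depth))
        out[name] = (width, total / depth)
    return out
```

Under these assumptions `summarize(8)` reports that binary and Gray both need 3 bits, Johnson 4 and one-hot 8, while the average toggles per step are 1.75 for binary, 2 for one-hot and 1 for both Gray and the idealized Johnson counter.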
Even when the FIFO depth is larger, the read and write pointers traverse only a few addresses, as in the low-depth case, because of the small frequency difference between the read and write clocks. This makes the read clock speed remain almost constant in many FIFOs, as shown in Fig. 8. The write speed of the binary counter is very high at low FIFO depths because fewer bits are used in the binary counter. The other encoding schemes require a conversion (i.e. binary to Gray, binary to Johnson and binary to one-hot) in the read and write pointers, which results in a gradual decrease in write speed, whereas the binary encoding scheme remains constant irrespective of the depth, as shown in Fig. 9. The fluctuations depend on the switching activity of the encoding schemes and the depth of the FIFO. The IP clock speed is a bit lower than the router clock speed because of the higher logic complexity of the packing unit at the IP core side compared with the unpacking unit at the router side, as shown in Fig. 10 and Table 2.

Power

The total power dissipation in any VLSI system is the sum of the switching power, short circuit power and leakage power. The short circuit and leakage power are referred to as the static power and the switching power as the dynamic power. The static power of the binary encoded FIFO is lower by 6 mW, 34 µW and 16 µW compared to the Gray, Johnson and one-hot encoded FIFOs respectively. The static power dissipation is low because binary read and write pointers are used without any encoding scheme, which reduces the number of gates compared to the other FIFOs, as shown in Fig. 11. Dynamic power dissipation refers to the power consumed by a CMOS gate as a result of charging and discharging the output capacitance and some of the internodal capacitances. The dynamic power dissipation is low in the Gray encoded FIFO because the number of toggles is low during read pointer and write pointer increments, as shown in Fig. 12. The total power dissipation of the Gray encoded FIFO is low, as shown in Fig. 13, and the total power dissipation of the complete NI is shown in Fig. 14.

Area

The area of the asynchronous FIFO and of the complete NI increases proportionally with the depth of the FIFO, as shown in Figs. 15 and 16. An increase in FIFO size increases the memory unit and the number of bits in the read and write pointers of the NI, which increases the area of the entire module.

Fig. 16. Area of NI vs. FIFO depth.

Latency

The latency is defined as the number of clock edges that occur after a read or a write operation before the signal is updated. The latency from the write clock domain to the read clock domain differs from that in the opposite direction because of the asynchronous nature of the design. The total latency of the NI is the sum of the latency caused by the double synchronizer flip-flops on the read and write pointers of the FIFO, the latency of the packing/unpacking units, and the latency of the WISHBONE master/slave wrappers in the proposed design. Each flip-flop in the synchronizer of the asynchronous FIFO adds a latency of one clock cycle. In the proposed NI design, the asynchronous FIFO has been implemented so that the number of flip-flops in the synchronizer is configurable as per the design requirement. The packing and unpacking modules and the bus wrappers use the common reset signal and sample the data at the same rising edge of their respective clock domains with respect to the full and empty signals of the asynchronous FIFOs. The bus wrapper is designed to be latency free and is directly coupled with the asynchronous FIFO and the packing/unpacking modules. The total latency of the NI is therefore limited by the asynchronous FIFO and the packing/unpacking modules. The latency is negligible in the burst transfer mode, where it is shared by a large number of flits, compared with flit-by-flit transfer. A cycle-accurate simulation of the entire set of modules has been done using ModelSim 10.0b to find the exact value of the latency in nanoseconds.
The calculation of latency/throughput on the receiver side (IP to router) is as follows:

\[
\begin{aligned}
\text{Total latency} &= \text{WISHBONE wrapper latency at the IP side} + \text{FIFO latency} + \text{Unpacking latency} \\
&\quad + \text{WISHBONE bus wrapper latency at the router side} \\
&= 0\ \text{ns} + 3.04\ \text{ns} + 2.43\ \text{ns} + 0\ \text{ns} = 5.43\ \text{ns} \approx 2\ \text{clock cycles}
\end{aligned}
\]

(when only one flip-flop is used for synchronization).

The calculation of latency/throughput on the transmitter side (router to IP) is as follows:

\[
\begin{aligned}
\text{Total latency} &= \text{bus wrapper latency at the router side} + \text{FIFO latency} + \text{Unpacking latency} \\
&\quad + \text{bus wrapper latency at the IP side} \\
&= 0\ \text{ns} + 4.02\ \text{ns} + 0 + 2.32\ \text{ns} + 0 = 6.34\ \text{ns} \approx 2\ \text{clock cycles}
\end{aligned}
\]

(when only one flip-flop is used for synchronization). For asynchronous clock domains it is more meaningful to state the latency in nanoseconds than in number of clock cycles.

Throughput

The NoC data transmission occurs between the router and the IP core. The throughput from the NoC router to the IP core differs from that from the IP core to the router because of the packing and unpacking units. The proposed low latency NI improves the throughput and the overall performance. Table 4 shows the throughput of the NI in the router-to-IP and IP-to-router directions. Throughput is defined as the total number of flits processed by the NI per second:

\[
\text{Throughput} = \frac{1}{\text{latency} \times (\text{flit}/\text{clock})}
\]

The latency is two clock cycles in the proposed design. For example, when the FIFO depth of the NI is 8, the respective frequency is 2000 MHz and the flit size is 32 bits:

\[
\text{Throughput} = \frac{1}{2 \times \left(1 / (2000 \times 10^{6})\right)} = 10^{9}\ \text{flits/s} = 1000\ \text{Mflits/s} = 32{,}000\ \text{Mbits/s}
\]

7.3. Individual module performance of proposed NI

The performance of the individual modules is shown in Table 3. The WISHBONE master/slave wrappers offer higher speed than the other modules because of their simplicity. The packing and unpacking modules offer lower speed and consume more registers and LUTs than the asynchronous FIFO and the WISHBONE wrappers.
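The throughput example given earlier in this section (2000 MHz, two cycles of latency, 32-bit flits) and the router-to-IP latency sum can be reproduced with a few lines of arithmetic. The function names are hypothetical, and the sketch simply restates the formula as clock frequency divided by latency in cycles.

```python
def throughput_mflits(freq_mhz, latency_cycles):
    """Throughput = 1 / (latency x clock period) = frequency / latency.

    With the frequency in MHz the result is in Mflits/s.
    """
    return freq_mhz / latency_cycles


def throughput_mbits(freq_mhz, latency_cycles, flit_bits):
    """Same figure expressed in Mbit/s for a given flit width."""
    return throughput_mflits(freq_mhz, latency_cycles) * flit_bits


def total_latency_ns(parts_ns):
    """Sum of the per-stage latencies, rounded to two decimals."""
    return round(sum(parts_ns), 2)
```

For instance, `throughput_mflits(2000, 2)` gives 1000 Mflits/s, `throughput_mbits(2000, 2, 32)` gives 32,000 Mbit/s, and `total_latency_ns([0, 4.02, 0, 2.32, 0])` gives the 6.34 ns quoted for the router-to-IP direction.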
Note that the overall register and LUT counts of the system are not equal to the sum of the register and LUT counts of the individual modules when the modules are synthesized separately, as shown in Table 3.

Performance comparison analysis with existing NI

The proposed NI design is compared with the existing NI designs in ASIC [10,19] and in FPGA [14,16–18] as the target technology/device, as shown in Table 4. Avnet's Xilinx Virtex-5 LX Evaluation Kit with the device XC5VLX50-1FF676 is used for the NI implementation, and the target device Virtex-5 XC5VLX30 is also selected during post-synthesis so that the design can be compared with the existing devices without using a figure of merit. A clear comparison between the previous works and the proposed work is given in Table 4. S. no. 3 and S. no. 8 state our results for the ASIC and FPGA design processes respectively. With the STM 90 nm device technology, our design resulted in a throughput of 1000 Mflits/s at a frequency of 2 GHz, which is an improvement of 4.5 times compared to the DMA-based NI. An exact comparison with the existing implementations is very difficult because the existing works were implemented in different ASIC technologies and targeted different FPGA devices. Even though the proposed design is targeted to STMicroelectronics 90 nm CMOS technology, as shown in S. no. 3 in Table 4, the entire design is also synthesized on the same FPGA target device as the existing work, as shown in S. no. 8 in Table 4. This avoids complex calculations that use technology scaling to convert performance metrics from ASIC to FPGA or vice versa, and keeps the comparison on an equivalent basis. Compared to the NI architecture for NePA [10], the proposed design is better by 170% in speed and by 10% in area. The proposed design offers a low latency of 2 cycles instead of 4 and 5 cycles, and its throughput is 1000 Mflits/s instead of the 179 Mflits/s (at a latency of 4 cycles) and 143 Mflits/s (at a latency of 5 cycles) of the best performing existing ASIC design, as shown in Table 4.

Table 3. Individual sub-module's area and speed of proposed NI design targeted to the Xilinx FPGA. (Columns: submodule; number of slice registers; number of slice LUTs; number of fully used LUT-FF pairs; speed in MHz. Rows: unpacking unit; WISHBONE master slave wrapper at IP end; WISHBONE master slave wrapper at router end; RX asynchronous FIFO, speed Wr-622/Rd-648; TX asynchronous FIFO, speed Wr-622/Rd-648; packing unit; overall area, speed IP-430/RTR-412.)

Table 4. Comparison with existing works.

Columns: S. no.; Ref. no.; Bus/module; Modes/method; Target device technology; Speed (MHz); ASIC area (µm²); FPGA area as number of slice registers (NSR) and number of slice LUTs (NSLUTS); Power (mW); Latency (cycles); Throughput (Mflits/s).

1 [10] DMA Buffered/unbuffered TSMC 90 nm , /5 179/143
2 [20] NS Asynchronous FIFO 180 nm , NA -
3 This work WB Asynchronous FIFO STM 90 nm ,
[15] AHB MNI-4p/2p/cb xc5vlx NSR-6232/6242/ /5578/9694 3/3/3 103/103/103 SNI-4p/2p/cb 262 NSLUTS-611/586/ /6624/1068 6/4/3 43/65/87 NSR-7792/7782/7622 NSLUTS-906/890/846
5 [17] OCP MNA-hs/cb xc5vlx30 463/331 - NSR-473/590 26/30 3/3 154/110 SNA-hs/cb 354/370 NSLUTS-338/391 28/30 10/4 118/123 NSR-772/649 NSLUTS-694/531
6 [18] OCP MNI-hs/cb xc5vlx30 462/330 NSR-601/590 60/118 3/3 154/110 StoppableMNI-hs/cb 361/260 NSLUTS-356/590 22/74 3/3 120/86 NSR-743/602 NSLUTS-666/393
7 [19] OCP MNA-hs/cb xc5vlx30 378/246 - NSR-1031/ / /82 SNA-hs/cb 309/320 NSLUTS-772/811 20/ /106 (PB, IB and SRMD) NSR-1579/1293 NSLUTS-2057/
This work WB Asynchronous FIFO xc5vlx30 IP NSR xc5vlx50 RTR-412 NSLUTS

NS: Non-Specific. WB: WISHBONE. AHB: AMBA 2.0 AHB. 4p: 4-phase. 2p: 2-phase. cb: credit based. PB: Precise Burst. IB: Imprecise Burst. SRMD: Single Request Multiple Data burst. NA: Not Available. MNA: Master Network Adapter. SNA: Slave Network Adapter. MNI: Master Network Interface. SNI: Slave Network Interface.

In the FPGA-based NI architectures, the authors used standard buses to transfer the data from the router to the IP core using an MNA or MNI, and from the IP core to the router using an SNA or SNI, employing handshake and credit-based flow control with or without power saving modes. In a practical NoC fabric design, the routers operate at the single frequency of a router clock domain, and the IP core or a set of standard-bus-connected IP cores operates at the single frequency of the IP clock domain. The worst-case frequency is the minimum of the router-side frequency and the IP-core-side frequency. The maximum operating frequency of the router (RTR) clock domain of the proposed design is 412 MHz and that of the IP clock domain is 430 MHz. To calculate the throughput, designers need to consider the lower of the two frequencies (i.e. 412 MHz). Compared to the FPGA-targeted NI architectures [14,16–18], the proposed design is better by 16.38% in speed, by 52.58% in number of slice registers and by 12.29% in number of slice LUTs. The proposed design offers a low latency of 2 cycles instead of 3 cycles, and the throughput is improved by 33.76% relative to the best performing design [16] among the existing NI architectures, as shown in Table 4. Compared to all existing designs, the proposed design offers a low latency of 2 cycles, which is the main reason for the improvement in throughput. For all comparisons the FIFO depth is kept at 8. The comparison results show that the proposed architecture outperforms all the other architectures in terms of speed, area, latency and throughput.

8. Conclusion

In order to speed up the data transfer in the NI for NoC, a generic asynchronous FIFO-based, WISHBONE compatible, plug and play NI for NoC design is presented in this paper.
The existing AMBA, OCP and DMA-based NIs, which utilize many types of handshake and credit-based flow control, have high latency, low throughput and low speed. Compared to the existing designs, the proposed NI offers lower latency because of its latency-free wrappers, the merged micro-level architecture of the packing/unpacking modules with one clock cycle of latency, and the asynchronous FIFO with an optimum latency of one clock cycle. The proposed NI performs well irrespective of the router and processing core frequencies and phase differences. This NI offers easy integration of existing WISHBONE compatible IPs and other IPs with minimal manual effort in a shorter design cycle. The proposed design has been implemented in an STMicroelectronics 90 nm CMOS standard cell library, and the entire design is verified in a BFM-based constrained random verification environment using Verilog-HDL. Experimental results show that the proposed NI offers a low latency of 2 clock cycles, 4.5 times higher throughput than the best available ASIC implementation and 33.76% higher throughput than the best available FPGA implementation. The speed of the proposed NI is increased by 170% in the ASIC design and by 16% in the FPGA-based design, and the area is reduced by 10% in ASIC and by 52% and 12% in number of slice registers and LUTs respectively in the FPGA-based design.

Acknowledgements

This research was partially supported by the Canadian Bureau for International Education (CBIE) on behalf of Foreign Affairs and International Trade, Canada (DFAIT), under the Canadian Commonwealth Exchange Program-Asia Pacific (formerly GSEP), which is gratefully acknowledged.

References

[1] Dally W, Towles B. Principles and practices of interconnection networks. San Francisco (CA): Morgan Kaufmann Pub.
[2] Dally W, Towles B. Route packets not wires: on-chip interconnection networks. In: Annual design automation conference.
[3] Benini L, De Micheli G. Networks on chips: a new SoC paradigm. IEEE Comput 2002;35:70–8.
[4] Singh Sanjay Pratap, Bhoj Shilpa, Balasubramanian Dheera, Nagda Tanvi, Bhatia Dinesh, Balsara Poras. Generic network interfaces for plug and play NoC based architecture. In: Lecture notes in computer science, Springer, reconfigurable computing: architectures and applications.
[5] OpenCores WISHBONE Specification. WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores. OpenCores. rev. B.4.
[6] Cummings C, Alfke P. Simulation and synthesis techniques for asynchronous FIFO design with asynchronous pointer comparison. In: SNUG.
[7] Rahmani AM, Liljeberg O, Plosila J, Tenhunen H. An efficient VFI-based NoC architecture using Johnson-encoded reconfigurable FIFOs. In: IEEE international norchip conference.
[8] Fattah M, Manian A, Rahimi A, Mohammadi S. A high throughput low power FIFO used for GALS NoC buffers. In: IEEE annual symposium on VLSI.
[9] Bergeron J. Writing testbenches: functional verification of HDL models. Norwell (MA): Kluwer Academic Publishers.
[10] Lee Seung Eun, Bahn Jun Ho, Yang Yoon Seok, Bagherzadeh Nader. A generic network interface architecture for a networked processor array (NePA). In: ARCS.
[11] Lai Yong-Long, Yang Shyue-Wen, Sheu Ming-Hwa, Hwang Yin-Tsung, Tang Hui-Yu, Huang Pin-Zhang. A high-speed network interface design for packet-based NoC. In: IEEE ICCCAS.
[12] Beigne E, Vivet P. Design of on-chip and off-chip interfaces for a GALS NoC architecture. In: 12th IEEE international symposium on asynchronous circuits and systems.
[13] Thonnart Yvain, Beigné Edith, Vivet Pascal. Design and implementation of a GALS adapter for a NoC based architectures. In: ASYNC.
[14] Attia Brahim, Chouchene Wissem, Zitouni A, Nourdin A, Tourki R. Design and implementation of low latency network interface for network on chip. In: International conference on design and test workshop.
[15] Ebrahimi Masoumeh, Daneshtalab Masoud, Sreejesh NP, Liljeberg Pasi, Tenhunen Hannu. Efficient network interface architecture for network-on-chips. In: NORCHIP.
[16] Attia B, Zitouni A, Tourki R. Design and implementation of network interface compatible OCP for packet based NOC. In: International conference on design & technology of integrated systems in nanoscale era.
[17] Chouchene W, Attia B, Zitouni A, Abid N, Tourki R. A low power network interface for network on chip. In: International multi-conference on systems, signals and devices; p. 1–6.


[18] Attia B, Chouchene Wissem, Zitouni Abdelkrim, Tourki Rached. Network interface sharing for SoCs based NoC. In: International conference on communications, computing and control applications; p
[19] Matos D, Costa M, Carro L, Susin A, Matos D, et al. Network interface to synchronize multiple packets on NoC-based systems-on-chip. In: VLSI system on chip conference; 2010. p. 31–6.
[20] Atienza D, Angiolini F, Murali S, Pullini A, Benini L, De Micheli G. Network-on-chip design and synthesis outlook. Integration of VLSI J 2008;41(3):
[21] Apperson RW, Yu Zhiyi, Meeuwsen MJ, Mohsenin T, Baas BM. A scalable dual-clock FIFO for data transfers between arbitrary and haltable clock domains. IEEE Trans Very Large Scale Int (VLSI) Syst 2007;15(10):
[22] Dally W, Poulton J. Digital systems engineering. Cambridge, UK: Cambridge Univ Press;
[23] Xilinx verification document. BFM Simulation in Platform Studio. Xilinx Inc;
[24] Gregg D, Tim L. Designing procedural-based behavioral bus functional models for high performance verification. In: SNUG;
[25] Altera design document. SCFIFO and DCFIFO mega functions. Altera Inc

K. Swaminathan received his B.E. degree in Electrical and Electronics Engineering from IRTT, Erode, Bharathiar University, India. He received the M.E. degree (VLSI Design) from Govt. College of Technology, Coimbatore, India. He is currently pursuing his Ph.D. degree at NIT, Tiruchirappalli. His research interests include System on Chip, Network on Chip and Design/Verification of Complex Digital Systems.

G. Lakshminarayanan received the M.E. and Ph.D. degrees in Electronics and Communication Engineering from Bharathidasan University, Tiruchirappalli, India, in 1995 and 2005, respectively. He is currently working as an Associate Professor in the Department of ECE, NIT, Tiruchirappalli. His current research interests include Reconfigurable Systems, VLSI based Wireless System Design, Algorithms and Techniques for Cognitive Radio and Network on Chip.

Seok-Bum Ko received his Ph.D. in Electrical & Computer Engineering at the URI, USA. He is currently an Associate Professor in the Department of Electrical & Computer Engineering at the University of Saskatchewan, Canada. His research interests include efficient hardware implementation of computer systems, computer arithmetic, digital design automation and computer architecture. He is a senior member of IEEE Computer Society.
