IJMTES International Journal of Modern Trends in Engineering and Science ISSN:


Low-Latency Wormhole Routers for On-Chip Networks

N. Rajesh Kumar 1, A. Dhivya 2, K. Kalaivani 3, R. Sindhu 4
1 (Assistant Professor, Pollachi Institute of Engineering and Technology, Poosaripatti, rajeshkumar.n1@gmail.com)
2 (UG scholar, Pollachi Institute of Engineering and Technology, Poosaripatti, dhivimohan95@gmail.com)
3 (UG scholar, Pollachi Institute of Engineering and Technology, Poosaripatti, kalaiec96@gmail.com)
4 (UG scholar, Pollachi Institute of Engineering and Technology, Poosaripatti, sindhuec95@gmail.com)

Abstract
On-chip routers typically have buffers devoted to their input or output ports for temporarily storing packets. Buffers, unfortunately, consume significant portions of router area and power budgets. In the previous design, many buffer queues in the network are empty while other queues are mostly busy, and only 1 byte of data can be transferred, in one direction. In this article the design shares the buffers among the virtual channels to maximize buffer utilization. This approach is more efficient because it gives high throughput and low queuing delays under heavy loads. The latency of the routing node is used to evaluate how efficient the buffer design in the input module is. Wormhole routing is a network flow control mechanism which decomposes a packet into smaller flits and delivers the flits in a pipelined fashion. It has better performance and smaller buffering requirements. We conclude that more bytes of data can be transferred bidirectionally without error by using error detection and correction.

Keywords
On-chip network, router architecture, shared buffer, wormhole routing.

1. INTRODUCTION
On-chip systems are moving towards multicore designs to take advantage of technology scaling and to speed up system performance through increased parallelism, given that the power wall limits further increases in clock frequency.
A network on chip (NoC) is a communication subsystem on an integrated circuit (commonly called a "chip"), typically between intellectual property (IP) cores in a system on a chip (SoC). NoCs can span synchronous and asynchronous clock domains or use unclocked asynchronous logic. NoC technology applies networking theory and methods to on-chip communication and brings notable improvements over conventional bus and crossbar interconnections. NoC improves the scalability of SoCs, and the power efficiency of complex SoCs, compared to other designs. Networks on chip have been shown to be feasible and easy to scale for supporting a large number of processing elements.

Fig. 1. Chip multiprocessors interconnected by a network of VC routers. NI: network interface. R: router

A multicore system in which chip multiprocessors are interconnected by a network of VC routers is shown in Fig. 1. Each router has five ports that connect to four neighboring routers and the local processor. A network interface sits between a processor and its router and transforms processor messages into packets to be transmitted on the network.

A router is a network device that forwards data packets from one network to another. Based on the address of the receiving network in the incoming packet and an internal routing table, the router determines which port (line) to send the packet out on (ports typically link to Ethernet cables). Routers require packets structured in a routable protocol, the global standard being TCP/IP.

Bufferless routers remove buffers from the router and hence save much area. However, their performance becomes poor when packet injection rates are high. Because they have no buffers, previous router designs proposed to drop and retransmit packets, or to deflect them once network contention occurs, which can consume even more energy per packet than a router with buffers. In the previous paper, only 1 byte of data can be transferred, in one direction.
Buffer utilization is not efficient because each port uses separate buffers for storing packets temporarily. In addition, if an error occurs during the transmission of packets, it cannot be detected. In this paper, we propose to transfer 5 bytes of data bidirectionally with efficient buffer utilization by using a shared buffer. Error detection and error correction methods are used to detect and correct errors during data transmission.

2. BACKGROUND AND MOTIVATION
We first study conventional on-chip router architectures with a brief evaluation of their performance, and then derive the motivation for our new router design using shared queues.

Volume: 03 Issue:


A. Wormhole router architectures
The router is the most important component in constructing an NoC system. Fig. 1 shows the traveling process of a flit through a WH router. First, when a flit arrives at an input port, it is written to the corresponding buffer queue. This step is called buffer write or Queue Write (QW). Assuming there are no other packets at the front of the queue, the packet starts deciding the output port for its next router (based on the destination information contained in its head flit) instead of for the present router; this is called Look-ahead Routing Computation (LRC). Simultaneously, it arbitrates for its output port at the present router, because there may be multiple packets from different input queues requesting the same output port. This step is called Switch Allocation (SA). If it wins the output SA, it traverses the crossbar. This step is called crossbar traversal or Switch Traversal (ST). Finally, it traverses the output link toward the next router. This step is called Link Traversal (LT).

Fig 2) Typical Router Architecture

B. VC Router architecture
A Virtual Channel Router (VCR) is a router which uses a source routing algorithm and wormhole network flow control with virtual channels. It is suitable for on-chip networks with two structural topologies. The data path of the router consists of buffers and a switch. In this VC router design, an input buffer has several queues in parallel, and each queue is called a VC. This permits packets from different queues to bypass each other and advance to the crossbar stage instead of being blocked by a packet at the head of the queue (although all queues at one input port can still be blocked if none of them wins SA or if all related output VC queues are full). Because an input port now has multiple VC queues, each packet has to select a VC of its next router's input port before arbitrating for the output switch. The routing operation comprises four steps, called Routing Computation (RC), Virtual Channel Allocation (VCA), Switch Allocation (SA), and Switch Traversal (ST); these are often implemented as four pipeline stages in modern virtual-channel routers. When a head flit (the first flit of a packet) arrives at an input channel, the router stores the flit in the buffer for the allocated virtual channel and finds the next-hop node for the packet. This is the RC stage. Given the next hop, the router then assigns a virtual channel in the next hop. This is the VCA stage.

C. Crossbar switch
A crossbar switch (cross-point switch, matrix switch) is a collection of switches arranged in a matrix configuration. A crossbar switch has many input and output lines that form a crossed pattern of interconnecting lines, between which a connection may be established by closing a switch located at each intersection, the elements of the matrix. Originally, a crossbar switch consisted literally of crossing metal bars that provided the input and output paths. Later implementations achieved the same switching topology in solid-state semiconductor chips. The cross-point switch is one of the principal switch architectures, together with the rotary switch, the memory switch, and the crossover switch.

A crossbar switch is an assembly of separate switches between a set of inputs and a set of outputs. The switches are arranged in a matrix. If the crossbar switch has M inputs and N outputs, then the crossbar has a matrix with M x N cross-points, the places where the connections cross. At each cross-point is a switch; when closed, it connects one of the inputs to one of the outputs. A given crossbar is a single-layer, non-blocking switch. Non-blocking means that existing concurrent connections do not prevent other inputs from being connected to other outputs. Collections of crossbars can be used to implement multi-layer and blocking switches. A crossbar switching system is also known as a coordinate switching system.

Fig 3) Crossbar switch

Wormhole Routing
SpaceWire routing switches employ wormhole routing. When a packet starts to arrive at an input port of a router, its destination address is examined immediately. If the output port that is to be used to forward the packet towards its destination is not currently in use, the head of the packet is sent to that output port straight away, with the rest of the packet following as it arrives at the input port. There is no buffering of complete packets in the router, neither before nor after switching the packets. Wormhole routing has several advantages over other approaches such as store and forward:
No packet buffering
Little buffer memory
Can support packets of arbitrary size
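The forwarding behaviour described above, where the head of a packet claims a free output port and the rest of the packet follows, can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the class name `WormholeSwitch`, the `forward` method, and the flit kinds `"head"`, `"body"`, and `"tail"` are hypothetical names chosen for illustration, and flow control between routers is not modelled.

```python
from typing import Dict, List, Optional, Tuple


class WormholeSwitch:
    """Toy model of one switch: an output port is held from head flit to tail flit."""

    def __init__(self, num_ports: int) -> None:
        # owner[p] is the packet currently holding output port p (None if free).
        self.owner: Dict[int, Optional[int]] = {p: None for p in range(num_ports)}
        # route[packet_id] is the output port reserved by that packet's head flit.
        self.route: Dict[int, int] = {}
        # Record of forwarded flits as (output port, packet id, flit kind).
        self.log: List[Tuple[int, int, str]] = []

    def forward(self, packet_id: int, kind: str, out_port: Optional[int] = None) -> bool:
        """Try to forward one flit; return False if the flit is blocked."""
        if kind == "head":
            if out_port is None:
                raise ValueError("a head flit must name its output port")
            if self.owner[out_port] is not None:
                return False  # port is in use by another packet: this packet waits
            self.owner[out_port] = packet_id
            self.route[packet_id] = out_port
        elif packet_id not in self.route:
            return False  # body/tail flit whose head has not been forwarded yet
        port = self.route[packet_id]
        self.log.append((port, packet_id, kind))
        if kind == "tail":
            # The tail flit releases the output port for other packets.
            self.owner[port] = None
            del self.route[packet_id]
        return True


switch = WormholeSwitch(num_ports=5)
switch.forward(1, "head", out_port=3)   # packet 1 claims port 3
switch.forward(2, "head", out_port=3)   # packet 2 is blocked until packet 1's tail passes
```

Because the port is only released by the tail flit, a second packet that needs the same port must wait for the whole first packet, regardless of its length.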

Wormhole routing suffers from one main problem, that of blocking. If the output port that the packet is to be forwarded through is not ready or is currently in use, the packet has to wait until the port is ready or the packet currently flowing through it has completed. Since the tail of a packet can be spread out through the network, not only is the waiting packet halted, but it also blocks other packets in the network that are waiting to use the links it currently occupies.

Fig 4) Packet Blocking

A long packet is being transferred from node 1 to node 5, shown in blue. Another packet, shown in red, wants to go from node 2 to node 5, but since the link from router 2 to node 5 is already in use, the red packet is blocked in router 2. A third packet, shown in yellow, wants to go from node 3 to node 6. This does not use any of the links occupied by the first packet (blue), but it is blocked by the waiting packet (red) in router 1, since it has to travel over a link from router 1 to router 2. Once there is a blockage its effect can multiply, causing further blockage throughout the network.

There are some strategies that help to avoid blocking a network:
Split large packets up into many smaller ones, e.g. an image could be sent as a series of image lines.
Make sure that the destination is ready before sending the packet, which can be done using an end-to-end flow control mechanism.
If the destination is not ready to receive a packet it can simply throw the packet away. This can be combined with a retry mechanism to implement flow control, although it might result in inefficient use of network bandwidth if the destination is often not ready.
If a packet does get blocked for longer than might be expected, indicating that a fault has occurred, detect this using a watchdog timer and discard the blocked packet.

3. RoShaQ: ROUTER ARCHITECTURE WITH SHARED QUEUES
A. RoShaQ Architecture
RoShaQ is a router architecture based on shared queues.
When an input port receives a packet, it calculates the packet's output port for the next router (look-ahead routing), and at the same time it arbitrates for both its decided output port and the shared queues. If it receives a grant from the output port allocator (OPA), it advances to its output port in the next cycle. Otherwise, if it receives a grant to a shared queue, it is written to that shared queue in the next cycle. If it receives both grants, it gives priority to advancing to the output port.

The shared-queues allocator (SQA) receives requests from all input queues and grants their packets permission to access non-full shared queues. Packets from input queues are allowed to write to a shared queue only if: 1) the shared queue is empty, or 2) the shared queue contains packets having the same output port as the requesting packet. This shared-queue writing policy guarantees that the network is deadlock-free, as explained in Section 3-B. The OPA receives requests from both input queues and shared queues. Both SQA and OPA grant these requests in a round-robin manner to guarantee fairness and to avoid starvation and livelock.

Input-queue, output-port, and shared-queue state registers maintain the status (idle, wait, or busy) of all queues and output ports, and cooperate with the SQA and OPA to control the overall operation of the router. Only the input queues of RoShaQ have routing computation logic, because packets in the shared queues were written from input queues and therefore already carry their output port information. RoShaQ has the same I/O interface as a typical router, meaning the same number of I/O channels with flit-level flow control and credit-based backpressure management.

Fig 5) RoShaQ router microarchitecture.

B. RoShaQ's Properties
1) A network of RoShaQ routers is deadlock-free. At light load, packets normally bypass the shared queues, so RoShaQ acts as a WH router and hence the network is deadlock-free [23].
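The allocation rules of subsection A, on which this deadlock argument relies, can be sketched in Python as follows. This is a simplified illustration rather than the authors' hardware design: the names `SharedQueue`, `can_accept`, and `allocate` are hypothetical, each queued packet is represented only by its output port number, and the round-robin arbitration among competing requests is deliberately left out.

```python
from collections import deque
from typing import Deque, List


class SharedQueue:
    """A shared queue that stores, for each queued packet, its output port."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.packets: Deque[int] = deque()

    def can_accept(self, out_port: int) -> bool:
        if len(self.packets) >= self.capacity:
            return False  # only non-full shared queues may be granted
        # Write policy: the queue is empty, or it already holds packets bound
        # for the same output port (all entries share one port, so checking
        # the first entry is sufficient).
        return not self.packets or self.packets[0] == out_port


def allocate(out_port: int, port_free: List[bool], shared: List[SharedQueue]) -> str:
    """Decide where the packet at the head of an input queue goes next cycle."""
    if port_free[out_port]:
        # An output-port grant takes priority over any shared-queue grant.
        return "output_port"
    for index, queue in enumerate(shared):
        if queue.can_accept(out_port):
            queue.packets.append(out_port)
            return f"shared_queue_{index}"
    return "wait"  # no grant this cycle; the packet stays in its input queue


port_free = [True, False, True, True, True]
queues = [SharedQueue(capacity=2), SharedQueue(capacity=2)]
allocate(0, port_free, queues)  # port 0 is free, so the packet bypasses the shared queues
allocate(1, port_free, queues)  # port 1 is busy, so the packet goes to an empty shared queue
```

Keeping only one output port per shared queue is what lets a loaded router behave like an output-buffered router in this sketch.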

At heavy load, if a packet cannot win the output port, it is allowed to write only to a shared queue which is empty or contains packets having the same output port. In this case RoShaQ acts as an output-buffered router, which is also shown to be deadlock-free [24].
2) A network of RoShaQ routers is livelock-free. Because both OPA and SQA use round-robin arbiters, each packet always has a chance to advance to the next router closer to its destination; hence the network is also free from livelock.
3) RoShaQ supports any adaptive routing algorithm. The output port for each packet is computed only at its input queue, not at the shared queues. Therefore, any adaptive routing algorithm which works for WH routers also works for RoShaQ.
4) RoShaQ can be used for any network topology. If we hide all design details inside RoShaQ, we see that RoShaQ has only one buffer queue at each input port, similar to a WH router. Therefore, we can change the number of RoShaQ's I/O ports to make it compatible with any network topology known in the literature, along with an appropriate routing algorithm.

4. EXPERIMENTAL RESULTS
A. Latency and Throughput
In a data network, latency is the time a particular packet takes to reach its destination from its source. The term delay is similar to latency. Popular tools such as ping and traceroute can be used to measure the delay or latency of a link or connection. Usually this is done by measuring the time a packet takes to reach the destination from the source. Latency can be high if there is heavy congestion in the traffic, and it can also be caused by errors and distance. Jitter is the variation in delay from one packet transfer to the next. Network throughput is the amount of data that can traverse a given medium per unit of time. It is measured in bits per second (bps). Throughput can be high or low depending on the network infrastructure.
Devices such as routers, switches, firewalls, cables, and network cards can have a significant impact on network throughput. Faster devices and cables generally increase network throughput.

We conducted the experiments over eight common synthetic traffic patterns which cover a wide range of interconnect patterns on 2-D mesh networks [21]. For uniform random traffic, each source processor chooses its destination randomly with a uniform distribution on a packet-by-packet basis. For other patterns, the destination of each source node is decided based on the location of the source as follows:
Neighbour: the destination is randomly chosen among the four nearest neighbors of the source with a probability of 80%, and randomly among the other processors with a probability of 20%;
Regional: the destination is randomly chosen among processors at a distance of at most three from the source with a probability of 70%, and randomly among the other processors with a probability of 30%.

B. Real Application Communication Traffic
Network traffic is the amount of data moving across a network at a given point of time.[1] Network data in computer networks is mostly encapsulated in network packets, which provide the load in the network. Network traffic is the main component for network traffic measurement, network traffic control, and simulation.
Network traffic control - managing, prioritizing, controlling or reducing the network traffic
Network traffic measurement - measuring the amount and type of traffic on a particular network
Network traffic simulation - measuring the efficiency of a communications network
Traffic generation model - a stochastic model of the traffic flows or data sources in a communication computer network

The general definitions of the terms are as follows: Error detection is the detection of errors caused by noise or other impairments during transmission from the transmitter to the receiver.
Error correction is the detection of errors and reconstruction of the original, error-free data.

Error Detection
Errors in received frames are detected by means of a parity check or a Cyclic Redundancy Check (CRC). In both cases, a few extra bits are sent along with the actual data to confirm that the bits received at the other end are the same as those that were sent. If the check at the receiver fails, the bits are considered corrupted.

One extra bit is sent along with the original bits to make the number of 1s either even, in the case of even parity, or odd, in the case of odd parity. The sender counts the number of 1s in a frame while creating it. For example, if even parity is used and the number of 1s is even, a bit with value 0 is added, so the number of 1s remains even. If the number of 1s is odd, a bit with value 1 is added to make it even.

Fig 6) Conversion of data bits into parity bits

There are two variants of parity bits:
1. Even parity bit
2. Odd parity bit

In the case of even parity, the bits whose value is 1 are counted for a given set of bits. If that count is odd, the parity bit is set to 1, making the total count of 1s in the whole set (including the parity bit) an even number. If the count of 1s in the given set of bits is already even, the parity bit remains 0. In the case of odd parity, the situation is reversed. For a given set of bits, if the count of bits with a value of 1 is even, the parity bit is set to 1, making the total count of 1s in the whole set (including the parity bit) an odd number. If the count of bits with a value of 1 is odd, the count is already odd, so the parity bit remains 0. Even parity is a special case of a cyclic redundancy check (CRC), where the 1-bit CRC is generated by the polynomial x+1. If the parity bit is present but not used, it may be referred to as mark parity (when the parity bit is always 1) or space parity (when the bit is always 0).

The receiver simply counts the number of 1s in a frame. If the count of 1s is even and even parity is used, the frame is considered not corrupted and is accepted. Likewise, if the count of 1s is odd and odd parity is used, the frame is considered not corrupted. If a single bit flips in transit, the receiver can detect it by counting the number of 1s. When more than one bit is in error, however, detection becomes unreliable: any even number of flipped bits leaves the parity unchanged and goes undetected.

Fig. 7) Error detection and error correction

Cyclic Redundancy Check (CRC)
CRC is a different approach to checking whether the received frame contains valid data. The technique involves binary division of the data bits being sent. The divisor is generated using polynomials. The sender performs the division operation on the bits being sent and calculates the remainder. Before sending the actual bits, the sender appends the remainder to the end of the actual bits. The actual data bits plus the remainder are called a codeword. The sender transmits the data bits as a codeword. At the other end, the receiver performs the division operation on the codeword using the same CRC divisor. If the remainder is all zeros the data bits are accepted; otherwise it is assumed that some data corruption occurred in transit.

Error Correction
In the digital world, error correction can be done in two ways:
Backward error correction: When the receiver detects an error in the data received, it asks the sender to retransmit the data unit.
Forward error correction: When the receiver detects an error in the data received, it applies an error-correcting code, which allows it to recover automatically and to correct some kinds of errors.

The first one, backward error correction, is simple and can only be used efficiently where retransmission is not expensive, for example over fiber optics. In the case of wireless transmission, retransmission may cost too much, and forward error correction is used instead.

To correct an error in a data frame, the receiver must know exactly which bit in the frame is corrupted. To locate the bit in error, redundant bits are used as parity bits for error detection. For example, for ASCII words (7 data bits) there are 8 cases to distinguish: seven to indicate which bit is in error and one more to indicate that there is no error. For m data bits, r redundant bits are used. The r bits can provide $2^r$ combinations of information. In an (m+r)-bit codeword, the r bits themselves may also get corrupted. So the r bits must be able to identify any of the m+r bit locations plus the no-error case, i.e. $2^r \geq m + r + 1$.

5. SIMULATION RESULTS
ModelSim is an application that integrates with Xilinx ISE to provide simulation and testing tools. Two kinds of simulation are used for testing a design: functional simulation and timing simulation. Functional simulation is used to make sure that the logic of a design is correct. Timing simulation also takes into account the timing properties of the logic and the FPGA, so it shows how long signals take to propagate and confirms that the design will behave as expected when it is downloaded onto the FPGA.

Fig. 8) Output
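As a software-level illustration of the parity and CRC checks described in Section 4 (not the simulated hardware design itself), the following Python sketch performs the modulo-2 division by hand. The function names and the bit-string representation are assumptions made for this example, and the divisor `1011` is an arbitrary example polynomial rather than one specified in the paper.

```python
from typing import List


def _mod2_remainder(bits: List[int], divisor: List[int]) -> List[int]:
    """Remainder of modulo-2 (XOR) long division of bits by divisor."""
    work = list(bits)
    r = len(divisor) - 1
    for i in range(len(work) - r):
        if work[i] == 1:
            for j, d in enumerate(divisor):
                work[i + j] ^= d
    return work[-r:]


def crc_remainder(data: str, divisor: str) -> str:
    """Sender side: divide the data (padded with r zeros) and return the remainder."""
    r = len(divisor) - 1
    padded = [int(b) for b in data] + [0] * r
    return "".join(str(b) for b in _mod2_remainder(padded, [int(b) for b in divisor]))


def make_codeword(data: str, divisor: str) -> str:
    """Codeword = data bits followed by the CRC remainder."""
    return data + crc_remainder(data, divisor)


def is_valid(codeword: str, divisor: str) -> bool:
    """Receiver side: accept only if the remainder of the codeword is all zeros."""
    rem = _mod2_remainder([int(b) for b in codeword], [int(b) for b in divisor])
    return not any(rem)


def even_parity_bit(data: str) -> str:
    """Parity bit that makes the total number of 1s even."""
    return str(data.count("1") % 2)


def min_redundant_bits(m: int) -> int:
    """Smallest r with 2**r >= m + r + 1, as required to locate a single bit error."""
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r


codeword = make_codeword("11010011101100", "1011")
corrupted = codeword[:4] + ("1" if codeword[4] == "0" else "0") + codeword[5:]
print(is_valid(codeword, "1011"), is_valid(corrupted, "1011"))
```

With the divisor `11` (the polynomial x+1), `crc_remainder` returns the same bit as `even_parity_bit`, which matches the statement that even parity is a 1-bit CRC. For 7 data bits, `min_redundant_bits(7)` gives 4, since 2^3 = 8 is smaller than 7 + 3 + 1 = 11.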

6. RELATED WORK
Nicopoulos et al. [17] proposed ViChaR, a router architecture which allows packets to share flit slots inside buffer queues so that it can achieve higher throughput. Our paper manages buffers at a coarser grain, at queue level rather than at flit level, which allows reusing an existing generic queue design and makes the buffer and router design much simpler and more straightforward. Ramanujam et al. [21] Latif et al. [10] implemented a router with input ports sharing all queues, which is similar to the architecture shown in Fig. 6(a). Its implementation on FPGA is shown to be more power- and area-efficient than typical input VC routers. A similar approach is proposed by Tran et al. [20]; due to the high complexity of its allocators and also inter-router round-trip request/grant signaling, however, its performance is actually poorer than that of a typical router.

7. CONCLUSION
We presented RoShaQ, a novel router architecture that allows sharing multiple buffer queues to improve network throughput and latency. The design can transfer more data than previously published designs. Errors that occur during the transmission of packets can be detected and corrected. In addition, the speed of data transmission can be increased and the delay reduced. RoShaQ achieved greater improvements in application performance and energy consumption than other routers.

REFERENCES
[1] L. Benini and G. D. Micheli, Networks on Chips: A New SoC Paradigm, IEEE Computer,
[2] A. A. Chien, A cost and speed model for k-ary n-cube wormhole routers, in Proceedings of Hot Interconnects, 1993.
[3] P. Coe, F. Howell, R. Ibbett, and L. Williams, A Hierarchical Computer Architecture Design and Simulation Environment, ACM Transactions on Modeling and Computer Simulation, 8(4), October Communication Networks. In P. Losleben, editor.
[4] A. Bianco, P. Giaccone, G. Masera, and M. Ricca, Power control for crossbar-based queued switches, IEEE Trans. Comput., vol. 62, no. 1, pp , Jan
[5] S. T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, Matching output queueing with a combined input/output-queued switch, IEEE J. Selected Areas Commun., vol. 17, no. 6, pp , Jun
[6] M. Galles, Spider: A high-speed network interconnect, IEEE Micro, vol. 17, no. 1, pp , Jan
[7] D. Gebhardt, J. You, and K. S. Stevens, Design of an energy-efficient asynchronous NoC and its optimization tools for heterogeneous SoCs, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 9, pp , Sep
[8] O. He, S. Dong, W. Jang, J. Bian, and D. Z. Pan, UNISM: Unified scheduling and mapping for general networks on chip, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 8, pp , Aug
[9] M. G. Hluchyj and M. J. Karol, Queueing in high-performance packet switching, IEEE J. Sel. Areas Commun., vol. 6, no. 9, pp , Dec
[10] J. Hu and R. Marculescu, Energy- and performance-aware mapping for NoC architecture, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 4, pp , Apr
[11] H. Ito, M. Kimura, K. Miyashita, T. Ishii, K. Okada, and K. Masu, A bi-directional and multi-drop transmission-line interconnect for multipoint-to-multipoint on-chip communications, IEEE J. Solid-State Circuits, vol. 43, no. 4, pp , Apr
[12] J. Kim, C. Nicopoulos, P. Dongkook, V. Narayanan, M. S. Yousif, and C. R. Das, A gracefully degrading and energy-efficient modular router architecture for on-chip networks, in Proc. 33rd IEEE/ACM ISCA, Jun. 2006, pp
[13] A. Kumar, L. S. Peh, P. Kundu, and N. K. Jha, Towards ideal on-chip communication using express virtual channels, IEEE Micro, vol. 28, no. 1, pp , Jan
[14] Y. C. Lan, H. A. Lin, S. H. Lo, Y. H. Hu, and S. J. Chen, A bidirectional NoC (BiNoC) architecture with dynamic self-reconfigurable channel, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 3, pp , Mar
[15] B. Lin and I. Keslassy, The concurrent matching switch architecture, in Proc. IEEE Comput. Commun. Soc. (INFOCOM), Apr
[16] R. Mullins, A. West, and S. Moore, Low-latency virtual-channel routers for on-chip networks, in Proc. 31st ISCA, Mar. 2004, p
[17] C. A. Nicopoulos, P. Dongkook, K. Jongman, N. Vijaykrishnan, M. S. Yousif, and C. R. Das, ViChaR: A dynamic virtual channel regulator for network-on-chip routers, in Proc. 39th IEEE/ACM Int. Symp. Microarchitect. (MICRO), Dec. 2006, pp
[18] G. Passas, M. Katevenis, and D. Pnevmatikatos, A Gb/s crossbar interconnecting 128 tiles in a single hop and occupying 6% of their area, in Proc. NOCS, 2010, pp
[19] L. Peh and W. J. Dally, A delay model and speculative architecture for pipelined routers, in Proc. Int. Symp. HPCA, Jan. 2001, pp
[20] A. Prakash, Randomized parallel schedulers for switch-memory-switch routers: Analysis and numerical studies, in Proc. IEEE INFOCOM, vol. 3, Mar. 2004, pp
[21] R. S. Ramanujam, V. Soteriou, B. Lin, and L.-S. Peh, Extending the effective throughput of NoCs with distributed shared-buffer routers, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 4, pp , Apr
[22] D. Seo, A. Ali, W.-T. Lim, and N. Rafique, Near-optimal worst-case throughput routing for 2-D mesh networks, in Proc. 32nd IEEE/ACM ISCA, Jun. 2005, pp
