IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1


Reconfigurable Routers for Low Power and High Performance

Débora Matos, Student Member, IEEE, Caroline Concatto, Student Member, IEEE, Márcio Kreutz, Fernanda Kastensmidt, Member, IEEE, Luigi Carro, Member, IEEE, and Altamiro Susin, Member, IEEE

Abstract: Network-on-chip (NoC) designs are based on a compromise among latency, power dissipation, and energy, and the balance is usually defined at design time. However, setting all parameters, such as buffer size, at design time can cause either excessive power dissipation (caused by router underutilization) or higher latency. The situation worsens whenever the application changes its communication pattern, e.g., when a portable phone downloads a new service. Large buffer sizes can ensure performance during the execution of different applications, but unfortunately these same buffers are mainly responsible for the total power dissipation of the router. Moreover, sizing buffers for the worst-case latency incurs extra dissipation in the mean case, which is much more frequent. In this paper we propose the use of a reconfigurable router, where the buffer slots are dynamically allocated to increase router efficiency in an NoC, even under rather different communication loads. In the proposed architecture, the depth of each buffer used in the input channels of the routers can be reconfigured at run time. The reconfigurable router allows up to 52% power savings while maintaining the same performance as a homogeneous router, but using a 64% smaller buffer size.

Index Terms: Buffer, latency, network-on-chip (NoC), power dissipation, reconfigurable router.

I. INTRODUCTION

MULTIPROCESSOR SYSTEM-ON-CHIPS (MPSoCs) are emerging as one of the technologies that support the growing design complexity of embedded systems, since they provide processor architectures adapted to selected problem classes, allied to programming flexibility.
Manuscript received June 22, 2009; revised November 27, 2009 and May 08, 2010; accepted July 25. D. Matos, C. Concatto, F. Kastensmidt, L. Carro, and A. Susin are with the Department of Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Rio Grande do Sul, Brazil (debora.matos@inf.ufrgs.br; cconcatto@inf.ufrgs.br; fglima@inf.ufrgs.br; carro@inf.ufrgs.br; susin@inf.ufrgs.br). M. Kreutz is with the Departamento de Informática e Matemática Aplicada (DIMAP), Federal University of Rio Grande do Norte (UFRN), Natal, Brazil (kreutz@dimap.ufrn.br). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TVLSI

To ensure flexibility and performance, future MPSoCs will combine several types of processor cores and data memory units of widely different sizes, leading to a very heterogeneous architecture. The increasing interconnection complexity and the known scalability deficiency of buses require another model of interconnection. The communication among cores of an MPSoC having reusable and scalable interconnections is being provided by networks-on-chip (NoCs) [1]. NoCs have been proposed to integrate several Intellectual Property (IP) cores, providing high communication bandwidth and parallelism. Azimi et al. [2] affirm that it is necessary to find a way to keep the off-die bandwidth manageable in system architectures with tradeoffs among cost, power, and performance. Moreover, in a hardware context, the system must offer flexibility with high bandwidth, low latency, and power efficiency. The interconnection fabric allows cores to access memory and to communicate with each other and with the rest of the system. Manferdelli et al. [3] state that, to guarantee the increase in performance of general-purpose CPUs, one needs to use massively parallel computing.
For this, more independent CPUs, bigger caches, and more independent memory controllers have been used, and it is possible to find many applications that use heterogeneous processors with several memory controllers to provide a large memory interface. One example of such an architecture is the Xbox 360 [4]. Fig. 1(a) shows a system block diagram of the Xbox, a platform with several cores, each core having a specific throughput and bandwidth. Another example of mixed communication requirements is shown in Fig. 1(b). According to [3], there is a clear difference between the traffic among out-of-order cores (OoCs) and the traffic among in-order cores (IoCs) in an SoC. OoCs are larger and have worse power performance than IoCs. Besides, there is more communication among IoCs than among OoCs; thus the IoCs need different interconnection characteristics among themselves in order to guarantee a higher communication bandwidth among IoC devices, since their communication with OoCs occurs on a much smaller scale. From the above MPSoC examples one can see that, as happens in the microprocessor market, each NoC used in an MPSoC can have different communication performance and costs, depending on the target application. Designing the same NoC router to cover the whole spectrum of applications would mean an oversized and expensive router in terms of area and power. At the same time, designing specific routers for different markets would mean that many important decisions would have to be taken at design time, hence precluding scalability and online optimizations targeting different behaviors with different application demands. In an NoC, several items can vary from design to design, such as the depth of the first-in first-out (FIFO) buffers, the router topology, the switch, and the arbiter [5]. In this manner, decisions such as throughput, latency and bandwidth are defined as a


modification of the NoC architecture, most of the time at design time, in an attempt to guarantee the performance of the system. However, whenever the product needs an update or has to change its functionality, a large change in the communication pattern will most likely be observed, and hence decisions made at design time would mean either a loss in performance or excessive power dissipation. Considering the NoC components, such as crossbars, arbiters, buffers, and links, in the experiments carried out in [6] the buffers were the largest leakage power consumers, dissipating approximately 64% of the whole power budget. For this reason, the buffers were considered as candidates for leakage power optimization, since even at high loads 85% of the buffers were still idle [6]. Regarding dynamic power, the buffer consumption is also high, and it increases rapidly as the packet flow throughput increases [7]. In this paper, we first characterize the need for a reconfigurable router and show how an NoC built with reconfigurable routers allows the use of buffers with smaller depths. These in turn achieve a performance similar to that of an NoC using a fixed-size router, which needs a larger buffer depth and incurs higher power dissipation than the design presented here. Our particular contribution aims at providing the router with a certain amount of reconfiguration logic, allowing changes in the amount of buffer used in each input channel, in conformity with the communication needs. The principle is that each input channel can lend/borrow buffer units to/from neighboring channels in order to obtain a determined bandwidth. When a channel does not need its entire available buffer, it can lend buffer word slots to neighboring channels. Results show the inefficiency in the amount of buffers used within a homogeneous router, and the gains that can be achieved using the proposed strategy. We focus on providing a reconfigurable router that can optimize power and improve energy usage while sustaining high performance, even when the application changes the communication pattern. Moreover, experiments compare favorably with other dynamic approaches such as virtual channels. This paper is organized as follows. Section II presents an analysis of the problem and identifies low efficiency in homogeneous routers. The reconfigurable router is proposed in Section III, where we describe the differences between the original and the new router architecture. In Section IV, we present results of latency, buffer utilization, frequency, area, and power consumption. In Section V, we show some related works and a specific comparison of our proposal with virtual channels, and finally the conclusions are shown in Section VI.

Fig. 1. (a) Xbox 360 block diagram. (b) Example of different cores used in an SoC.

II. QUANTITATIVE ANALYSIS OF THE PROBLEM

We simulated four examples of real applications to analyze the homogeneous router behavior. The applications used were MPEG4, VOPD [17], multi-window display (MWD) [18], and Xbox [4], all with 12 cores but with different communication patterns, as represented by the bandwidth of each link depicted in Fig. 2. In this figure, arcs in the MPEG4, VOPD, and MWD applications show the rates in MB/s, while arcs in the Xbox application show the rates in GB/s. A cycle-accurate traffic simulator in Java was utilized to evaluate the network hotspots and the average latency using the reconfigurable and original routers. The distribution of the cores in the NoC was specified in accordance with the communication needs of the cores, reflecting a design-time choice based on the original application. In this work, we measured the efficiency of the NoC routers in terms of buffer usage, in accordance with the needs of the application. Equation (1) represents the efficiency results according to the buffer usage.
It indicates how many of the available buffer slots are effectively used in each channel of the NoC. The reference value has been obtained by using Algorithm RBLA, presented later in this paper, which can give the best buffer distribution for a certain performance level. From the ideal buffer depth achieved for each channel (which represents the buffer depth used in the reconfigurable router using 100% of its buffers), we verified how many buffers were not used when one has a homogeneous router designed to reach the same performance as the reconfigurable router. The four applications MPEG4, VOPD, MWD, and Xbox were mapped to a 4 × 3 NoC with homogeneous buffer sizes. The network used is a 2-D mesh with XY-routing algorithm, handshake flow control and wormhole-switching mechanism [19]. Each input channel has a FIFO for storage of the flits. The FIFO size is defined at design time, and all channels have the

same FIFO size; for this reason, we call it a homogeneous router.

Fig. 2. (a) MPEG4, (b) VOPD, (c) MWD, and (d) Xbox task graphs.

Using (1), we obtained the efficiency results shown in Fig. 3. Fig. 3(a) presents the efficiency considering all channels of the network with the same buffer depth. Fig. 3(b) shows the efficiency using a heterogeneous NoC, where each router might have a different number of buffers compared to other routers, but all channels inside the same router have the same number of buffers. In both cases, one can observe that the routers use excessive buffers in some channels compared to others, since not all channels present the same communication rate. In such cases, the extra buffer units of the channel consume unnecessary area and power. One can see in Fig. 3(a) that, using a homogeneous router with the buffers sized for the best performance case, only around 33% of the buffer slots are utilized (low efficiency). Similarly, in Fig. 3(b) only 54% of the buffer slots are used on average. The unused slots nevertheless consume power without contributing to the reduction of the latency or of the number of hotspots in the network. Usually, at design time the buffers are sized to guarantee that all channels in a router will have low latency, or to guarantee less power consumption and area. This means that each FIFO buffer might have a different number of flip-flops to provide performance and/or optimal power consumption and/or area for that specific link. However, when an optimal point is defined at design time and the application is later changed, the latency and the power consumption will probably increase, since in some links there might not be enough buffer words to ensure quality of service (low latency, high throughput, a required bit rate, delay, jitter). In Section III, we present the proposed reconfigurable router, where the NoC efficiency can be increased thanks to the possibility of reconfiguring the buffer size according to the requirements of each channel of the router at run time, without the need to oversize buffers to guarantee performance.

Fig. 3. Efficiency results (a) using the same buffer depth for all channels of the NoC and (b) using the same buffer depth for all channels of the same router, but with a different buffer depth specified for each router.

Fig. 4. Input FIFO (a) original and (b) proposed router.

III. PROPOSED ROUTER ARCHITECTURE

A. Original Router Architecture

The proposed router architecture was embedded in the SoCIN NoC [19]. SoCIN has a regular 2-D mesh topology and a parametric router architecture. The router architecture used is RaSoC, which is a routing switch with up to five bidirectional ports (Local, North, South, West, and East), each port with two unidirectional channels and each router connected to four neighboring routers (North, South, West, and East). This router is a VHDL soft core, parameterized in three dimensions: communication channel width, input buffer depth, and routing information width. The architecture uses the wormhole switching approach and a deterministic source-based routing algorithm. The routing algorithm used is XY-routing, capable of supporting deadlock-free data transmission, and the flow control is based on the handshake protocol. The wormhole strategy breaks a packet into multiple flow control units called flits, which are sized as an integral multiple of the channel width. The first flit is a header with the destination address, followed by a set of payload flits and a tail flit. Two bits of each flit are used to indicate the flit type (header, payload, or tail). There is a round-robin arbiter at each output channel.
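The flit framing described above can be illustrated with a minimal Python sketch. The 2-bit tag values, the placement of the tag in the two most significant bits, the assumption that the tail flit carries the last data word, and the names `FlitType`, `make_packet`, and `encode` are all hypothetical; the source only states that two bits of each flit mark it as header, payload, or tail.

```python
from enum import IntEnum

class FlitType(IntEnum):
    # Hypothetical 2-bit encodings; the paper does not give the actual values.
    PAYLOAD = 0b00
    HEADER = 0b01
    TAIL = 0b10

DATA_BITS = 16  # assumed data width, taken from the 16-bit link of Section IV

def make_packet(dest, payload):
    """Return a list of (type, data) flits: header, payload flits, tail."""
    if not payload:
        raise ValueError("need at least one data word to place in the tail flit")
    mask = (1 << DATA_BITS) - 1
    flits = [(FlitType.HEADER, dest & mask)]           # header carries the destination
    flits += [(FlitType.PAYLOAD, w & mask) for w in payload[:-1]]
    flits.append((FlitType.TAIL, payload[-1] & mask))  # assumption: tail carries data
    return flits

def encode(flit):
    """Pack one flit into DATA_BITS + 2 bits, type tag in the two top bits."""
    kind, data = flit
    return (int(kind) << DATA_BITS) | data
```

For instance, `make_packet(0x0203, [1, 2, 3])` yields one header, two payload flits, and one tail, which is the sequence a wormhole router relies on to open and close a path.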
The buffering is present only at the input channel. Each flit is stored in a FIFO buffer unit. The same input channel design is instantiated for all channels of the NoC, and thus all channels have the same buffer depth defined at design time.

B. Reconfigurable Router Architecture

If an NoC router has a larger FIFO buffer, the throughput will be larger and the latency in the network smaller, since fewer flits will be stalled in the network [20]. Nevertheless, there is a limit on the increase of the FIFO depth. Since each communication has its peculiarities, sizing the FIFO for the worst-case communication scenario will compromise not only the routing area, but power as well [6]. However, if the router has a small FIFO depth, the latency will be larger, and quality of service (QoS) can be compromised. The proposed solution is to have a heterogeneous router, in which each channel can have a different buffer size. In this situation, if a channel has a communication rate smaller than its neighbor, it may lend some of its buffer slots that are not being used. In a different communication pattern, the roles may be reversed or changed at run time, without a redesign step. The proposed architecture is able to sustain performance due to the fact that, statistically, not all buffers are used all the time. In our architecture it is possible to dynamically reconfigure different buffer depths for each channel. A channel can lend part or all of its buffer slots in accordance with the requirements of the neighboring buffers. To reduce connection costs, each channel may only use the available buffer slots of its right and left neighbor channels. This way, each channel may have up to three times as many buffer slots as its original buffer, whose size is defined at design time. Fig. 4 shows the original and proposed input FIFO. Compared to the original architecture, the new proposal uses more multiplexers to allow the reconfiguration process. Fig. 4(b) presents the South Channel as an example. As the figure shows, each channel has five multiplexers, two of which are responsible for controlling the input and output of data. These two multiplexers have a fixed size, independent of the buffer size. The other three multiplexers are necessary to control the read and write process of the FIFO. The size of the multiplexers that control the buffer slots increases with the depth of the buffer. These multiplexers are controlled by the FSM of the FIFO. In order to reduce routing and extra multiplexers, we adopted the strategy of changing the control part of each channel.

Fig. 5. Proposed router architecture.

Some rules were defined in order to enable the use of buffers from one channel by other adjacent channels. When a channel fills its whole FIFO, it can borrow more buffer words from its neighbors. First the channel asks the right neighbor for buffer words, and if it still needs more buffers, it tries to borrow from the left neighbor FIFO. In this manner, some signals of each channel must be sent to the neighboring channels in order to control its stored flits. As a result, each channel needs to know how many buffer words it uses of its own channel and of the neighboring channels, and also how much of its own buffer set is occupied by the neighbor channels. A control block provides this number. Then, based on this information, each channel controls the storage of its flits. These flits can be stored in its own buffer slots or in the buffer slots of a neighbor channel. Each input port has a control to store the flits, and this control is based on pointers. Each input channel needs six pointers to control the read and write process: two pointers to control its own buffer slots, two pointers to control the left neighbor buffer slots, and two more pointers to control the right neighbor buffer slots (in each case, one pointer for the read operation and one pointer for the write operation). In this design, we do not consider the possibility of the Local Channel using neighboring buffers; only the South, North, West, and East Channels of a router can make use of their adjacent neighbors. As mentioned before, the loan granularity used in this proposal is one buffer slot. The area results of the reconfigurable router would not change significantly if the loan granularity were increased. This is due to the fact that the control overhead is defined mainly by the FIFO's control circuit.
As the buffers are implemented using circular FIFOs, the FIFO pointers are incremented for each new slot, and this control will be the same whatever loan granularity is used. If we increase the loan granularity to more than one slot, then the loss in performance could be large, and the reduction in area or power would be minimal. In addition, we consider sharing of the buffer slots only among adjacent channels. This decision is based on the costs of interconnections, multiplexers, and logic to control the combination of all loans among all input channels. The area and power consumption would be much larger if loans were allowed among all input channels, and the gains in performance would not be large enough to compensate for this extra cost. Fig. 5 shows the channel of Fig. 4(b) organized to constitute the reconfigurable router. Each channel can receive three data inputs. Let us consider the South Channel as an example, having the following inputs: its own input (din S), the right neighbor input (din E), and the left neighbor input (din W). For illustration purposes, let us assume we are using a router with buffer depth equal to 4, and there is a router that needs to be configured as follows: South Channel with buffer depth equal to 9, East Channel with buffer depth equal to 2, West Channel with buffer depth equal to 1, and North Channel with buffer depth equal to 4. In such a case, the South Channel needs to borrow buffer slots from its neighbors. As the East Channel occupies two of its four


slots, this channel can lend two slots to its neighbor, but even then, the South Channel still needs three more buffer slots. As the West Channel occupies only one slot, the three missing slots can be lent to the South Channel. When the South Channel has a flit stored in the East Channel, and this flit must be sent to the output, it is passed from the East Channel to the South Channel (d E S), and the flit is then sent directly to the output of the South Channel (dout S) by a multiplexer. The South Channel has the following outputs: its own output (dout S) and two more outputs (d S E and d S W) to send the flits stored in its channel but belonging to neighbor channels. We chose to return the flits stored in a neighbor channel to their own channel before sending them to the output in order to avoid changes in other mechanisms of the architecture. In this manner we did not change the routing algorithm, avoiding the possibility of data deadlock, since the NoC continues using XY routing, which is intrinsically deadlock free. This choice also reduced the complexity of implementing a correctly functioning router. Each flit stored in a neighbor channel returns to its respective channel when it needs to be sent to an output channel. In this case, when an input channel is connected to an output channel, the flits are sent one by one, and the pointers are updated as each flit is sent. As each channel knows how many buffer slots it has allocated, when the pointers present an address belonging to a neighbor buffer slot, the control of the first multiplexer of Fig. 4(b) allows the sending of the respective flits to the output of its channel.

Fig. 6. (a) Router designed with FIFO depth 4. (b) One example of a configuration need of the router. (c) Reconfiguration of the buffers to meet the need.
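The borrowing example above (design depth 4; South needs 9, East 2, West 1, North 4) can be reproduced with a short Python sketch of the right-first, then-left rule. The ring order `North, East, South, West` and the function name `plan_loans` are assumptions made for illustration; the source only fixes that, for the South Channel, East is the right neighbor and West is the left one.

```python
# Assumed adjacency ring: for each channel, "right" is the previous entry and
# "left" is the next one, so that South has East on its right and West on its left.
CHANNELS = ["North", "East", "South", "West"]

def plan_loans(demand, depth):
    """Return {(borrower, lender): slots} for one router.

    demand[ch] is the buffer depth a channel needs; depth is the physical
    FIFO depth fixed at design time for every channel.
    """
    # Slots each channel can spare after covering its own demand.
    free = {ch: max(depth - demand[ch], 0) for ch in CHANNELS}
    loans = {}
    for i, ch in enumerate(CHANNELS):
        missing = max(demand[ch] - depth, 0)
        right = CHANNELS[(i - 1) % len(CHANNELS)]
        left = CHANNELS[(i + 1) % len(CHANNELS)]
        for lender in (right, left):          # ask the right neighbor first
            take = min(missing, free[lender])
            if take:
                loans[(ch, lender)] = take
                free[lender] -= take
                missing -= take
    return loans

example = plan_loans({"North": 4, "East": 2, "South": 9, "West": 1}, depth=4)
# South borrows 2 slots from East and 3 from West: 4 + 2 + 3 = 9.
```

The sketch only computes a static loan plan; in the router itself the loans are taken slot by slot through the pointer and multiplexer logic of Fig. 4(b).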
As we do not change the routing policy, there is no possibility of entering a deadlock situation. Of course, one could be concerned about one channel asking for buffers from another channel which is also asking for buffers. Since only the neighbors are asked about lending/borrowing, no cycle can be formed, and hence at the circuit level there is also no possibility of deadlock. Fig. 6 shows an example of the reconfiguration in a router according to the bandwidth needed in each channel. First, a buffer depth for all channels is decided at design time; in this case, we defined the buffer size equal to 4, as illustrated in Fig. 6(a). After this, the traffic in each channel is verified and a control defines the buffer depth needed in each link to serve this flow, as shown in Fig. 6(b). The distribution of the buffer words among the neighbor channels is carried out as shown in Fig. 6(c). The physical arrangement of the buffers in each channel still corresponds to the FIFO depth initially defined, as shown in Fig. 6(a), but the allocation of buffer slots among the channels can be changed at run time, as exemplified in Fig. 6(c).

Fig. 7. South input channel with the buffers partitioned for three channels: South, West, and East.

Now, let us consider an input buffer with ten slots, as shown in Fig. 7. In this example we consider again the South channel. Let us assume that the South channel shares part of its buffer with the neighboring channels (West and East channels). In this case, the South channel uses only two buffer slots, five buffer slots are used by the West channel, and three buffer slots are used by the East channel. The buffer of a channel is partitioned according to the need for loans among the channels. In this way, each buffer slot is allocated in a mutually independent way. Pointers to each buffer partition are used in order to control the flit storage process (read and write).
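The partitioned buffer of Fig. 7 (ten slots split as South 2, West 5, East 3) can be modeled as independent circular FIFOs that share one slot array, each with its own read and write pointer. The following Python sketch is only a behavioral illustration; the class name `PartitionedBuffer`, the per-partition occupancy counter, and the dictionary layout are assumptions, not the VHDL implementation of the paper.

```python
class PartitionedBuffer:
    """One physical input buffer split into per-owner circular FIFOs."""

    def __init__(self, partitions):
        # partitions maps an owner channel to its number of slots,
        # e.g. {"South": 2, "West": 5, "East": 3} for the ten-slot example.
        self.slots = [None] * sum(partitions.values())
        self.part = {}
        base = 0
        for owner, size in partitions.items():
            # rd and wr are offsets inside the partition; count tracks occupancy
            # so that the full and empty conditions can be told apart.
            self.part[owner] = {"base": base, "size": size, "rd": 0, "wr": 0, "count": 0}
            base += size

    def is_full(self, owner):
        p = self.part[owner]
        return p["count"] == p["size"]

    def is_empty(self, owner):
        return self.part[owner]["count"] == 0

    def write(self, owner, flit):
        """Store a flit in the owner's partition; return False if it is full."""
        if self.is_full(owner):
            return False
        p = self.part[owner]
        self.slots[p["base"] + p["wr"]] = flit
        p["wr"] = (p["wr"] + 1) % p["size"]   # circular increment
        p["count"] += 1
        return True

    def read(self, owner):
        """Remove and return the oldest flit of the owner, or None if empty."""
        if self.is_empty(owner):
            return None
        p = self.part[owner]
        flit = self.slots[p["base"] + p["rd"]]
        p["rd"] = (p["rd"] + 1) % p["size"]
        p["count"] -= 1
        return flit
```

Because every partition has its own base address and pointers, flits of different channels never mix, which is the property the full and empty signals are meant to guarantee.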
Each slice of a partitioned buffer in a channel has two pointers, one to control the read and another to control the write in the buffer (for example, addr rd E and addr wr E in Fig. 7). Besides the pointers, other control signals are needed, such as the signals that indicate when the partitioned buffers are empty and full. With these signals, each channel allows

neighbor channels to allocate buffer slots of this port and guarantees that the flits are not mixed among the channels. The information about how many buffer slots are used by each channel can be used to dynamically adjust their usage, consequently improving the efficiency. With this, one can monitor the NoC traffic flow and analyze how the resources are being used. This information can be used to increase the efficiency of the NoC design. Our proposal consists of reconfiguring the channel according to the availability of buffers in the channels. If a new channel depth is required, the buffer depth is updated slot by slot, and this change is made whenever a buffer slot is free. For the set of benchmarks used in this work, and as reported in many related works, whenever the application is changed, a different bandwidth is required among the channels. The reconfigurable router can change its depth in only a few cycles, which means a very small performance overhead. Moreover, as each core sends packets at a different rate, the reconfiguration of the router was implemented considering that there would be some time slack in the intervals between packets. As the traffic is composed of packets, the buffers are not used 100% of the time in all parts of the network.

Fig. 8. Flits that need to wait for buffer availability to be sent to the next router for the VOPD application with buffer depth equal to 4 for (a) the original architecture and (b) the reconfigurable router.

IV. RESULTS

A. Simulation Platform

A cycle-accurate NoC traffic simulator in Java was used to obtain the simulation results. All simulations were performed in a 2-D mesh with XY-routing algorithm and wormhole switching. The buffer depth defined for the reconfigurable router was equal to four for all channels.
For these experiments, fixed-length packets of 80 flits were assumed, and the link size was defined to be 16 bits. The benchmarks used in this analysis were the same shown in Fig. 2: MPEG4, VOPD, MWD, and Xbox. The injection rate of the packets in the NoC was determined from the bandwidth of each application, presented in Fig. 2; with the frequency set to 200 MHz, we obtain the interval, in number of cycles, at which each packet is sent to the link. The applications were mapped in the network by taking into account the relative bandwidth of each core, placing the cores so as to minimize global power. From this mapping, the bandwidth of each link considers the needs of all cores that utilize it.

B. Performance Evaluation

In order to define the buffer size of each channel in accordance with its need, we used a simple decision mechanism, the Router Buffer Loan Algorithm (RBLA). This algorithm distributes buffer slots among the channels of each router. It analyzes each channel individually and performs the buffer word distribution according to the number of hotspots. First, the algorithm verifies the hotspots of each router. We considered as hotspots those channels that receive a large number of flits and therefore require many buffer words for contingency reasons. Whenever hotspots are detected, the algorithm verifies the possibility of borrowing buffer slots from the neighbor channels. When there is only one hotspot in the router, the hotspot borrows from its right and left neighbors. If there are two hotspots in the router, the neighbor buffer words are divided between the hotspots. When a neighbor channel is in use and is not a hotspot, the hotspot leaves only one buffer word to the neighbor channel and uses the remaining buffer words. When there are three hotspots in a router, no buffer slots can be lent.
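The source does not reproduce the RBLA pseudocode, so the following Python sketch is only one possible reading of the rules stated above, not the authors' algorithm. It assumes that a hotspot is a channel whose demand exceeds the design-time depth, that the four channels form the ring `North, East, South, West`, that an idle neighbor lends all its slots while a neighbor in use keeps one, and that a lender adjacent to two hotspots splits its spare slots evenly. The name `rbla_sketch` is hypothetical.

```python
RING = ["North", "East", "South", "West"]  # assumed adjacency ring of one router

def rbla_sketch(demand, depth):
    """Return {channel: allocated depth} under one reading of the RBLA rules.

    demand[ch] is the buffer depth a channel would like to have (0 = unused);
    depth is the FIFO depth fixed at design time for every channel.
    """
    hot = [ch for ch in RING if demand[ch] > depth]    # assumed hotspot criterion
    alloc = {ch: depth for ch in RING}
    if not hot or len(hot) >= 3:
        return alloc                                   # three hotspots: nothing is lent
    for i, lender in enumerate(RING):
        if lender in hot:
            continue
        neighbors = [RING[(i - 1) % len(RING)], RING[(i + 1) % len(RING)]]
        takers = [ch for ch in neighbors if ch in hot]
        if not takers:
            continue
        keep = 1 if demand[lender] > 0 else 0          # a channel in use keeps one word
        share = (depth - keep) // len(takers)          # split evenly between hotspots
        for ch in takers:
            alloc[ch] += share
            alloc[lender] -= share
    return alloc
```

With a depth of 4, a single hotspot can reach at most 12 slots in this sketch, which matches the limit of three times the original depth stated in Section III.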
The constant buffer increase considered in the pseudocode refers to the buffer depth in the channel. Algorithm RBLA was used only to present a study of buffer distribution; in practice, this control can be realized in each router. An external control signal is provided in each router to define the buffer depth of each channel. The routers are autonomous; the only dependency is among the channels belonging to the same router. Using a traffic monitor it is possible to define at run time the appropriate buffer

depth for each channel, simply by indicating this size to the input ports responsible for this control. Algorithm RBLA provides a way to calculate the ideal buffer depth according to the traffic in each channel for the benchmarks adopted in this work. We use the results obtained with this algorithm to confirm the gains when the appropriate buffer depth is used in an NoC. Using the algorithm for buffer slot lending, we obtained the results of Fig. 8, which presents the behavior of the NoC for the VOPD application using a 4 × 3 NoC with buffer depth equal to 4. The x-axis presents the input channels of all 12 routers of the NoC, and the y-axis represents the number of flits that need to wait for the availability of the buffer to be sent to the next router. Equation (2) shows how the buffer overhead is calculated, where flits to send refers to all flits that should be sent to a channel, in accordance with the bandwidth required, if the router had an infinite buffer depth, and buffer depth means the real buffer size of the channel used to store the flits. Fig. 8(a) illustrates the results obtained with homogeneous routers and Fig. 8(b) presents the reduction of hotspots obtained with the reconfigurable router. One can observe that with the reconfigurable router the number of hotspots was drastically reduced. Another analysis was done for the four applications (MPEG4, MWD, VOPD, and Xbox), each of which presents different traffic conditions. Using the reconfigurable router, we defined the buffer size equal to 4 for each channel, considering the possibility of loans among the adjacent channels. From this configuration, we verified the performance obtained in the NoC with the reconfigurable router, and sized the buffers of the homogeneous router in order to have the same average latency as the reconfigurable ones.
In this case, to reach the same average latency obtained with the reconfigurable router, the homogeneous router needed much deeper FIFOs, as it can be seen in Fig. 9. In this figure each application has two columns, the first column shows the benchmarks that were connected by an NoC with reconfigurable router and buffer depth equal to 4. The second column shows the buffer depth needed to reach the same average latency of the reconfigurable router when a homogeneous router is used. Observing the results presented in Fig. 9, we verify that the buffer depth greatly influences the average latency. As these applications present different bandwidth in the links and different number of connections among the cores, we can confirm that in order to have the same average latency obtained with the reconfigurable router, larger and for many cases, useless buffer depths were required by the homogeneous router. We observed that the MWD application presented the smaller latency results when using the reconfigurable router with four positions, and to obtain the same latency value, the homogeneous router required a buffer with 11 positions. This can be explained due to the traffic behavior of the MWD application, since there are few connections among the cores. In this case, as there are many buffer units which are not used, they could be lent to the channels in use. This experiment proves that it is possible to use a single NoC with the reconfigurable router to obtain low latency results to any application. With the reconfigurable router we also verified that there was a better distribution of flits in the network. Fig. 10 shows a histogram with the number of flits that need to wait the availability of the buffer to be sent to the next router (buffer overhead) versus the number of channels that generate the network congestion for the homogeneous router. The high values of buffer overhead represent the hotspots, i.e., channels that need a larger buffer depth. In the -axis of Fig. 
10 one can see how many channels have approximately the same buffer overhead (flits that cannot be stored at first). Fig. 10(a) shows results for the homogeneous router and Fig. 10(b) for the reconfigurable one. For the homogeneous router there are ten channels [first two columns in Fig. 10(a)] that potentially decrease the performance of the NoC. For the proposed router, however, only two channels decrease the performance, and with a much smaller buffer overhead, since the buffer lending/borrowing process better distributes the storage of flits in the NoC. This experiment shows that, for the same buffer size, the reconfigurable router makes better use of the buffers, and hence less power is wasted.

C. Area, Power, and Frequency Results

The proposed router was described in VHDL and simulated with the ModelSim tool. We analyzed the average power consumption, area, and frequency results for a CMOS m standard cell library using the Synopsys Design Compiler tool. The design operates at a supply voltage of 1.8 V, and the power results were obtained with a 200 MHz clock frequency in both architectures. Table I presents the results obtained for this configuration. The channel width is n + 2 bits: n data bits and 2 control bits.
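The buffer lending/borrowing process evaluated in Fig. 10 can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the router's actual control logic: channels are held in a list, each channel may borrow only from its immediate list neighbors (no wraparound), a lender gives away only slots it does not need for its own demand, and the function name `lend_slots` and the demand values are hypothetical.

```python
def lend_slots(demand, depth=4):
    """Return the effective depth of each channel after neighbor loans.

    Assumptions (illustrative only): channel i may borrow from channels
    i - 1 and i + 1, and a neighbor lends only its spare slots.
    """
    n = len(demand)
    effective = [depth] * n
    for i in range(n):
        need = demand[i] - effective[i]
        for j in (i - 1, i + 1):
            if need <= 0:
                break
            if 0 <= j < n:
                spare = max(0, effective[j] - demand[j])
                loan = min(spare, need)
                effective[j] -= loan   # the lender gives up unused slots
                effective[i] += loan   # the borrower gains them
                need -= loan
    return effective


demand = [9, 2, 4, 6, 0]            # hypothetical flits to send per channel
effective = lend_slots(demand)      # depths after lending, same total slots

waiting_before = sum(max(0, d - 4) for d in demand)
waiting_after = sum(max(0, d - e) for d, e in zip(demand, effective))
```

With these assumed demands the total number of slots stays at 20, while the number of waiting flits drops from 7 to 3, which mirrors the qualitative shift between Fig. 10(a) and Fig. 10(b).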

Fig. 9. Four applications with the buffer depth of the original router sized to obtain the same latency as a reconfigurable router with buffer depth equal to 4.

Fig. 10. Number of channels in which flits need to wait for buffer availability before being sent to the next router (a) in an NoC with homogeneous routers and (b) in an NoC with reconfigurable routers. The application used in this experiment was MPEG4.

As shown in Fig. 9, for the same average latency as a reconfigurable router with buffer depth 4, the homogeneous router needed a different buffer size for each of the four applications. For this reason, we report the gains in the synthesis results when the homogeneous and reconfigurable routers deliver the same average latency (for each benchmark, the homogeneous router therefore uses a different buffer size than the reconfigurable router). Area results take into account the whole router, which is composed of five input channels and five output channels (Local, North, South, West, and East), the crossbar, the routing algorithm, the arbiters, and the switching algorithm. The proposed router modifies only the implementation of the input channels, but the synthesis results cover the whole router with all changes introduced in our proposal, such as control signals, pointers, registers, multiplexers, and other changes needed in the original architecture.

In a router, the largest power dissipation comes from the flip-flops of the buffers. By using the proposed reconfiguration with extra multiplexers, it has been possible to reduce the total number of required flip-flops. Power is still reduced because the multiplexers consume less power than flip-flops.
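To give a feel for how buffer depth drives the flip-flop count, the sketch below assumes that every buffer bit is stored in one flip-flop, so the storage of a router is input channels × depth × channel width. The helper name is hypothetical, the 18-bit width and the depths 4 and 11 are the values quoted in this paper for the channel width and the MWD case, and the extra pointers, registers, and multiplexers of the reconfigurable router are deliberately ignored.

```python
def buffer_flip_flops(input_channels: int, depth: int, width_bits: int) -> int:
    """Storage flip-flops in the input buffers (one flip-flop per bit, assumed)."""
    return input_channels * depth * width_bits


# Reconfigurable router: five input channels, depth 4, 18-bit channel.
reconfigurable = buffer_flip_flops(5, 4, 18)

# Homogeneous router sized for the same MWD latency: depth 11.
homogeneous = buffer_flip_flops(5, 11, 18)

# Fraction of storage flip-flops avoided, ignoring reconfiguration logic.
saving = 1 - reconfigurable / homogeneous
```

Under this simplified count the reconfigurable router stores 360 bits against 990, a reduction of about 64% in storage elements; the measured power figures in Table I also include the reconfiguration logic, so they do not scale by exactly this ratio.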
A flip-flop dissipates power even when no data changes at its input, since the clock is always switching. Note that, owing to the high activity present in the routers, clock gating is a less effective technique. Hence, the larger the buffers, the larger the power dissipation in the flip-flops. To sustain the same performance that the proposed buffer scheme achieves, a homogeneous router would need a much larger fixed-size buffer, with many more flip-flops. Thus, for the same performance, the proposed router reduces power because it uses much smaller buffers and hence fewer flip-flops.

In Table I, the power increase is a nonlinear function of the buffer depth because of the multiplexers presented in Fig. 4(a). These multiplexers define which flit of the FIFO must be sent to the channel output. Hence, the gate increase from FIFO depth 7 to 9 is lower than that from FIFO depth 9 to 11. With the applications simulated in this work, we confirmed that the original homogeneous NoC presents a large underutilization of the router, since not all of its channels are used. In such cases, the extra buffer words on unused channels of the original router consume power unnecessarily.

When the channel width is set to 18 bits, the proposed router presents no penalty in maximum frequency compared with the homogeneous router designed for the same latency. Moreover, for the same performance, the reconfigurable router presents a large reduction in power dissipation, reaching up to 52%. The larger the

link size, the larger the power savings allowed by the reconfigurable router, since in this case the impact of the extra circuits required to allow reconfiguration is amortized. With respect to area, the reconfigurable router also presents gains over the original router in some of the cases analyzed in Table I. Considering the four applications used in this work, the reconfigurable router reduces the power consumption by 40% on average and, for the same performance, uses 64% smaller buffer depths. Another advantage of using the NoC with the reconfigurable router instead of the homogeneous router is that one can dynamically change the buffer depth of each channel according to the needs of the application. Thus, the results obtained indicate that the proposed NoC router does not degrade the system performance and can save power.

TABLE I. POWER CONSUMPTION, FREQUENCY, AND AREA RESULTS FOR THE HOMOGENEOUS AND RECONFIGURABLE ROUTER ARCHITECTURES FOR THE SAME LATENCY VALUES

In Table I the synthesis results for the homogeneous router with buffer depth equal to 4 have been added, in order to show the overhead required to provide the adaptability. The area of the reconfigurable router almost doubled when compared with the homogeneous router for the same configuration, and the maximum frequency also decreased slightly. However, considering the same buffer depth for both architectures, the reconfigurable router can reduce the latency by up to 50% for the applications analyzed in this work. With the proposed router it is possible to have a single NoC connecting different applications that might change their communication patterns at run time. In the same way, this architecture allows application updates without compromising the performance of the system.
Meanwhile, if a homogeneous router had been used in these situations, modifications at design time would have been needed to achieve the optimum case. One would need to redesign the homogeneous NoC to set the buffer sizes and the position of the cores in the network. The technique proposed here avoids costly redesigns and new manufacturing.

V. RELATED WORKS

In MPSoCs, it is usual to find different interconnection needs among processors, memories, peripherals, and other elements. For this reason, the need for distinct bandwidth in each node of an NoC has been recognized. In the literature, some works present solutions that point in this direction, with approaches to improve NoC performance. In this section we survey several of these works, which show that within the same application there is a need for routers with different features and communication capabilities.

S. E. Lee and Bagherzadeh [8] proposed the use of different clocks while sending flits in the NoC, so that body flits operate faster than head flits. According to them, as FIFOs work faster than the route decision, it is possible to use different clocks to feed body flits and head flits. While the head flit is analyzed to define the route, body flits can continue advancing along the reserved path already established, improving the performance of the router.

Ahmad et al. [9] proposed a network router designed with a bus-like interface. A built-in wrapper is used, and thus any component compatible with a bus can be integrated into the NoC architecture. The interface of this router becomes a simple bus. The objective of this proposal was to reduce the design time and to ease integration, since it is not necessary to know the NoC architecture. When the network has a channel that requires high bandwidth, the NoC changes the switching, obtaining a dedicated path between the IPs.
Ahonen and Nurmi [10] proposed a hierarchical NoC, able to cope with the inefficiencies of a regular NoC. They used two types of on-chip network: the Global Network (NoC) and the Local Network (bus-based). The Local Network connects slaves to a master, which together form a local cluster. The NoC connects all local clusters, all of them having the same capabilities.

Vestias and Neto [11] considered mechanisms to reduce the area of the NoC without impairing the throughput and/or latency. They proposed an NoC architecture in which, for a given topology and traffic requirements, the architecture of each router can be obtained automatically. The depth of the buffers and the bandwidth of the switch matrix are configured in each router. The router architecture can be customized to specific performance requirements in accordance with the application, which makes it possible to reduce the area occupied by the NoC. A generic router is proposed in which area is traded off for bandwidth capacity. Different protocols and mechanisms can be configured in the generic router, and some techniques are used to reduce the area of FIFOs and switches.

Bouhraoua and Elrabaa [12] presented a modified Fat Tree NoC based on the fact that bandwidth is not constant in an application. According to these authors, there is no reason to give more

bandwidth than will be used to meet the bandwidth requirements of other clients. The Fat Tree (FT) topology is a class of NoC based on a subclass of the Multi-Stage Interconnection Network (MIN) topology. Routing in an FT is like routing in binary trees. The modification of the FT is based on the analysis of the required bandwidth. All router ports have the same bandwidth, this being the minimum bandwidth required. If a core needs more bandwidth, it has more router ports attached to it.

The common problem of the proposals reported above is that all of them must be applied at design time, generally with a static approach, which causes scalability problems whenever the NoC is used for a new application on the same platform. Hence, product updates or even customization of an MPSoC for a different market, using the same components but with a different communication pattern, are not possible without a costly redesign.

With regard to power consumption, K. Lee et al. [13] designed a low-power NoC using various power-efficient techniques at different design levels, such as channel, protocol, topology, circuits, and other alternatives, to obtain the NoC architecture. However, this proposal does not present any adaptability to support different types of applications.

Al Faruque et al. [14] presented a proposal using configurable links that can change their direction at run time. The idea is to use two half-duplex links allowing three configurations: both links in one direction, both links in the reverse direction, and links in opposite directions. With this, the links may offer up to twice the capacity of simple links, increasing the throughput of the system. Another very similar work is BiNoC [15], which replaces the unidirectional channels with bidirectional ones and can dynamically self-reconfigure to improve flit traffic.
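The three configurations of a pair of half-duplex links described for [14] can be enumerated in a short Python sketch. The class and function names, the unit link bandwidth, and in particular the selection policy in `choose_config` are hypothetical and serve only to illustrate the idea; they are not the mechanism of [14] or of BiNoC.

```python
from enum import Enum


class LinkConfig(Enum):
    """(links carrying A->B traffic, links carrying B->A traffic)."""
    BOTH_FORWARD = (2, 0)
    BOTH_REVERSE = (0, 2)
    OPPOSITE = (1, 1)


def capacity(config: LinkConfig, link_bw: float = 1.0) -> tuple:
    """Bandwidth available in each direction for a given configuration."""
    fwd, rev = config.value
    return (fwd * link_bw, rev * link_bw)


def choose_config(pending_fwd: int, pending_rev: int) -> LinkConfig:
    """Hypothetical policy: give both links to one direction only when
    the other direction has nothing to send."""
    if pending_rev == 0 and pending_fwd > 0:
        return LinkConfig.BOTH_FORWARD
    if pending_fwd == 0 and pending_rev > 0:
        return LinkConfig.BOTH_REVERSE
    return LinkConfig.OPPOSITE


# With one-directional traffic the pair offers twice the single-link capacity.
best_case = capacity(choose_config(pending_fwd=5, pending_rev=0))
```

The sketch makes the "up to twice the capacity" statement concrete: the doubling is available only while the opposite direction is idle.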
BiNoC uses a traffic control algorithm to support dynamic self-configuration, specifying the direction of each channel. This solution improves bandwidth utilization and lowers the latency at each router. However, merely increasing the channel width is insufficient, since larger links require larger buffers. Because BiNoC has input buffering and allows the output links to be used as input links, it needs buffering in both links, which severely aggravates the power consumption problem. Besides, it uses an inter-router transmission channel control scheme, implemented as an FSM for each channel, which ensures that only one direction is valid on each bidirectional channel at any time. If the requested channel is available, this means that the corresponding buffers at the neighboring router have enough storage space. The allocation of the channel is based only on the request. Hence, the other problem of this approach is that it neither considers the target application when distributing channels to each router nor has a priority scheme for distributing them. This can decrease the performance of the network, since a router with lower traffic (and hence lower theoretical priority) can request and obtain the channel before a router with higher traffic gets it.

In [21], an adaptive architecture with runtime observability is proposed, providing adaptability at the system level and at the architecture level. At the system level it can remap the tasks of the system, and at the architecture level it can reroute packets and reallocate the virtual channel buffers (VCBs). The changes at the architecture level are triggered by faults, which occur when packets do not reach their destination or when a VCB is full. The faults trigger the need for NoC adaptation at the architecture level, and they are used to invoke the necessary steps to reconfigure the NoC for the new infrastructure.
The adaptive process only occurs in the presence of a fault, and hence no performance or power advantage can be obtained during the normal operation of the system. It would be interesting to have an architecture that adapts not only to the presence of faults, but also to the communication patterns within the network.

In [22] one can find a system-level buffer allocation strategy for each specific application: given the traffic characteristics of a target application and the total budget of available buffering space, an algorithm optimizes the allocation of buffering resources across the router channels, matching the communication characteristics of the target application. The buffer distribution algorithm is based on architecture parameters (routing algorithm, delay parameters, and others) and application parameters (the probability of a packet being delivered to its destination and the packet injection rate). These characteristics are modeled in C++, and the algorithm assigns a certain number of buffers to each channel. However, buffer sizing is done at design time for each target application; if the communication behavior changes, the system will probably not deliver the required performance, or the resources will be underutilized.

In the next subsection we compare our proposal in more detail with virtual channels (VCs). We also present other related works, commenting on each proposal and comparing the advantages of our work with respect to these techniques.

A. Comparison With Virtual Channels

A work with goals similar to those of this paper, but using virtual channels, was presented in [16]. The authors proposed a unified buffer structure with a dynamic virtual channel regulator, called ViChaR, which dynamically allocates virtual channels and buffers according to network traffic conditions.
In this case, instead of individual and statically partitioned buffers, they used a unified buffer unit. We chose the comparison with ViChaR because both architectures have the same goal: to use adaptability to reduce the power dissipation or to increase the performance, given the buffer resources available in an NoC. In order to obtain a direct and effective comparison with the ViChaR architecture, we obtained the synthesis results for the same technology and the same synthesis tool. For these analyses we used a 90-nm standard cell library and synthesized the architecture with the Synopsys Design Compiler tool. The comparison results are shown in Table II. The ViChaR results are for a virtual channel buffer of size 16, whose slots can be used to store data of four channels. In Table II we compare the results for one input channel with 16 slots. According to these results, our proposal shows a 78% power reduction for one input channel in comparison with ViChaR, running at the same

frequency. The channel width was set to 128 bits, as specified in the ViChaR reference. Although our synthesis results were obtained at 500 MHz for the 90-nm technology, our proposed input channel reaches up to 730 MHz as its maximum frequency.

TABLE II. COMPARISON OF POWER CONSUMPTION AND AREA RESULTS BETWEEN VICHAR AND THE PROPOSED CHANNEL

In the comparison with the ViChaR architecture, we used the same NoC configuration to obtain the results. However, ViChaR and our proposal are completely different in terms of implementation. Analyzing the ViChaR architecture, one can observe that it has many more memory blocks to allow the correct functioning of the virtual channels. Besides these buffers, ViChaR uses table control logic and extra registers that consume considerable power. In our proposal, on the other hand, adaptability is obtained with additional multiplexers, which dissipate less power than memory circuits. In the simulation used to obtain the power results, we considered the sending of different flits over time. We verified that varying the bit activity of the flits stored in the buffer caused only a minor variation in power consumption. This result indicates that the memory and registers are the main sources of power dissipation. Our input channel has a larger area overhead than ViChaR; in our view, however, the power gains more than compensate for this.

We also carried out experiments to compare performance, reproducing the traffic behavior and configurations used to evaluate the ViChaR architecture. In that work, the configuration is one packet containing 4 flits, a clock frequency of 500 MHz, a channel width of 128 bits, and a buffer depth of 16 positions per port.
Using this same setup, the network presented no congestion in any of the studied examples. In this case, the different architectural approaches lead to very different results, since with a wider channel much less congestion is expected anyway. The ViChaR architecture uses a Token (VC) Dispenser to distribute the flits among the VCs. However, it does not apply any priority to the occupation of the buffers by flits: buffers are granted according to a VC request. Hence, if a larger packet requests many buffer slots, the performance would probably be worse in many situations than with our proposal. Our router makes more buffer slots available to the channel with a high traffic rate; even with larger packets, slots are distributed in accordance with the buffer availability defined for each channel. This does not happen in ViChaR: with larger packets, one packet may take all available buffer slots even if it belongs to a flow with a lower traffic rate, which compromises performance if later packets from a flow with a higher traffic rate need those slots. This behavior was not observed in the experiments reported for ViChaR, however, because only packets with four flits were used.

Al Faruque and Henkel [23] also proposed a work using virtual channels. This solution optimizes the number of VCs in two steps. The first step considers the mapping of tasks to minimize the virtual channels, and the second uses an analytical approach to further optimize the number of virtual channel buffers (VCBs). However, this technique configures the number of VCBs for all transactions of a specific application at design time, and hence the dynamic possibilities offered by our proposal are lost.
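For the ViChaR comparison discussed above, the difference between granting the slots of a shared buffer purely on request and bounding the share of each channel can be illustrated with a toy model in Python. Both functions, the channel names, and the numbers are hypothetical; the model reproduces neither the ViChaR Token Dispenser nor the control logic of the proposed router, only the allocation behavior the text argues about.

```python
def grant_on_request(requests, total_slots):
    """Grant slots in arrival order until the shared pool is empty."""
    free = total_slots
    grants = []
    for channel, wanted in requests:
        got = min(wanted, free)
        free -= got
        grants.append((channel, got))
    return grants


def grant_with_caps(requests, caps):
    """Grant slots in arrival order, never beyond each channel's own cap."""
    used = {channel: 0 for channel in caps}
    grants = []
    for channel, wanted in requests:
        got = min(wanted, caps[channel] - used[channel])
        used[channel] += got
        grants.append((channel, got))
    return grants


# A long packet on a low-traffic flow arrives first, then a high-traffic flow.
requests = [("low", 16), ("high", 8)]

shared = grant_on_request(requests, total_slots=16)
capped = grant_with_caps(requests, caps={"low": 4, "high": 12})
```

In this assumed scenario the request-driven pool hands all 16 slots to the first packet and leaves none for the high-traffic flow, whereas the capped scheme keeps 8 slots available for it.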
The proposal in [23] also reports a reduction in the number of VCBs compared with the QNoC architecture, but no performance or power results were given; thus, it was not possible to directly compare the synthesis and performance results of that solution with our proposal.

VI. CONCLUSION

In this paper, we presented the advantages of using an NoC with reconfigurable routers instead of homogeneous ones. Using reconfiguration, one can dynamically change the buffer depth of each channel according to the needs of the application, increasing the power efficiency of the system for the same performance level. We verified that, to reach the same performance obtained with the reconfigurable router, the original architecture needs many more buffers. The new router, while reaching the same performance as the original architecture, reduced power consumption by approximately 25% in the worst case and by 52% in the best case analyzed. Besides, when compared with the ViChaR architecture, our proposal obtains a 78% power reduction for the same configuration. Moreover, the reconfigurable router obtains the same performance as the homogeneous router with a buffer depth 64% smaller. Finally, with the new architecture it is possible to reconfigure the router in accordance with the application, obtaining similar performance even when the application changes radically.

REFERENCES

[1] L. Benini and G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp , Jan
[2] M. Azimi, N. Cherukuri, D. N. Jayasimha, A. Kumar, P. Kundu, S. Park, I. Schoinas, and A. S. Vaidya, "Integration challenges and tradeoff for tera-scale architectures," Intel Technol. J., vol. 11, no. 3, Aug
[3] L. Manferdelli, N. K. Govindaraju, and C. Crall, "Challenges and opportunities in many-core computing," Proc. IEEE, vol. 96, no. 5, pp , May
[4] J. Andrews and N. Baker, "Xbox 360 system architecture," IEEE Micro, vol. 26, no. 2, pp , Mar./Apr

[5] M. Vestias and H. Neto, "Area and performance optimization of a generic network-on-chip architecture," in Proc. Symp. Integr. Circuits Syst. Des. (SBCCI), 2006, pp
[6] C. Xuning and L. Peh, "Leakage power modeling and optimization in interconnection networks," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), 2003, pp. 90-95.
[7] T. Benini and G. De Micheli, "Analysis of power consumption on switch fabrics in network routers," in Proc. 39th Des. Autom. Conf. (DAC), 2002, pp. 524-529.
[8] S. E. Lee and N. Bagherzadeh, "Increasing the throughput of an adaptive router in network-on-chip (NoC)," in Proc. Int. Conf. Hardw./Softw. Codes. Syst. Synth., 2006, pp
[9] B. Ahmad, A. Ahmadinia, and T. Arslan, "Dynamically reconfigurable NOC with bus based interface for ease of integration and reduced designed time," in Proc. NASA/ESA Conf. Adapt. Hardw. Syst. (AHS), 2008, pp. 309-314.
[10] T. Ahonen and J. Nurmi, "Hierarchically heterogeneous network-on-chip," in Proc. Int. Conf. Comput. as a Tool (EUROCON), 2007, pp
[11] M. Vestias and H. Neto, "Router design for application specific network-on-chip on reconfigurable systems," Field Program. Logic Appl., vol. 1, pp. 389-394, 2007.
[12] A. Bouhraoua and M. E. Elrabaa, "Addressing heterogeneous bandwidth requirements in modified fat-tree network-on-chip," in Proc. 4th IEEE Int. Symp. Electronic Des., Test Appl. (DELTA), 2008, pp
[13] K. Lee, S. Lee, and H. Yoo, "Low-power network-on-chip for high-performance SoC design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 2, pp. 148-160, Feb. 2006.
[14] M. A. Al Faruque, T. Ebi, and J. Henkel, "Configurable links for runtime adaptive on-chip communication," in Proc. 12th IEEE/ACM Des. Autom. Test Euro. (DATE), 2009, pp
[15] Y.-C. Lan, S.-H. Lo, Y.-C. Lin, Y.-H. Hu, and S.-J. Chen, "BiNoC: A bidirectional NoC architecture with dynamic self-reconfigurable channel," in Proc. Int. Symp. Netw.-on-Chip, 2009, pp
[16] C. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif, and C. Das, "ViChaR: A dynamic virtual channel regulator for network-on-chip routers," in Proc. 39th Annu. Int. Symp. Microarch. (MICRO), 2006, pp
[17] D. Bertozzi, A. Jalabert, M. Srinivasan, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, "NoC synthesis flow for customized domain specific multiprocessor systems-on-chip," IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 2, pp , Feb
[18] K. Srinivasan and K. S. Chatha, "A low complexity heuristic for design of custom network-on-chip architectures," in Proc. Des., Autom. Test Euro. Conf., 2006, vol. 1, pp
[19] C. Zeferino, M. Kreutz, and A. Susin, "RASoC: A router soft-core for networks-on-chip," in Proc. Conf. Des., Autom. Test Euro. (DATE), 2004, pp
[20] C. Wu and H. Chi, "Design of a high-performance switch for circuit-switched on-chip networks," in Proc. Asian Solid-State Circuits Conf., 2005, pp
[21] M. A. Al Faruque, T. Ebi, and J. Henkel, "ROAdNoC: Runtime observability for an adaptive network on chip architecture," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), 2008, pp
[22] H. Jingcao, U. Y. Ogras, and R. Marculescu, "System-level buffer allocation for application-specific networks-on-chip router design," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 25, no. 12, pp , Dec
[23] M. Al Faruque and J. Henkel, "Minimizing virtual channel buffer for routers in on-chip communication architectures," in Proc. Conf. Des., Autom. Test Euro. (DATE), 2008, pp

Débora Matos (S'08) was born in Pelotas-RS, Brazil. She received the B.S. degree in digital system engineering from Universidade Estadual do Rio Grande do Sul (UERGS), Guaíba, Brazil, in 2007 and the M.Sc. degree in computer science from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 2010, where she is currently pursuing the Ph.D. degree in computer science. Her main research interests include networks-on-chip (NoCs), reconfigurable systems, system on chip, and multiprocessor system on chip (MPSoC).
Caroline Concatto (S'08) received the B.S. degree in digital system engineering from Universidade Estadual do Rio Grande do Sul (UERGS), Guaíba, Brazil, and the M.Sc. degree from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, where she is currently pursuing the Ph.D. degree in computer science. Her research interests include adaptive systems, networks-on-chip (NoCs), and fault tolerance techniques.

Márcio Kreutz received the B.S. degree in computer science and the M.Sc. and Ph.D. degrees in computer science and microelectronics from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1994, 1997, and 2005, respectively. His thesis was developed on the topic of networks-on-chip architectural optimization. He is currently an Adjunct Professor with Federal University of Rio Grande do Norte (UFRN), Natal, Brazil. His research interests include embedded architectures modeling and specification, embedded software mapping, and communication/processing architectures optimization.

Fernanda Kastensmidt (M'05) received the B.S. degree in electrical engineering and the M.Sc. and Ph.D. degrees in computer science and microelectronics from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1997, 1999, and 2003, respectively. She is a Professor with the Department of Computer Science, UFRGS. Her professional research experiences include internships with the Grenoble National Polytechnic Institute (INPG), France, in 1999, with Xilinx Corporation, San Jose, CA, in 2001, and with the Laboratory of Materials and Systems Integration (IMS), Bordeaux University, France, in Her research interests include VLSI testing and design, fault effects, fault tolerant techniques, and programmable architectures. She is the author of the book Fault-Tolerance Techniques for SRAM-based FPGAs (Springer, 2006).

Luigi Carro (M'97) was born in Porto Alegre, Brazil, in He received the electrical engineering, M.Sc., and Ph.D. degrees in computer science from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1985, 1989, and 1996, respectively. From 1989 to 1991, he worked with the R&D Group, ST-Microelectronics, Agrate, Italy. He is currently a Professor with the Applied Informatics Department, Informatics Institute, UFRGS, where he is in charge of computer architecture and organization disciplines at the undergraduate level. He is also a member of the Graduation Program in computer science at UFRGS, where he is coresponsible for courses on embedded systems, digital signal processing, and VLSI design. His primary research interests include embedded systems design, validation, automation and test, fault tolerance for future technologies, and rapid system prototyping. He has advised over 20 graduate students (Master's and Ph.D. levels). He has published over 150 technical papers on those topics and is the author of the book Digital Systems Design and Prototyping (Editora da Universidade, 2001, in Portuguese) and coauthor of Fault-Tolerance Techniques for SRAM-Based FPGAs (Springer, 2006) and Dynamic Reconfigurable Architectures and Transparent Optimization Techniques (Springer, 2010). His most updated resume is located in For the latest news, please check Dr. Carro was a recipient of the FAPERGS Researcher of the Year in Computer Science prize in 2007.

Altamiro Susin (M'03) was born in Vacaria-RS, Brazil. He received the Electrical Engineering and M.Sc. degrees from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1972 and 1977, respectively, and the Dr.Eng. degree from Institut National Polytechnique de Grenoble, Grenoble, France, in Since 1968, he has worked with digital computers, when he was with a group that started the computer centers of two local universities. He is a Full Professor with the Electrical Engineering Department, UFRGS, where he is in charge of analog and digital electronics disciplines at the graduate and undergraduate levels. He is also a member of the Graduation Programs in Computer Science, Electrical Engineering, and Microelectronics of UFRGS. His main research interests include integrated circuit architecture, system-on-chip design, and signal processing, with over 200 technical papers published in those domains. He has been responsible for several R&D projects funded with public and/or industry resources, and is presently coordinating a research network for Digital TV.
