774 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 5, MAY 2011

Data Encoding Schemes in Networks on Chip

Maurizio Palesi, Member, IEEE, Giuseppe Ascia, Fabrizio Fazzino, Member, IEEE, and Vincenzo Catania

Abstract: An ever more significant fraction of the overall power dissipation of a network-on-chip (NoC) based system-on-chip (SoC) is due to the interconnection system. In fact, as technology shrinks, the power contribution of NoC links starts to compete with that of NoC routers. In this paper, we propose the use of data encoding techniques as a viable way to reduce both power dissipation and energy consumption of NoC links. The proposed encoding scheme exploits the wormhole switching technique and works on an end-to-end basis. That is, flits are encoded by the network interface (NI) before they are injected into the network and are decoded by the destination NI. This makes the scheme transparent to the underlying network, since the encoder and decoder logic is integrated in the NI and no modification of the router architecture is required. We assess the proposed encoding scheme on a set of representative data streams (both synthetic and extracted from real applications), showing that it is possible to reduce the power contribution of both the self-switching activity and the coupling switching activity in inter-router links. As a result, we obtain a reduction in total power dissipation and energy consumption of up to 37% and 18%, respectively, without any significant degradation in terms of either performance or silicon area.

Index Terms: Coupling capacitance, data encoding, low power, network on chip (NoC), power analysis.

I. Introduction

AS THE NUMBER of cores integrated into a system-on-chip (SoC) increases, the role played by the interconnection system becomes more and more important.
The International Technology Roadmap for Semiconductors [1] depicts the on-chip communication issues as the limiting factors for performance and power consumption in current and next generation SoCs [2]. Design in the era of ultradeep submicron silicon is mainly dominated by issues concerning the communication infrastructure. As the design complexity increases, the total length of the interconnection wire increases, resulting in long transmission delay and higher power consumption. In addition, the distance between wires shrinks with technology, increasing coupling capacitance, and the height of the wire material increases, resulting in greater fringe capacitance [3]. While SoCs consisting of tens of cores were common in the last decade, common predictions foresee that the next generation of many-core SoCs will contain hundreds or thousands of cores [4]. In the many-core era, as the number of cores residing on the same SoC increases significantly, the communication solutions also need to change drastically in order to support the new inter-core communication demands.

Manuscript received April 16, 2010; revised September 14, 2010; accepted November 9, Date of current version April 20, This paper was recommended by Associate Editor D. Atienza.
M. Palesi is with Kore University, Enna 94100, Italy (maurizio.palesi@unikore.it).
G. Ascia and V. Catania are with the Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Università di Catania, Catania 95125, Italy (gascia@diit.unict.it; vcatania@diit.unict.it).
F. Fazzino is with Icera, Inc., Bristol BS32 4AQ, U.K. (fazzino@icerasemi.com).
Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCAD /$26.00 © 2011 IEEE
It is nowadays widely recognized that network-on-chip (NoC) architectures [5] represent the most viable solution to cope with the scalability issues of future many-core systems and to meet the performance, power, and reliability requirements which characterize future ambient intelligent applications. The importance of interconnects in complex many-core chips has outrun the importance of transistors as a dominant factor of performance, power, cost, and reliability [6], [7]. Sophisticated on-chip communication protocols, involving advanced adaptive routing algorithms, selection policies, data protection schemes, and mechanisms aimed at guaranteeing quality of service, are pushing the interconnect system to become one of the main elements which characterize the system in terms of both power dissipation and energy consumption. In fact, the advantages over bus-based architectures come at the cost of an increase in complexity, which makes the communication system one of the main elements of a SoC that strongly impact the cost, power, and performance figures of the overall system. For instance, in Intel's 80-tile TeraFLOPS processor [8] over 30% of the chip area is dedicated to the communication system, and the communication power accounts for about 28% of the total. In the Massachusetts Institute of Technology RAW chip [9] the NoC is responsible for 40% of the system power. In the Æthereal NoC the largest percentage of power dissipation (54%) is due to the NoC clock, followed by the NoC links (18%) [10]. In [11], it has been shown that on-chip interconnects account for a significant fraction (up to 50%) of the total on-chip energy consumption.

The basic elements which form a NoC-based interconnect are network interfaces (NIs), routers, and links. As technology shrinks, the power dissipated by the links is as relevant as (or more relevant than) that dissipated by routers and NIs [12]–[15]. In this paper we focus on the power dissipated by network links.
Links dissipate power due to the switching activity (both self and coupling) induced by subsequent data patterns traversing the link [16]. We focus on data encoding schemes as a viable way to reduce the power dissipated by the network links. The basic idea is to suitably encode the data before their injection into the network in such a way as to reduce the switching activity of the links. Differently from previous approaches to data encoding in NoCs [16], [17], our proposal exploits the pipelined nature of the wormhole switching technique (commonly used in the NoC context) to implement an end-to-end encoding/decoding scheme. In our proposal data are encoded before transmission and are decoded at the destination. This makes the approach transparent with respect to the underlying NoC fabric, as it does not require any modification of the router architecture. It should be pointed out, however, that the proposed approach is intended for NoC architectures which do not use virtual channels (VCs). In fact, if VCs are used, the effectiveness of the proposed approach is reduced, as will be shown in the experimental section. In addition, although the proposed approach is specifically focused on reducing the power dissipated by network links, it does not conflict with other techniques which attack the power problem by acting on the other main elements of the interconnect. Based on this, it could be used in cooperation with other approaches to form a complete framework for power optimization of the NoC based interconnection system.

The proposed data encoding schemes are assessed on a set of traffic scenarios, both synthetic and extracted from real applications. The analysis takes into consideration not only the power and energy saving due to the reduction of the switching activity in network links, but also the overhead (both in terms of power dissipation and silicon area) due to the encoding and decoding logic integrated into the NI. We show that up to 37% of power dissipation and up to 18% of energy consumption can be saved by adopting the proposed encoding schemes, without impacting the overall performance of the network.

The rest of this paper is organized as follows. In Section II, we briefly discuss related research. An overview of the proposed data encoding scheme is presented in Section III. In Section IV, we perform a general quantitative analysis aimed at showing the power saving figures obtained using data encoding schemes when several parameters are made to vary.
The proposed data encoding scheme, along with a possible hardware implementation and its analysis, is presented in detail in Section V. The proposed data encoding scheme is assessed and compared with other approaches on a set of traffic scenarios, both synthetically generated and extracted from real applications, in Section VI. Finally, in Section VII we draw our conclusions and discuss possible future developments.

II. Related Work

The interconnection network dissipates a significant fraction of the total system power budget. For this reason, the design of power efficient interconnection networks is today recognized as a key issue. There are several works in the literature which deal with power dissipation and energy consumption issues in NoC architectures. They differ either in the level of abstraction at which they operate or in the specific NoC element they focus on. Here we focus on the power dissipated by network links. Several techniques have been proposed in the literature to reduce the power dissipated by the links of a NoC [18]–[22]. In this section, we review the subset of them which use data encoding schemes as the main mechanism to reduce power dissipation.

Almost all the data encoding techniques proposed in the literature have been defined for bus-based architectures, with the primary goal of minimizing transition activity on buses while ignoring cross-coupled capacitance. The bus-invert method [23] can be applied to encode randomly distributed data patterns. Highly correlated access patterns exhibit spatial-temporal locality, which is exploited by Gray code [24], the T0 method [25], and working-zone encoding [26]. Application specific approaches based on a priori knowledge of the traffic patterns have been proposed [27]–[29]. Other encoding techniques have been defined to take into consideration the contribution of cross-coupled capacitance [30], [31].

In the context of NoCs, Jantsch et al. [16] analyzed the use of partial bus invert coding as a link level low power encoding technique, with the conclusion that it spends several times more power than no encoding at all, if normalized for the same performance, which is done by adjusting supply voltage and frequency. However, differently from what we propose in this paper, they considered point-to-point encoding in which every router in the NoC decodes the incoming flits and encodes the outgoing flits. In addition, [16] did not take advantage of the pipelined nature of the flow of flits through the links of the routing path, which is guaranteed by the wormhole switching technique generally used in NoCs. Conversely, the data encoding scheme proposed in this paper is designed to exploit the wormhole switching technique, making it possible to operate an end-to-end encoding which does not introduce any overhead in terms of routers and links. It only requires an upgrade of the network interface, which is augmented with the encoding/decoding logic, leaving the underlying network as is.

Pande et al. proposed the use of crosstalk avoidance codes (CAC) to improve signal integrity by reducing the effective coupling capacitance and lowering the energy dissipation of wire segments [17]. By incorporating CAC in NoC data streams, the effective coupling capacitance of the inter-switch wire segments, and hence the communication energy, is reduced without incurring the non-optimal wire area overhead of shielding/spacing. However, its application requires redundant wires, and the encoding/decoding process is performed hop by hop for the header flit.

The data encoding schemes we present in this paper have already been introduced by the authors in [32] and [33]. In this paper, the proposed schemes are discussed in more detail and assessed by means of both a quantitative analysis and an experimental analysis.
Differently from [32] and [33], in which the analysis was carried out using synthetic traffic scenarios and without considering the interaction between concurrent communication flows (zero-load analysis), in this paper we extend the experimental analysis to real case studies using a cycle accurate simulator in which the dynamic behavior of the NoC is modeled.

III. Overview of the Proposal

The general scheme of the proposed approach is depicted in Fig. 1. The basic idea is to apply an encoding technique end-to-end, taking advantage of the wormhole switching technique [34]. In fact, wormhole switching is the most suitable option for on-chip communication [35]. The rationale behind this idea is the pipelined nature of wormhole switching. Since all the links of the routing path are traversed by the same sequence of flits, the encoding decision taken at the network interface guarantees the same switching behavior in each link of the routing path.

Fig. 1. General scheme of the proposed approach.

As shown in Fig. 1, the NI is augmented with an encoder (E) and a decoder (D) block. With the exception of the header flit, the encoder encodes the outgoing flits of the packet in such a way as to minimize the power dissipated by the inter-router point-to-point links which form the routing path of the current packet. Since the routers are not equipped with any encoding/decoding logic, the header flit is not encoded, as it contains control information (destination address, packet size, and so on) which has to be processed by the routers along the routing path. Similarly, all the incoming flits in the network interface (with the exception of the header flit) are decoded by the decoder block.

It should be pointed out that the proposed scheme is designed to be applied in implementations without VCs. In fact, if VCs are used, the assumption that flits belonging to different packets are not interleaved on the same link is no longer valid. This does not mean that the proposed scheme cannot be applied in VC based implementations but, rather, that the potential power savings are reduced.

Before describing the proposed encoding technique, in the next section we perform a general quantitative analysis which allows us to assess the achievable power reduction when the scheme outlined above is used.

IV. General Quantitative Analysis

In this section, we will first define a general model to quantify the communication power saving that can be achieved using an end-to-end data encoding technique as sketched in Fig. 1. Then, we will analyze the impact of several architectural and communication-related parameters on power saving. Finally, we will summarize the results of this analysis.

A. Power Saving Estimation

Let us consider a packet of $n + 1$ flits $pkt = \{b_H, b_1, b_2, \ldots, b_n\}$, where $b_H$ indicates the header flit and $b_i$, $i = 1, 2, \ldots, n$, the body flits. Let us suppose that a packet is transmitted from $PE_s$ to $PE_d$ involving $h$ hops¹ (see Fig. 1). Such a packet will pass through $h$ links and will be processed by $h + 1$ routers (from $R_0$ to $R_h$). The power dissipated to transmit the packet can be expressed as

$$P(pkt) = 2(n+1)P_{NI} + (h+1)P_R^{(H)} + (h+1)nP_R^{(B)} + h(n+1)P_L \tag{1}$$

where $P_R^{(H)}$ and $P_R^{(B)}$ indicate the power dissipated by the router when it routes a header flit and a body flit, respectively, $P_{NI}$ is the power dissipated by the network interface, and $P_L$ is the power dissipated to transmit a flit over a link.

¹ With the term hops we refer to the number of links traversed and not to the number of routers traversed.

Now, let us consider the case in which the NI encodes each flit of the packet (except the header flit) before transmission to the network and decodes each flit received from the network (except the header flit). In this case, the power dissipated to transmit the packet can be expressed as

$$\hat{P}(pkt) = 2P_{NI} + 2n\hat{P}_{NI} + (h+1)P_R^{(H)} + (h+1)nP_R^{(B)} + h(n+1)\hat{P}_L \tag{2}$$

where $\hat{P}_{NI}$ is the power dissipated by the NI augmented with the encoding/decoding logic and $\hat{P}_L$ is the power dissipated to transmit an encoded flit over a link. Let us indicate with $P_{ED}$ the power contribution of the encoding/decoding logic.² We can approximate $\hat{P}_{NI}$ as $P_{NI}$ plus the overhead due to the encoding/decoding logic

$$\hat{P}_{NI} \approx P_{NI} + P_{ED}. \tag{3}$$

² We assume that the power dissipated by the encoder logic is equal to the power dissipated by the decoder logic.

Substituting (3) in (2) we have

$$\hat{P}(pkt) = 2P_{NI} + 2n(P_{NI} + P_{ED}) + (h+1)P_R^{(H)} \tag{4}$$
$$\qquad\qquad + (h+1)nP_R^{(B)} + h(n+1)\hat{P}_L. \tag{5}$$

The percentage reduction in power dissipation, $PR$, when the encoding technique is used is computed as

$$PR = 1 - \frac{\hat{P}(pkt)}{P(pkt)}. \tag{6}$$

Substituting (1) and (5) in (6) and performing some symbolic algebraic manipulations we obtain

$$PR = \frac{\dfrac{h\varepsilon}{\delta}(n+1)(1-\beta) - 2n\gamma}{2(n+1) + \varepsilon(h+1)(n+\alpha) + \dfrac{h\varepsilon}{\delta}(n+1)} \tag{7}$$

which expresses the percentage power reduction by means of the following relative parameters.

1) $\alpha = P_R^{(H)}/P_R^{(B)}$ is the header to body flit routing power ratio. This ratio is $\geq 1$, since routing a header flit involves more operations than routing a body flit (e.g., routing algorithm, selection policy, arbitration, and so on).
2) $\beta = \hat{P}_L/P_L$ is the link power reduction factor. It indicates the average reduction factor of link power dissipation when the encoding technique is used.
3) $\gamma = P_{ED}/P_{NI}$ is the amount of power dissipated by the encoder/decoder logic normalized to the power dissipated by the network interface.

4) $\delta = P_R^{(B)}/P_L$ is the ratio between the power dissipated by the router when it routes a body flit and the power dissipated to transmit a flit over a link.
5) $\varepsilon = P_R^{(B)}/P_{NI}$ is the ratio between the power dissipated by the router when it routes a body flit and the power dissipated by the network interface.

Equation (7) expresses the percentage reduction in power dissipation when the encoding technique is used as a function of the packet size, the distance from source to destination, and the relative parameters $\alpha$, $\beta$, $\gamma$, $\delta$, and $\varepsilon$.

B. Analysis and Discussion

To get a feel for the percentage reduction in power dissipation that can be achieved in practical cases, let us consider as baseline a router in which $\alpha = 1.04$ (i.e., the average power dissipated by the router when it processes the header flit is 4% higher than when it processes a body flit) and $\varepsilon = 1.08$ (i.e., the power dissipated by the router is 8% higher than that dissipated by the NI). The power values used to compute $\alpha$ and $\varepsilon$ have been captured from a power analysis of a bit router with input FIFO buffers of 4 flits and a NI with minimum buffering supporting OCP and AHB protocols. Additional details about the synthesis results can be found in Section V-E.

Fig. 2(a)–(d) shows a set of contour plots of the percentage power reduction for different values of the parameters $\beta$, $\gamma$, $\delta$, $h$, and $n$. Fig. 2(a) shows the contour plot of the percentage power reduction for different values of the link power reduction factor, $\beta$ (from 0% to 90%), and different power fractions of the encoding/decoding block, $\gamma$ (from 0% to 10%). As can be observed, the overall power reduction spans from 5% to 35%. The effect of hop count, $h$, is analyzed in Fig. 2(b). As can be observed, the effectiveness of data encoding increases as hop count increases [36].
Moreover, as the routing path length increases, the effectiveness of data encoding becomes less and less sensitive to the power overhead due to the encoding/decoding logic. The effect of packet size, $n$, is shown in Fig. 2(c). As can be observed, there is a $\gamma$ threshold located at about $\gamma_T = 0.04$. This value depends on the values of the remaining parameters considered in this analysis ($\beta = 0.7$, $\delta = 1$, $h = 10$). It is interesting to observe that below $\gamma_T$ an increase in packet size has a positive impact on power reduction. Conversely, above $\gamma_T$ an increase in packet size causes a reduction in power saving. Please note that such behavior is observed for small packet sizes. For packet sizes greater than six flits, the separation becomes negligible. Finally, Fig. 2(d) shows the effect of the router to link power ratio $\delta$. As expected, the larger the power contribution of the link compared to that of the router, the larger the power saving that can be achieved.

C. Summary of the Analysis

Overall, the effectiveness of an encoding scheme strongly depends on several architectural and technological parameters as well as communication related parameters (e.g., packet size, routing path length). To summarize the results of the above general quantitative analysis, we can state that the effectiveness of applying data encoding techniques for low power in the NoC context increases as:
1) hop count increases;
2) the power contribution of the links becomes comparable with that of the routers; and
3) packet size increases.

It is expected that, due to the ever growing bandwidth requirements of current and future applications, links will become wider and wider. In addition, the technological trend is pushing power consumption from logic to wiring. Based on this, the power contribution of links is expected to dominate that of routers. Moreover, as the number of cores increases, the NoC size increases as well, resulting in a longer average path length.
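The model of Section IV-A can be checked numerically. The following is a minimal Python sketch, not the authors' tooling: it evaluates the closed form (7) and compares it against a direct evaluation of (1), (4)-(5), and (6) with $P_{NI}$ normalized to 1. The function names are hypothetical, and the sample values of $\beta$, $\gamma$, $\delta$, $h$, and $n$ are simply those quoted for Fig. 2.

```python
def power_reduction_closed_form(n, h, alpha, beta, gamma, delta, eps):
    """Closed-form PR of Eq. (7) in terms of the relative parameters."""
    k = h * eps / delta  # equals h * P_L / P_NI
    num = k * (n + 1) * (1.0 - beta) - 2.0 * n * gamma
    den = 2.0 * (n + 1) + eps * (h + 1) * (n + alpha) + k * (n + 1)
    return num / den


def power_reduction_direct(n, h, alpha, beta, gamma, delta, eps):
    """PR from Eqs. (1), (4)-(5) and (6), with P_NI normalized to 1 (assumption)."""
    p_ni = 1.0
    p_rb = eps * p_ni       # eps   = P_R^(B) / P_NI
    p_rh = alpha * p_rb     # alpha = P_R^(H) / P_R^(B)
    p_l = p_rb / delta      # delta = P_R^(B) / P_L
    p_l_enc = beta * p_l    # beta  = encoded link power / plain link power
    p_ed = gamma * p_ni     # gamma = P_ED / P_NI

    # Eq. (1): packet power without encoding
    p = 2 * (n + 1) * p_ni + (h + 1) * p_rh + (h + 1) * n * p_rb + h * (n + 1) * p_l
    # Eqs. (4)-(5): packet power with encoding of the n body flits
    p_hat = (2 * p_ni + 2 * n * (p_ni + p_ed) + (h + 1) * p_rh
             + (h + 1) * n * p_rb + h * (n + 1) * p_l_enc)
    return 1.0 - p_hat / p  # Eq. (6)


if __name__ == "__main__":
    params = dict(n=4, h=10, alpha=1.04, beta=0.7, gamma=0.04, delta=1.0, eps=1.08)
    print(power_reduction_closed_form(**params))
    print(power_reduction_direct(**params))
```

Agreement between the two functions is a quick sanity check on the algebra leading to (7); sweeping `h`, `n`, or `gamma` reproduces the qualitative trends discussed in Section IV-B.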
Based on the above considerations, we believe that the use of data encoding techniques represents a viable solution to address low-power issues in NoC based system architectures.

V. Proposed Encoding Scheme

In this section, we present the proposed encoding scheme, whose goal is to reduce the power dissipated by the point-to-point inter-router links of a NoC. Before discussing the proposed scheme, we briefly analyze the different contributions which determine the power dissipated by a link.

A. Power Model

The dynamic power consumed by the interconnects and drivers is given by

$$P = [T_{0\to1}(C_s + C_l) + T_c C_c]\, V_{dd}^2 F_{ck} \tag{8}$$

where $V_{dd}$ is the supply voltage, $F_{ck}$ is the clock frequency, $C_s$ is the self capacitance (which includes the parallel-plate capacitance and the fringe capacitance), $C_l$ is the load capacitance, and $C_c$ is the coupling capacitance. $T_{0\to1}$ and $T_c$ are the average number of effective transitions per cycle for $C_s$ and $C_c$, respectively. They are computed as follows. $T_{0\to1}$ counts the number of $0 \to 1$ transitions in the bus in two consecutive transmissions. $T_c$ counts the correlated switching between physically adjacent lines. Precisely, we can enumerate four types of coupling transitions as follows [30]. A Type I transition occurs when one of the lines switches while the other stays unchanged. In a Type II transition one line switches from low to high and the other from high to low. A Type III transition occurs when both lines switch simultaneously in the same direction. Finally, in a Type IV transition neither line switches.

The effective switched capacitance varies from type to type. Thus, the coupling transition activity $T_c$ is a weighted sum of the different types of coupling transition contributions. We have

$$T_c = k_1 T_1 + k_2 T_2 + k_3 T_3 + k_4 T_4 \tag{9}$$

where the $T_i$, $i = 1, 2, 3, 4$, are the average number of transitions of type $i$ and the $k_i$ are weights. According to [30] we assume $k_1 = 1$, $k_2 = 2$, and $k_3 = k_4 = 0$. That is, $k_1$ is taken as the reference for the other types of transition. The effective capacitance in a Type II transition is usually twice that of a Type I transition. In a Type III transition, as both signals switch simultaneously, $C_c$ is not charged (here we assume that there is no misalignment between the two transitions). Finally, in a Type IV transition there is no dynamic charge distribution over $C_c$. Based on this, (8) can be expressed as follows:

$$P = [T_{0\to1}(C_s + C_l) + (T_1 + 2T_2) C_c]\, V_{dd}^2 F_{ck}. \tag{10}$$

In the next subsection, we present the proposed encoding scheme, whose primary goal is to minimize $T_1$ and $T_2$ and whose secondary goal is to minimize $T_{0\to1}$.

Fig. 2. Contour plot of the percentage reduction in power dissipation when the encoding technique is used and $\alpha = 1.04$, $\varepsilon = 1.08$. (a) $\delta = 1$, $h = 10$, $n = 4$. (b) $\beta = 0.7$, $\delta = 1$, $n = 4$. (c) $\beta = 0.7$, $\delta = 1$, $h = 10$. (d) $\beta = 0.7$, $h = 10$, $n = 4$.

B. Proposed Encoding Scheme

Looking at (8) and (9) we have

$$P \propto T_{0\to1} C_s + (k_1 T_1 + k_2 T_2 + k_3 T_3 + k_4 T_4) C_c. \tag{11}$$

If the data (from now on, the flit) is inverted, the link power consumption will be

$$P' \propto T'_{0\to1} C_s + (k_1 T'_1 + k_2 T'_2 + k_3 T'_3 + k_4 T'_4) C_c \tag{12}$$

where $T'_{0\to1}$, $T'_1$, $T'_2$, $T'_3$, and $T'_4$ indicate the self transition activity and the coupling transition activity of Types I, II, III, and IV, respectively, if the flit is inverted before being transmitted. It is simple to determine the relationship between the coupling transition activities when the flit is transmitted as is and those when the flit is transmitted with its bits inverted. Table I reports, for each transition type, how it mutates if the flit is inverted.

TABLE I
How Transitions Mutate If Data Is Inverted

Normal (times t − 1, t)            Inverted (times t − 1, t)
Type I                             Type I
Type II                            Type IV
Type III                           Type IV
Type IV ($T_4^{*}$, $T_4^{**}$)    Types II and III ($T_4^{*} \to T_3$, $T_4^{**} \to T_2$)

Fig. 3. Flowchart to evaluate the invert condition (14) for link width greater than or equal to 8 bits.

Data are organized as follows. The first bit is the value of the generic $i$th line of the link, whereas the second bit represents the value of the adjacent line (line $i + 1$ of the same link). For each partition, the first line represents the values at time $t - 1$, whereas the second line represents the values at time $t$. For instance, looking at the first partition, which reports Type I transitions, the first column indicates that, in time slot $t - 1$, lines $i$ and $i + 1$ of a link were 0 and 0, respectively, and in the next time slot $t$ they switch to 0 and 1, respectively.

As can be observed from Table I, Type I transitions remain Type I transitions if the flit is inverted. Type II and Type III transitions mutate into Type IV transitions if the flit is inverted. Type IV transitions mutate into either Type II or Type III transitions. In particular, the transitions indicated as $T_4^{*}$ in the table mutate into Type III transitions, whereas those indicated as $T_4^{**}$ mutate into Type II transitions. Similarly, it is simple to find that $T'_{0\to1} = T_{0\to0}$. Thus, (12) can be expressed as a function of $T_1$, $T_2$, $T_3$, $T_4^{*}$, and $T_4^{**}$ as

$$P' \propto T_{0\to0} C_s + [k_1 T_1 + k_2 T_4^{**} + k_3 T_4^{*} + k_4 (T_2 + T_3)] C_c. \tag{13}$$

It is convenient to invert the flit before transmission if $P > P'$. Taking (11) and (13) and considering, according to [30], $k_1 = 1$, $k_2 = 2$, $k_3 = k_4 = 0$, and $C_c/C_s = 4$, we obtain the following invert condition:

$$T_{0\to1} + 8T_2 > T_{0\to0} + 8T_4^{**}. \tag{14}$$

In conclusion, the proposed encoding scheme simply inverts the flit before its transmission if and only if the invert condition (14) is satisfied. In the next subsection, we assess the hardware implications of implementing this encoding scheme in the network interface of a NoC based system.
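To make the transition bookkeeping concrete, the following Python sketch counts the self and coupling transitions between the word previously on the link and a candidate flit, and then evaluates the invert condition (14). It is an illustration under stated assumptions rather than the authors' implementation: words are modelled as integers with bit $i$ standing for line $i$, the names `count_transitions`, `link_cost`, and `should_invert` are hypothetical, and the keys `T4s` and `T4ss` stand for the two Type IV subsets that become Type III and Type II, respectively, after inversion.

```python
from collections import Counter


def count_transitions(prev, cur, width):
    """Count self and coupling transitions between two consecutive link words."""
    c = Counter()
    pb = [(prev >> i) & 1 for i in range(width)]
    cb = [(cur >> i) & 1 for i in range(width)]
    # Self transitions on lines that were 0 at time t-1
    for p, x in zip(pb, cb):
        if p == 0:
            c["T01" if x else "T00"] += 1
    # Coupling transitions on each pair of adjacent lines (i, i+1)
    for i in range(width - 1):
        sw_a = pb[i] != cb[i]
        sw_b = pb[i + 1] != cb[i + 1]
        if sw_a != sw_b:
            c["T1"] += 1  # exactly one line switches
        elif sw_a:
            # both switch: opposite directions -> Type II, same direction -> Type III
            c["T2" if cb[i] != cb[i + 1] else "T3"] += 1
        else:
            # neither switches: unequal values would become Type II if inverted
            c["T4ss" if cb[i] != cb[i + 1] else "T4s"] += 1
    return c


def link_cost(prev, cur, width, cc_over_cs=4):
    """Relative link power of Eq. (11) with k1=1, k2=2, k3=k4=0 and C_s = 1."""
    c = count_transitions(prev, cur, width)
    return c["T01"] + cc_over_cs * (c["T1"] + 2 * c["T2"])


def should_invert(prev, cur, width, cc_over_cs=4):
    """Invert condition; reduces to Eq. (14) when C_c / C_s = 4."""
    c = count_transitions(prev, cur, width)
    k = 2 * cc_over_cs
    return c["T01"] + k * c["T2"] > c["T00"] + k * c["T4ss"]


if __name__ == "__main__":
    prev, cur, width = 0b0101, 0b1010, 4
    print(count_transitions(prev, cur, width))
    print(should_invert(prev, cur, width))
```

Because `link_cost` evaluates (11) directly, comparing `link_cost(prev, cur, w)` with `link_cost(prev, ~cur & mask, w)` gives an independent check that the condition derived from Table I selects the cheaper of the two words.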

C. Design of the Proposed Encoding Scheme

Looking again at the invert condition (14) and considering a link width less than or equal to 8 bits, if $T_2$ is greater than $T_4^{**}$ then the invert condition is satisfied, as $T_{0\to0}$ can be at most 8. Based on this, the flowchart shown in Fig. 3 can be considered a simple way to evaluate the invert condition. We use this algorithm as the basis for the implementation of the encoding logic. For link widths greater than 8 bits we found that the misprediction of the invert condition does not exceed 1.2% on average, as will be shown in the next subsection.

Let us consider a NoC with links of width $w$ bits. We assume that the NI, which hosts the encoding logic, packs body flits in $w - 1$ bits. Fig. 4(a) shows the top-level view of the encoder. The $(w - 1)$-bit body flit is concatenated with a 0 bit and represents the first input of the encoder. The second input is the previously encoded body flit. The internal logic of the encoder block is sketched in Fig. 4(b). The $w - 1$ bits of the incoming body flit are indicated with $x_i$, $i = 0, 1, \ldots, w - 2$, whereas those of the previously encoded body flit are indicated with $y_i$. The $w$th bit of the previously encoded body flit is indicated with inv. This bit is used by the decoder to decide whether the received body flit has to be inverted (inv = 1) or left as is (inv = 0).

Fig. 4. Encoder architecture. (a) Top level view. (b) Internal view of the encoder block (E).

The first level of the encoder determines the transition type. The four-input blocks $T_2$ and $T_4^{**}$ assert their output if $y_i y_{i+1} \to x_i x_{i+1}$ is a $T_2$ or a $T_4^{**}$ transition, respectively. The two-input blocks $T_{01}$ and $T_{00}$ assert their output if $y_i \to x_i$ is a $T_{0\to1}$ or a $T_{0\to0}$ transition, respectively. The blocks Ones in the second level are parallel counters which count the number of ones in their inputs. Finally, in the third level there is a set of parallel comparators.

D. Simplified Version of the Proposed Encoding Scheme

The invert condition (14) is the exact condition which determines whether the transmitted flit has to be inverted to reduce both the self switching activity and the coupling switching activity on the links traversed by the flit. Since the terms $T_2$ and $T_4^{**}$ are weighted with a factor of 8 with respect to $T_{0\to1}$ and $T_{0\to0}$, we can approximate the invert condition as

$$T_2 > T_4^{**}. \tag{15}$$

That is, the flit is inverted if (15) is satisfied. The architecture of the encoder implementing the simplified invert condition is shown in Fig. 5. On the one hand, as will be shown in the next subsection, the simplification of the invert condition results in a reduction of the area, power, and delay of the encoder, since the logic to evaluate the invert condition becomes simpler. On the other hand, the use of the simplified (or approximated) invert condition instead of the exact invert condition reduces the effectiveness of the encoding scheme, since for some combinations of successively transmitted flits (14) and (15) might conflict with each other.

Fig. 5. Architecture of the encoder implementing the simplified invert condition (15).

E. Logic Synthesis Results

The encoder and the decoder have been designed in Verilog HDL at the RTL level, synthesized with Synopsys Design Compiler, and mapped onto a UMC 65 nm technology library. A clock speed of 700 MHz has been considered. Here, we compare the area, power, and timing figures of the proposed encoding scheme (SC) against bus-invert (BI) coding [23], coupling driven bus invert (CDBI) coding [30], and forbidden pattern condition (FPC) codes [17], as they have the highest potential for power saving while still representing a feasible implementation for on-chip communication. Fig. 6 shows the percentage impact on silicon area and power dissipation of the NI due to the data encoding/decoding logic. The baseline NI has minimum buffering and supports OCP 2 and AHB protocols [37].
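As a software counterpart to the RTL encoder and decoder, the end-to-end behavior described in Sections V-C and V-D can be sketched in a few lines of Python. This is a behavioral illustration only, under several assumptions not stated in the text: words are integers, the inv bit is placed in the most significant position, the simplified condition (15) is evaluated over all $w$ lines including the inv line, and the `first_prev` parameter (hypothetical) stands for whatever word preceded the first body flit on the link, for instance the header flit.

```python
def t2_and_t4ss(prev, cur, width):
    """Count Type II pairs and the Type IV pairs that would turn into Type II if inverted."""
    t2 = t4ss = 0
    for i in range(width - 1):
        p0, p1 = (prev >> i) & 1, (prev >> (i + 1)) & 1
        c0, c1 = (cur >> i) & 1, (cur >> (i + 1)) & 1
        if c0 == c1:
            continue  # equal final values: neither Type II nor T4**
        if p0 != c0 and p1 != c1:
            t2 += 1    # both lines switch, in opposite directions
        elif p0 == c0 and p1 == c1:
            t4ss += 1  # neither line switches, lines hold different values
    return t2, t4ss


def encode_packet(body_flits, width, first_prev=0):
    """Encode (width-1)-bit body flits into width-bit link words using condition (15)."""
    mask = (1 << width) - 1
    prev, out = first_prev, []
    for x in body_flits:
        if not 0 <= x < (1 << (width - 1)):
            raise ValueError("body flit does not fit in width - 1 bits")
        word = x  # payload concatenated with inv = 0 in the top bit
        t2, t4ss = t2_and_t4ss(prev, word, width)
        if t2 > t4ss:
            word = ~word & mask  # invert every line; the inv bit becomes 1
        out.append(word)
        prev = word  # the encoded word is what the link actually carried
    return out


def decode_packet(words, width):
    """Recover the body flits: undo the inversion when inv = 1, then drop the inv bit."""
    mask = (1 << width) - 1
    inv_bit = 1 << (width - 1)
    return [((~w & mask) if (w & inv_bit) else w) & (inv_bit - 1) for w in words]


if __name__ == "__main__":
    flits = [0x2A, 0x55, 0x2A]
    encoded = encode_packet(flits, width=8)
    print([hex(w) for w in encoded])
    print(decode_packet(encoded, width=8) == flits)
```

Since the decoder needs only the inv bit of each received word, no state is shared between source and destination NIs, which is what keeps the routers unchanged in the end-to-end scheme.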
We assume 32 bit link and for each encoder type E we consider four different versions named E4, E8, E16, and E32. In En, the link is partitioned in 32/n n-bits sub-links and

780 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 5, MAY 2011 congestions, blocking, multiplexing of packets, and so on.

the encoding scheme E is applied in parallel to each sub-link. There is only one instance of FPC as it is specifically designed to work on 4-bit sub-links. As can be observed, the area overhead is below 8% for all the encoding schemes. With the exception of CDBI-32 and SC-32, the power overhead is below 6%. In the experimental section, we show that, in many cases, this power overhead is completely absorbed by the power saving in the NoC links. In most cases, whatever the delay introduced by the encoding logic, there is no slowdown and the corresponding timing path can be easily ignored, since the encoding of the data can be pipelined and the inversion condition for the first chunk of data is evaluated in parallel with the creation of the header flit.

Fig. 6. Percentage impact on silicon area and power dissipation of the network interface due to the data encoding/decoding logic.

Table II shows the absolute power dissipated by the different elements of the NoC. The power analysis for the router and the NI has been carried out considering different packet sizes (from 2 to 16 flits) and random destinations. The power dissipated by the link strongly depends on the data traveling through it. The value in the table refers to the case in which the transfers determine 25% self switching activity and 25% coupling switching activity.

TABLE II. Absolute Power Dissipation (mW) of the Different Elements of the NoC. Column headings: Router, NI, Link, FIFO, Arbiter, Crossbar, WHRT, Routing. The link value assumes 25% self switching activity and 25% coupling switching activity.

VI. Experiments

In this section, we assess the effectiveness of using data encoding techniques in NoC architectures. We restrict the analysis to the interconnect system components (i.e., links, routers, and network interfaces) without considering the power and energy contribution of the IP cores. This is not a limitation of the analysis since the interconnect system absorbs an important fraction of the overall power budget of the entire system [8]. First, we analyze the effectiveness of different encoding schemes, both in terms of power and energy reduction, focusing on a single communication flow. Then, we perform a complete network analysis taking into account dynamic effects like congestion, blocking, multiplexing of packets, and so on. The analysis is performed on a set of data streams belonging to several media formats and under different synthetic traffic patterns. Finally, the section closes with a real case study.

A. Zero-Load Analysis

In this analysis, we emphasize the energy improvement that can be obtained using the proposed data encoding schemes without considering any specific communication traffic. That is, we limit the analysis to a specific routing path from a source node, S, to a destination node, D. We assume a single communication flow from S to D, without taking into consideration congestion and blocking issues due to the interaction of multiple concurrent communication flows. Such effects will be taken into account in the next subsection.

The following parameters are used. The NoC is clocked at 700 MHz. The baseline NI (i.e., without the encoding/decoding logic) dissipates 5.3 mW. The average power dissipated by the wormhole-based router is 5.7 mW. An inter-router wire has a total capacitance of 592 fF/mm in a 65 nm UMC technology, of which about 80% is due to crosstalk. We assume 2 mm, 32-bit links and a packet size of 16 bytes (8 flits).

We assess the different data encoding schemes on a set of data streams belonging to eight different media formats, namely ASCII text, PDF, gray scale image and true color image (both in BMP and JPEG formats), MP3 audio, and MPEG video. For each class, ten data streams are considered and average values are reported.

Fig. 7 shows the percentage of power saving obtained with different data encoding schemes for several data streams as compared to the case in which no data encoding is used. Negative values mean that there has been an increase in average power dissipation. With the exception of Text and Pic BW bmp, for all the considered data streams almost all the data encoding schemes improve power dissipation. If we focus on SCS, for instance, average power saving ranges from 5% to 24% when we pass from SCS-32 to SCS-4. The rationale behind the fact that smaller partitions result in more power saving is that the estimation of the impact on power dissipation due to the inversion of the bits in the partitions is more accurate for smaller partition sizes. As partition size increases, it becomes more and more probable that sub-sequences of bits in the partition for which there is no crosstalk are inverted in favor of other sub-sequences in the partition which contribute more to the power dissipation.

Fig. 7. Percentage of power saving obtained with different data encoding schemes for several data streams.

It should be pointed out that the use of the considered encoding schemes increases the amount of traffic in the network, since one bit of information in each encoded word is sacrificed in favor of the inv bit. This overhead increases with the number of sub-links into which the link is partitioned. Precisely, the overhead due to the inv bit(s) when encoder E-n (with E ∈ {BI, CDBI, SC} and n ∈ {8, 16, 32}) is used is 1/n. The impact of this overhead on energy consumption is analyzed in Fig. 8, which shows the percentage of energy saving obtained with different data encoding schemes for several data streams as compared to the case in which no data encoding is used. Although the average power dissipated by the encoding/decoding logic for SC and SCS is not negligible (see Fig. 6), the effectiveness of SC and SCS in reducing both the self switching activity and the coupling switching activity is higher than that of the other approaches. That is, the power saving on links when SC or SCS is used counterbalances the power dissipated by the encoding/decoding logic more than the other approaches do. Thus, since the energy consumption is the area below the power curve, and the time extent of the curve is the same for all the schemes using the same number of partitions, the energy saving for SC/SCS is higher than that of the other approaches.

Fig. 8. Percentage of energy saving obtained with different data encoding schemes for several data streams.

The above results have been obtained assuming single-hop communication. The impact of path length on both power saving and energy saving is shown in Fig. 9. For the sake of discussion, let us consider the results obtained for the music data stream. As can be observed, all the encoding schemes yield a power saving [Fig. 9(a)]. In particular, the proposed approaches together with FPC exhibit the highest power saving, up to 33%, 33%, and 36% for SC-4, SCS-4, and FPC, respectively. Average power savings above 20% are also observed when SC-8 and SCS-8 are used. Bus invert and CDBI do not exceed 17% in power saving. In terms of energy [Fig. 9(b)], only the proposed encoding schemes result in energy saving. In particular, SC/SCS in their 8-, 16-, and 32-bit configurations are effective whatever the number of hops, whereas SC/SCS-4 is effective starting from 2 hops.

B. High-Load Analysis

In this section, we compare the different encoding schemes using a cycle accurate NoC simulator based on Noxim [38].
In this way, dynamic effects such as congestion, blocking, and multiplexing of packets through the same link are taken into account, and a clearer picture of the effectiveness of data encoding can be provided. The power estimation models available in Noxim have been updated to take into account the power dissipated by the NIs (augmented with the encoding/decoding logic) and the power dissipated by the links due to both self and coupling switching activity.

1) Energy Analysis: Let us start by analyzing the percentage energy reduction that can be achieved using data encoding schemes on an 8 × 8 NoC under several synthetic traffic patterns. We consider deterministic XY routing, input FIFO buffers of four flits, and packets of eight flits injected at different packet injection rates (pir). Energy figures are computed by running the simulation until 1 MB of traffic is drained by the network. Simulations are repeated for each pir value and energy values are averaged until the 95% confidence intervals are mostly within 2% of the means.

Fig. 10 shows the percentage energy reduction achieved when several data encoding schemes are used for different pirs under bit-reversal traffic for 4-flit and 8-flit packets. Negative values mean an increase of energy consumption. As can be observed, the general trend is that the percentage energy reduction decreases as pir increases. This is expected, given the exponential relationship between communication delay and pir: as the pir increases and approaches the saturation point, a small increment of the injected load produces a large increment of the communication delay. For a given pir, the amount of traffic injected in the network is higher for the mechanisms which use more partitions.
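To make the link between partition size and injected traffic concrete, the following Python sketch counts the flits needed to carry a fixed payload when one inv bit is reserved per sub-link. It is only an illustration under stated assumptions: the function name `flits_needed`, the 1024-bit payload, and the idea that the inv bits travel inside the same 32-bit flit as the data are hypothetical choices, not details taken from the simulator, and the header flit is ignored.

```python
def flits_needed(payload_bits, link_width=32, partition=None):
    """Flits required to carry payload_bits over a link_width-bit link.

    partition=None models the unencoded baseline. Otherwise the link is
    split into sub-links of `partition` bits and, by assumption, one bit
    of each sub-link is given up to the inv bit.
    """
    if partition is None:
        data_bits = link_width
    else:
        if link_width % partition != 0:
            raise ValueError("partition size must divide the link width")
        inv_bits = link_width // partition      # one inv bit per sub-link
        data_bits = link_width - inv_bits       # bits left for payload
    # Integer ceiling division: a partially filled flit still costs a flit.
    return (payload_bits + data_bits - 1) // data_bits


if __name__ == "__main__":
    payload = 1024  # hypothetical payload size in bits
    for n in (None, 32, 16, 8, 4):
        label = "none" if n is None else f"n={n}"
        print(label, flits_needed(payload, partition=n))
```

Under these assumptions the per-flit overhead is 1/n of the link width, so the flit count for the same payload grows as the partitions get smaller, which is the effect the text attributes to the schemes with more partitions.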
The increment of injected load is due to the overhead traffic that carries control information (i.e., the invert bit conditions) for decoding purposes. For this reason, as the saturation point is approached, the completion time for the schemes which use more partitions (i.e., BI-4, CDBI-4, SC-4, SCS-4, and FPC) increases faster than for the other schemes. This explains why, in Fig. 10, from a certain pir value, the slope of the curves for BI-4, CDBI-4, SC-4, SCS-4, and FPC is more pronounced than for the other mechanisms.

Fig. 9. Percent (a) power saving and (b) energy saving for different path lengths (hops) for the music data stream.

Fig. 10. Percentage energy reduction achieved for different pirs under bit-reversal traffic for 4-flit and 8-flit packets.

The experimental results shown in Fig. 10 confirm the theoretical quantitative analysis presented in Section IV. As expected, as the packet size increases, the percentage energy reduction increases. In fact, for larger packet sizes, the fraction of time a generic link is traversed by flits belonging to the same packet increases. Thus, the positive effects of data encoding are exploited for a longer time. It should be pointed out, however, that the results obtained for small packet sizes (e.g., 4-flit packets) are still worthy of consideration. For instance, the percentage energy reduction when SCS-8 is used ranges from 12% to 15% even with small packet sizes.

Due to space limitations, we do not report the detailed results for the other traffic scenarios. However, a summary of the improvements in terms of energy consumption with respect to the case in which no data encoding is used is shown in Fig. 11. The percentage energy reduction is computed for a pir value at which none of the networks is saturated (a network is said to start saturating when an increase in applied load does not result in a linear increase in throughput [39]). On average, only SC, SCS, and FPC provide energy savings, which are up to 16%, 17%, and 3% for SC-8, SCS-8, and FPC, respectively. It is interesting to note that the energy reduction mostly depends on the encoding scheme and is fairly invariant with the traffic scenario. Thus, the average results shown in Fig. 11 are representative of the energy reduction that can be achieved in general cases using the proposed data encoding techniques.

Fig. 11. Percentage energy reduction using different data encoding schemes under different synthetic traffic scenarios.

2) Energy/Power Versus Performance: To assess the tradeoff between completion time (i.e., the amount of time needed to drain a given amount of traffic volume) and the reductions in average power dissipation and total energy consumption of the interconnect system, Fig. 12 shows the distribution of the simulated configurations in the plane % increase of completion time versus % reduction of power dissipation [Fig. 12(a)] and % increase of completion time versus % reduction of energy consumption [Fig. 12(b)]. The percentage increase of completion time is defined as the percentage increase of the time needed to drain a given amount of traffic when a given data encoding scheme is used with respect to the case in which no data encoding is used. Similarly, the percentage decrease of power dissipation (energy consumption) is the percentage reduction of power dissipated (energy consumed) to drain a given amount of traffic when a given data encoding scheme is used with respect to the case in which no data encoding is used. Each point in the graph refers to one of the five synthetic traffic scenarios discussed above.

Fig. 12. Percentage increase of completion time versus (a) percentage decrease of power dissipation, and (b) percentage decrease of energy consumption to drain a given amount of traffic volume when different data encoding schemes are used under several traffic scenarios.

As can be observed in Fig. 12(a), the percentage power reduction is always positive, so there is an improvement of the average power dissipation for all the encoding schemes and under all the synthetic traffic scenarios considered. The oblique solid line partitions the area of the graph into two regions. The points belonging to the bottom region are characterized by a percentage increase of completion time which is greater than the percentage reduction of power dissipation. On the contrary, the points belonging to the top region are characterized by a percentage reduction of power dissipation which is greater than the negative impact due to the percentage increase of the completion time. From this graph, the Pareto-optimal encoding schemes (i.e., those above the oblique line) are SC and SCS. In particular, higher power savings are observed as the granularity of the encoder becomes finer (from SC/SCS-32 to SC/SCS-4). The other schemes (BI, CDBI, and FPC) fall in the bottom region, in which the penalty due to the increase of the completion time is higher than the reduction of the average power dissipation.

Fig. 12(b) shows the relationship between completion time and energy consumption. The plane is divided into three regions. The first region, bounded by the x-axis and the horizontal solid line, collects the dominated points. Such points refer to the data encoding schemes that, for the considered traffic scenarios, resulted in an increase of both energy consumption and completion time as compared to the case in which no data encoding is used. BI and CDBI in all their configurations (32-, 16-, 8-, and 4-bit) belong to this region. The region between the oblique solid line and the horizontal line collects the data encoding schemes for which the percentage reduction in energy consumption is less than the percentage increase in completion time. Finally, the region between the y-axis and the oblique solid line collects the data encoding schemes for which the percentage reduction in energy consumption is greater than the percentage increase in completion time. Of course, this is the most interesting region and, as can be observed, only SC/SCS-32, SC/SCS-16, and SC/SCS-8 belong to this region.

3) Packet Size: As discussed in the general quantitative analysis in Section IV, the effectiveness of a data encoding scheme exploiting the pipelined nature of wormhole switching increases as packet size increases. To quantify this trend, Fig. 13 shows the percentage reduction of energy per flit using SCS for different packet sizes under bit-reversal traffic. The percentage reduction of energy/flit rapidly increases as packet size goes from two flits to six flits. For packets larger than six flits, there are no further energy improvements. In fact, the encoding benefit lost when passing from one packet to the next (note that the header flit is not encoded) is completely amortized when the encoding scheme is exploited over a large number of flits.
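The amortization argument above can be illustrated with a simple model. The Python sketch below assumes that every flit except the header enjoys the same relative energy saving and that the header flit saves nothing; the function name `per_flit_saving` and the 20% body-flit saving are hypothetical and are not values read from Fig. 13 or from the analysis of Section IV.

```python
def per_flit_saving(body_saving, packet_flits):
    """Average relative energy saving per flit for one packet.

    Assumes the header flit is sent unencoded (saving 0) and each of the
    remaining packet_flits - 1 flits saves the fraction `body_saving`.
    """
    if packet_flits < 1:
        raise ValueError("a packet needs at least one flit (the header)")
    encoded_flits = packet_flits - 1
    return body_saving * encoded_flits / packet_flits


if __name__ == "__main__":
    body_saving = 0.20  # hypothetical saving on each encoded flit
    for flits in (2, 4, 6, 8, 16):
        print(flits, round(per_flit_saving(body_saving, flits), 4))
```

With a 20% body-flit saving, the model gives 10% for two-flit packets, about 16.7% for six flits, and 18.75% for sixteen flits. Most of the gain is reached by six flits, which agrees qualitatively with the reported trend, although this toy model flattens gradually rather than saturating.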

Fig. 13. Percentage energy/flit reduction using SCS with different partition sizes, for different packet sizes under bit-reversal traffic.

C. Virtual Channel Based Implementations

As already stated in Section III, the effectiveness of the proposed data encoding schemes decreases if VCs are used. This worsening arises not only because flit-level multiplexing disrupts the flit correlation within a packet, as already discussed in the paper, but also because the power ratio between router and link increases. In fact, VCs introduce some overhead in terms of both additional resources (e.g., buffers, replication of the routing logic, and so on) and mechanisms for their management (e.g., complex arbitration).

Fig. 14 shows the percentage of energy reduction when SCS-8 is used for different pirs and numbers of VCs under uniform traffic. As expected, the average percentage of energy reduction decreases as the number of VCs increases. It should be noted, however, that the proposed encoding scheme still provides energy saving when VCs are used. As can be observed, as pir increases, the percentage of energy reduction decreases more quickly for the implementation without VCs. This is due to the fact that the saturation pir for VC-based implementations is usually higher than that of the baseline implementation (i.e., without VCs), providing higher performance (e.g., lower average delay). Thus, although the power dissipated by VC-based routers is higher than that of the baseline routers, the higher performance allows more traffic to be drained during a given time window, which positively affects energy consumption.

Fig. 14. Percentage of energy reduction when SCS-8 is used for different pirs and numbers of VCs under uniform traffic.

D. A Case Study

In the previous subsection, we found that the SCS scheme represents the Pareto-optimal scheme in terms of both energy versus completion time and power versus completion time as compared to the other data encoding schemes considered in this paper. In this subsection, we assess the effectiveness of SCS on the complex heterogeneous system shown in Fig. 15. The system is composed of the following sub-systems.

1) MMS: a generic MultiMedia System which includes an H.263 video encoder, an H.263 video decoder, an MP3 audio encoder, and an MP3 audio decoder [40].
2) MIMO-OFDM: a MIMO-OFDM receiver in which, to support the maximum data rate of the world-wide spectrum efficiency proposal for next-generation wireless LAN systems, some of the IPs have been parallelized into multiple IPs [41].
3) PIP and MWD: a picture-in-picture application and a multi-window display application [42], [43].

Fig. 15. Heterogeneous system composed of a multimedia sub-system, a MIMO-OFDM receiver, a PIP, and a MWD module.

In this case study, both packet size and packet injection rate vary with the communication flow. For instance, the communication flows involved in MMS-Enc and MMS-Dec use a packet size tuned on the basis of a macroblock. The packet injection rate has been computed for each communication flow on the basis of the bandwidth requirements for each application as reported in [40]-[43].

Fig. 16 shows the percentage reduction of total energy, average power, and energy per flit, and the percentage increase of completion time, when SCS is used with respect to the case in which no data encoding scheme is used. In terms of energy, the best configuration is SCS-8, which saves up to 21% of energy consumption with less than 11% penalty in completion time. Average power dissipation and energy per flit reduce from 18% to 46% and from 18% to 49%, respectively, passing from SCS-32 to SCS-4, with a penalty in completion time which ranges from 2% to 25%.

Fig. 16. Percentage reduction of power, energy, energy/flit, and percentage increase of completion time for different configurations of the SCS scheme.

VII. Conclusion

The power dissipated by the links of a NoC accounts for a significant fraction of the total power budget [8]-[10], [13]-[15]. In this paper, we have proposed the use of data encoding techniques as a viable way to reduce both power dissipation and energy consumption of NoC links. The proposed schemes are transparent to the underlying NoC infrastructure as they operate on an end-to-end basis. No modification of the router architecture or of the link width is needed. Only the NI is augmented with the encoding/decoding logic which, although it represents an overhead, does not introduce a significant penalty in terms of either cost (i.e., silicon area) or latency. The proposed encoding schemes have been compared with several encoding schemes proposed in the literature on a set of representative data streams, both synthetic and extracted from real applications. The experimental analysis has shown that by using the proposed encoding schemes it is possible to reduce the power contribution of both the self switching activity and the coupling switching activity in inter-router links. Precisely, as compared to a baseline implementation in which no data encoding techniques are used, a reduction of up to 37% of power dissipation and 18% of energy consumption has been observed without any significant degradation in terms of either performance or silicon area.
Currently, we are in the evaluation phase of integrating the SCS scheme into the NI of the STMicroelectronics NoC-based interconnection infrastructure. References [1] International Technology Roadmap for Semiconductors: Interconnect. (2006) [Online]. Semiconductor Industry Assoc. Available: [2] S. Pasricha and N. Dutt, Trends in emerging on-chip interconnect technologies, IPSJ Trans. Syst. LSI Design Methodol., vol. 1, pp. 2 17, Aug [3] H.-J. Yoo, K. Lee, and J. K. Kim, Low-Power NoC for High-Performance SoC Design. Boca Raton, FL: CRC Press, [4] S. Borkar, Thousand core chips: A technology perspective, in Proc. ACM/IEEE Design Autom. Conf., Jun. 2007, pp [5] G. D. Micheli and L. Benini, Networks on Chips: Technology and Tools. San Mateo, CA: Morgan Kaufmann, [6] J. A. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S. J. Souri, K. Banerjee, K. C. Saraswat, A. Rahman, R. Reif, and J. D. Meindl, Interconnect limits on gigascale integration, Proc. IEEE, vol. 89, no. 3, pp , Mar [7] J. D. Meindl, Interconnect opportunities for gigascale integration, IEEE Micro, Special Issue Reliab.-Aware Microarchitecture, vol. 23, no. 3, pp , May [8] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, An 80-tile sub-100-w TeraFLOPS processor in 65-nm CMOS, IEEE J. Solid-State Circuits, vol. 43, no. 1, pp , Jan [9] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, The raw microprocessor: A computational fabric for software circuits and general-purpose programs, IEEE Micro, vol. 22, no. 2, pp , Mar. Apr [10] F. Steenhof, H. Duque, B. Nilsson, K. Goossens, and R. P. Llopis, Networks on chips for high-end consumer-electronics TV system architectures, in Proc. Conf. Design Autom. Test Eur., 2006, pp [11] A. 
Ejlali, B. M. Al-Hashimi, P. Rosinger, S. G. Miremadi, and L. Benini, Performability/energy tradeoff in error-control schemes for on-chip networks, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 18, no. 1, pp. 1 14, Jan [12] K. Srinivasan and K. S. Chatha, Layout aware design of mesh based NoC architectures, in Proc. Int. Conf. Hardw.-Softw. Codesign Syst. Synthesis, 2006, pp [13] J. C. S. Palma, L. S. Indrusiak, F. G. Moraes, A. G. Ortiz, M. Glesner, and R. A. L. Reis, Inserting data encoding techniques into NoC-based systems, in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Mar. 2007, pp [14] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, A 5-GHz mesh interconnect for a teraflops processor, IEEE MICRO, vol. 27, no. 5, pp , Sep. Oct [15] L. Carloni, A. B. Kahng, S. Muddu, A. Pinto, K. Samadi, and P. Sharma, Interconnect modeling for improved system-level design optimization, in Proc. Asia South Pacific Design Autom. Conf., 2008, pp [16] A. Jantsch, R. Lauter, and A. Vitkowski, Power analysis of link level and end-to-end data protection in networks on chip, in Proc. IEEE Int. Symp. Circuits Syst., vol. 2. May 2005, pp [17] P. P. Pande, A. Ganguly, H. Zhu, and C. Grecu, Energy reduction through crosstalk avoidance coding in networks on chip, J. Syst. Architure, vol. 54, nos. 3 4, pp , [18] G.-Y. Wei, J. Kim, D. Liu, S. Sidiropoulos, and M. A. Horowitz, A variable-frequency parallel I/O interface with adaptive power-supply regulation, IEEE J. Solid-State Circuits, vol. 35, no. 11, pp , Nov [19] J. Kim and M. A. Horowitz, Adaptive supply serial links with sub- 1v operation and per-pin clock recovery, IEEE J. Solid-State Circuits, vol. 37, no. 11, pp , Nov [20] V. Soteriou and L.-S. Peh, Design-space exploration of power-aware on/off interconnection networks, in Proc. IEEE Int. Conf. Comput. Design, Oct. 2004, pp [21] G. Chen, F. Li, and M. Kandemir, Compiler-directed channel allocation for saving power in on-chip networks, ACM SIGPLAN Not., vol. 
41, no. 1, pp , [22] S. E. Lee and N. Bagherzadeh, A variable frequency link for a poweraware network-on-chip, Integr. VLSI J., vol. 42, no. 4, pp , Sep [23] M. R. Stan and W. P. Burleson, Bus invert coding for low power I/O, IEEE Trans. Very Large Scale Integr. Syst., vol. 3, no. 1, pp , Mar [24] C. Su, C. Tsui, and A. Despain, Saving power in the control path of embedded processors, IEEE Design Test Comput., vol. 11, no. 4, pp , Aug [25] L. Benini, G. D. Micheli, E. Macii, D. Sciuto, and C. Silvano, Asymptotic zero-transition activity encoding for address busses in low-power microprocessor-based systems, in Proc. Great Lakes Symp. VLSI, Mar. 1997, pp [26] E. Musoll, T. Lang, and J. Cortadella, Reducing the energy of address and data buses with the working-zone encoding technique and its effect on multimedia applications, in Proc. Power Driven Architecture Workshop, 1998, pp [27] L. Benini, G. D. Micheli, E. Macii, M. Poncino, and S. Quer, Power optimization of core-based systems by address bus encoding, IEEE Trans. Very Large Scale Integr. Syst., vol. 6, no. 4, pp , Dec [28] L. Benini, A. Macii, E. Macii, M. Poncino, and R. Scarsi, Architectures and synthesis algorithms for power-efficient bus interfaces, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 19, no. 9, pp , Sep [29] G. Ascia, V. Catania, M. Palesi, and A. Parlato, Switching activity reduction in embedded systems: A genetic bus encoding approach, IEE Proc. Comput. Digital Tech., vol. 152, no. 6, pp , Nov


13 786 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 5, MAY 2011 [30] K. W. Kim, K. H. Baek, N. Shanbhag, C. L. Liu, and S. M. Kang, Coupling-driven signal encoding scheme for low-power interface design, in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2000, pp [31] J. Henkel, H. Lekatsas, and V. Jakkula, Encoding schemes for address busses in energy efficient SoC design, in Proc. 11th VLSI-SoC Int. Conf. Very Large Scale Integration, Dec. 2001, pp [32] M. Palesi, F. Fazzino, G. Ascia, and V. Catania, Data encoding for lowpower in wormhole-switched networks-on-chip, in Proc. Euromicro Conf. Digital Syst. Des., 2009, pp [33] G. Ascia, V. Catania, F. Fazzino, and M. Palesi, An encoding scheme to reduce power consumption in networks-on-chip, in Proc. IEEE Int. Conf. Comput. Eng. Syst., Dec. 2009, pp [34] L. M. Ni and P. K. McKinley, A survey of wormhole routing techniques in direct networks, IEEE Comput., vol. 26, no. 2, pp , Feb [35] L. Benini and G. D. Micheli, Networks on chips: A new SoC paradigm, IEEE Comput., vol. 35, no. 1, pp , Jan [36] J. Xi and P. Zhong, A system-level network-on-chip simulation framework with analytical interconnecting wire models, in Proc. IEEE Int. Conf. Electro/Inform. Technol., May 2006, pp [37] D. Bertozzi and L. Benini, Xpipes: A network-on-chip architecture for gigascale systems-on-chip, IEEE Circuits Syst. Mag., vol. 4, no. 2, pp , Apr. Jun [38] F. Fazzino, M. Palesi, and D. Patti. Noxim: Network-on-Chip Simulator [Online]. Available: [39] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, Performance evaluation and design tradeoffs for network-on-chip interconnect architectures, IEEE Trans. Comput., vol. 54, no. 8, pp , Aug [40] J. Hu and R. Marculescu, Energy and performance-aware mapping for regular NoC architectures, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 4, pp , Apr [41] S.-R. Yoon, J. Lee, and S.-C. 
Park, Case study: Noc based nextgeneration WLAN receiver design in transaction level, in Proc. Int. Conf. Adv. Commun. Technol., 2006, pp [42] E. G. T. Jaspers and P. H. N. de With, Chip-set for video display of multimedia information, IEEE Trans. Consumer Electron., vol. 45, no. 3, pp , Aug [43] E. B. van der Tol and E. G. Jaspers, Mapping of MPEG-4 decoding on a flexible architecture platform, in Proc. SPIE: Media Processors, vol , pp Maurizio Palesi (M 06) received the M.S. and Ph.D. degrees in computer engineering from the Università di Catania, Catania, Italy, in 1999 and 2003, respectively. Since November 2010, he has been an Assistant Professor with Kore University, Enna, Italy. Dr. Palesi serves on the Editorial Board of the Very Large Scale Integration Design Journal as an Associate Editor since May He has served as a Guest Editor for the Very Large Scale Integration Design Journal (Special Issue on Networkson-Chip) in 2008, as a Guest Editor for the International Journal of High Performance Systems Architecture (Special Issue on Power-Efficient, High Performance General Purpose and Application-Specific Computing Architectures) in 2009, and as a Guest Editor for the Elsevier MICPRO Journal (Special Issue on Network-on-Chip Architectures and Design Methodologies) in He serves as the Technical Program Committee Member for the following IEEE/ACM international conferences: RTAS, CODES+ISSS, ESTIMedia, SOCC, VLSI, ISC, and SITIS. He was the Co-Organizer of the International Workshops on Network-on-Chip Architectures in 2008, 2009, and Giuseppe Ascia received the M.S. degree in electronic engineering and the Ph.D. degree in computer science from the Università di Catania, Catania, Italy, in 1994 and 1998, respectively. In 1994, he joined the Institute of Computer Science and Telecommunications, Università di Catania. Currently, he is an Associate Professor with the Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Università di Catania. 
His current research interests include soft computing, very large scale integration design, hardware architectures, and low-power design. Fabrizio Fazzino (M 08) received the M.S. degree in computer engineering from the University of Catania, Catania, Italy, in Until 2001, he was responsible for the functional verification of 32 bit lines of microprocessors with STMicroelectronics, Catania. Since 2004, he has collaborated with the Department of Computer and Telecommunications Engineering, University of Catania. He is currently a Silicon Engineer with Icera, Inc., Bristol, U.K. Vincenzo Catania received the M.S. degree with Honors in electrical engineering from the Università di Catania, Catania, Italy, in Until 1984, he was responsible for testing microprocessor systems with STMicroelectronics, Catania. Since 1985, he has cooperated in research on advanced computer architectures and computer networks with the Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Facoltà di Ingegneria, Università di Catania, where he is currently a Full Professor of computer science. Since November 2006, he has served as the Director of the Department of Computer Science and Telecommunications Engineering, Università di Catania. He is the author of more than 200 articles on international journals and conference proceedings and holds two patents. Currently, his research focuses on pervasive embedded systems, network-on-chip architectures, and mobile terminal platforms and services.
