Bandwidth-aware routing algorithms for networks-on-chip platforms
M. Palesi 1, S. Kumar 2, V. Catania 1


Published in IET Computers & Digital Techniques. Received on 6th July 2008. Revised on 2nd April 2009. In Special Issue on Networks on Chip. ISSN 1751-8601

1 Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, University of Catania, Italy
2 Department of Electronics and Computer Engineering, School of Engineering, Jönköping University, Jönköping, Sweden
mpalesi@diit.unict.it

Abstract: General purpose routing algorithms for a network-on-chip (NoC) platform may not be able to provide sufficient performance for some communication-intensive applications. This may be because of the low adaptivity offered by a general purpose routing algorithm, which results in some links becoming highly congested. In this study the authors demonstrate that it is possible to design highly efficient application-specific routing algorithms which distribute traffic more uniformly by using information regarding the application's communication behaviour (communication topology and communication bandwidth). The authors use off-line analysis to estimate the expected load on the various links in the network. The result of this analysis is used along with the available routing adaptivity in each router to distribute less traffic to links and paths which are expected to be congested. The methodology for application-specific routing algorithms is extended to incorporate these features to design highly adaptive deadlock-free routing algorithms which also distribute traffic more uniformly and reduce network congestion. The authors discuss architectural implications and analyse area and power overheads of the proposed approach on the design of a table-based NoC router.

1 Introduction
Network on chip (NoC) is likely to be used in high-performance multi-core embedded systems in the near future. Many factors affect the performance achieved by an application on an NoC platform.
For applications that require intensive communication among cores, the main factor affecting the overall performance of an NoC is its routing algorithm [1]. Traditionally, routing algorithms have been designed without any reference to the characteristics of the traffic which will stimulate the network. The main reason was that, in a general purpose domain, the communication traffic cannot be accurately characterised, thus the routing algorithms are designed to provide deadlock freedom under any type of traffic and give good average performance. As a consequence, the design of the routing algorithm conservatively assumes that all the network nodes may need to communicate with each other.

However, in the application-specific domain, which characterises the area of embedded systems, we assume that an accurate characterisation of the communication traffic is possible [2, 3]. The embedded system designer has good knowledge of the application which will be mapped on the system. This knowledge opens new directions in system optimisation like, for instance, the customisation of the routing algorithm for a given application. Based on this, APSRA, a methodology to design application-specific routing algorithms for NoC systems, was presented in [4].

However, the basic APSRA does not take into account communication attributes such as the communication bandwidth requirements of the different communicating task pairs mapped on different network nodes. Thus, the selection of the routing paths to be removed to restrict the routing function and to guarantee deadlock freeness is carried out in a blind fashion. It is equivalent to assuming that all the communications have the same bandwidth requirements. Such unawareness may lead to a bad distribution of the traffic load over the network. This is particularly true when the range of the bandwidth requirements of different communications is large. Unfortunately, this is a very frequent case in real applications.
In [5], for example, the range of communication bandwidth requirements for a Video Object Plane decoder in an MPEG-4 decoder system spans from 16 to 500 MB/s.

The performance of a routing algorithm designed using the APSRA methodology will also be greatly affected by the selection function in the router. This function should dynamically choose one among multiple admissible output ports for a new packet. We propose a new strategy for load estimation and design of the selection function. We propose that the application's communication behaviour along with the routing function (topology, admissible paths, communication bandwidth between pairs etc.) should be analysed off-line and selection probabilities should be assigned to each admissible output port for a packet coming from a certain input port.

The above two considerations motivate this work and our proposal for improvement of the APSRA methodology. As the traffic characteristics of a communicating node pair are generally different from those of another pair, they should be distinguished. For this reason, we believe that emphasising the role of communication bandwidth requirements during the design of the routing algorithm adds a new degree of freedom in system performance optimisation.

IET Comput. Digit. Tech., 2009, Vol. 3, Iss. 5

2 Related work
An adaptive routing algorithm can be seen as the cascade of two main blocks which implement an adaptive routing function and a selection function (a.k.a. selection policy or selection strategy), respectively. First, a routing function computes the set of admissible output channels towards which the packet can be forwarded to reach the destination. Then, a selection function is used to select one output channel from the set of admissible output channels depending on dynamic network conditions and/or locally stored information. Both blocks have an important impact on overall network cost and performance and constitute the topic of this paper.

Regarding routing functions, many proposals for wormhole-switched networks have been presented in the literature [6-10].
Glass and Ni in [7] propose a turn model for designing wormhole routing algorithms for mesh and hypercube topology networks that are deadlock and livelock free. This model was later utilised by Chiu [10] to develop the Odd-Even adaptive routing algorithm for meshes without virtual channels. In comparison with the turn model, the degree of routing adaptiveness provided by the Odd-Even model is more even across different source-destination pairs. Murali et al. [11] present a methodology to design application-specific NoCs using floorplan information. The routing function is designed by using the turn prohibition algorithm presented by Starobinski et al. [12]. Starobinski's approach assumes that all the nodes of the network communicate with each other, but this assumption is far from reality, especially in the scenario of a heterogeneous system-on-chip implementing a specific application. Another application-specific design methodology for NoC systems is presented by Srinivasan et al. [13], where virtual channels are used to deal with deadlocks. An application-specific routing algorithm named APSRA has been proposed by Palesi et al. [4]. APSRA exploits communication information to maximise the adaptivity while ensuring deadlock-free routing for an application. The COmmunication Synthesis Infrastructure (COSI) framework [14] is used to define specific interconnect design flows for a variety of applications from chips to systems. Routing is modelled in a way that is very similar to the definition of routing tables in APSRA [4]. Moreover, as in APSRA, the definition of deadlock is based on the channel dependency graph. Our current work extends the APSRA methodology to achieve the multiple objectives of maximising adaptivity and distributing traffic more uniformly over the network.
As regards selection functions, in [15], Schwiebert and Bell presented a detailed simulation study of various selection functions for several fully adaptive wormhole routing algorithms for 2D meshes. The obtained results show that the choice of selection function has a significant effect on the average message latency and saturation behaviour. Similar conclusions have been drawn by Feng and Shin [16]. An analysis of several selection functions in order to evaluate their influence on network performance has been carried out by Martinez et al. [17]. Improvement in network throughput (up to 10%) and in latency when network is close to saturation (up to 40%) has been observed. Hu and Marculescu [18] propose a routing scheme called DyAD which combines the advantages of both deterministic and adaptive routing schemes. The router works in deterministic mode when the network is not congested, and switches to an adaptive mode when the network becomes congested. In [19] Ye et al. present a contention-look-ahead on-chip routing scheme that is similar to [20]. It is a non-minimal routing in the sense that based on the value of two delay penalty indices the router chooses whether to send the packet towards a profitable route (minimal route) or a misroute (non-minimal route). The proposed approach has not been proved to be deadlock free. Differently from the other approaches which focus on output selection, in [21] the authors investigate the impact of input selection and present a contention-aware input selection technique that improves the routing efficiency. The concept of neighbours-on-path has been defined by Ascia et al. [22] to design a new selection policy which takes decision based on information deriving from the status of nodes belonging to the admissible paths from the current node. There is an abundance of work on path selection with bandwidth and latency awareness [23, 24]. 
Extensive research on these topics has been carried out in the context of telecommunication and data networks. To the best of our knowledge, bandwidth-aware routing is a topic that has been left largely untouched in the context of on-chip interconnection networks. Except for APSRA, none of the aforementioned works exploits application information to optimise the routing algorithm.

© The Institution of Engineering and Technology 2009

Although APSRA uses communication topology and communication concurrency information, there are other important features that could be exploited to improve the effectiveness of a routing algorithm. Communication bandwidth is one of them. General routing algorithms assume that all the communications are characterised by the same bandwidth requirements. This behaviour is rarely observed in real applications. For instance, looking at the task graph of the multimedia application [25] shown in Fig. 7, communication bandwidth requirements range from 10 to 500 MB/s. To the best of our knowledge, there are no contributions aimed at improving the performance of a routing algorithm by exploiting communication bandwidth information. This paper contributes in this direction by presenting a methodology to design application-specific and bandwidth-aware routing functions along with a novel selection policy.

This paper is an extension of [26]. Extensions include power, area and timing analysis of the router implementing the proposed routing scheme; delay, throughput and energy analysis of the NoC; and both an informal and a formal description of the methodology.

3 Terminology and problem formulation
Simply stated, for a given application and a given network topology, the goal is to generate a routing algorithm which is strongly adaptive and spreads the traffic over the network in such a way that the communication traffic of any link will not exceed its capacity (maximum sustainable bandwidth). To formulate the problem more formally, we borrow the following terms from [4].

The communication graph, $CG = G(T, C)$, is a directed graph, where $T$ is the set of tasks and $C$ is the set of communications. Each communication $c_{i,j} = (t_i, t_j) \in C$ connects task $t_i \in T$ to task $t_j \in T$.
For a communication $c \in C$, the function $B(c)$ returns the bandwidth requirement, that is the minimum bandwidth that should be allocated by the network in order to meet the performance constraints for communication $c$.

The topology graph, $TG = G(N, L)$, is a directed graph which models the network topology. $N$ is the set of network nodes, and $L$ is the set of network channels. Channel $l_{i,j} = (n_i, n_j)$ connects node $n_i \in N$ to node $n_j \in N$. Given a channel $l \in L$, the function $Cap(l)$ returns its capacity.

The mapping function, $M: T \to N$, maps tasks to network nodes [e.g. if $M(t_i) = n_j$ then task $t_i$ is mapped on node $n_j$ of the network].

3.1 Link load estimation
As we are dealing with adaptive routing, the required bandwidth for communication $c$ is split over the multiple paths that the routing function allows for $c$. For the sake of example, consider Fig. 1, which shows a 4 × 2 mesh-based network topology. Let us suppose that communication $c = (t_s, t_d)$ requires a bandwidth of 100 MB/s and that the routing function allows all the minimal paths from node $n_s = M(t_s)$ to node $n_d = M(t_d)$ (four paths in total). The load is distributed over the paths as shown in Fig. 1, which reports, for each network channel, the effective bandwidth (or effective load) (EB) and the total number of paths containing that channel.

Figure 1 Effective bandwidth for a communication from node $n_s$ to node $n_d$ at 100 MB/s assuming a fully adaptive minimal routing

Formally, the effective bandwidth of a channel $l \in L$ because of a communication $c \in C$ can be computed as

$$EB(c, l) = B(c)\,\frac{|PT(c, l)|}{|P(c)|}$$

where $P(c)$ denotes the set of minimal paths admitted by the routing function for communication $c$, and $PT(c, l) = \{p \in P(c) : l \in p\}$ is the pass through link set, that is the set of paths of $c$ which contain the link $l$.
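The effective bandwidth definition can be checked on the Fig. 1 example with a minimal Python sketch. The node coordinates are an assumption made for illustration (the figure itself is not reproduced here): $n_s$ is placed at (0, 0) and $n_d$ at (3, 1), with x growing eastward and y growing southward. The function names `minimal_paths` and `effective_bandwidth` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from fractions import Fraction

def minimal_paths(src, dst):
    """Enumerate all minimal paths of a 2D mesh as lists of directed links."""
    if src == dst:
        return [[]]
    (x, y), (dx, dy) = src, dst
    paths = []
    if x != dx:  # move one hop along the x dimension
        nxt = (x + (1 if dx > x else -1), y)
        paths += [[(src, nxt)] + p for p in minimal_paths(nxt, dst)]
    if y != dy:  # move one hop along the y dimension
        nxt = (x, y + (1 if dy > y else -1))
        paths += [[(src, nxt)] + p for p in minimal_paths(nxt, dst)]
    return paths

def effective_bandwidth(bandwidth, paths):
    """EB(c, l) = B(c) * |PT(c, l)| / |P(c)| for every link l used by c."""
    pass_through = Counter(link for path in paths for link in path)
    return {link: Fraction(bandwidth * n, len(paths))
            for link, n in pass_through.items()}

# Assumed placement of n_s and n_d in the 4 x 2 mesh.
paths = minimal_paths((0, 0), (3, 1))
eb = effective_bandwidth(100, paths)
east_link = ((0, 0), (1, 0))   # first hop towards east
south_link = ((0, 0), (0, 1))  # first hop towards south
```

Under these assumptions the sketch yields four minimal paths, with three of them leaving $n_s$ through the east channel and one through the south channel, so the two output channels of $n_s$ carry 75 and 25 MB/s respectively.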
Finally, we indicate with $AB(l)$ the aggregate bandwidth of $l$, which is computed as

$$AB(l) = \sum_{c \in C} EB(c, l)$$

Using these definitions, the bandwidth-aware routing algorithm problem should meet the following constraint. Given a communication graph $CG$, a topology graph $TG$ and a mapping function $M$, find a routing function $R$ which is deadlock free and such that

$$\forall l \in L \Rightarrow AB(l) \le Cap(l) \qquad (1)$$

that is, the communication load of any channel $l$ must not exceed its capacity $Cap(l)$.

4 The proposed methodology
In this section we provide a high-level overview of the proposed methodology and we discuss the assumptions made and its limitations.

4.1 Overview
An overview of the proposed methodology is shown in Fig. 2. The application is modelled by means of a communication graph. The communication graph, together with the topology graph and a mapping function which defines where each task is mapped on the NoC, represents the input of the proposed methodology. This information is used to build the application-specific channel dependency graph (ASCDG) [4]. If it contains cycles, they are iteratively broken by removing application-specific dependencies selected by means of a procedure that will be discussed in Section 5.1. The heuristic behind such a procedure is to assign more adaptivity to communications characterised by higher communication bandwidth requirements. As soon as all the cycles have been removed, the routing function is deadlock free.

Then, a link load analysis is performed to identify links in which the aggregated bandwidth exceeds the link capacity. In this case a load balancing procedure, which will be described in Section 5.2, is used to selectively remove routing paths and to reduce the aggregated bandwidth on overloaded links. At the same time, it tries to allocate alternative routing paths in such a way that load is distributed almost equally among links. As a result a new deadlock-free routing function is obtained. Finally, a set of selection probabilities, which will be used by the selection policy described in Section 5.3, is computed.

Figure 2 Block diagram of the proposed design flow

4.2 Assumptions and scope
In this work two important issues are not covered. The first is related to the way in which the communication characteristics induced by the application are modelled, and the second concerns the out-of-order delivery problem which characterises any adaptive routing algorithm.

In the overview of the proposed methodology we assumed that the input application is already mapped and scheduled on the NoC platform before the design of the routing algorithm starts. We also assumed that the communication volume between various tasks (and hence between various cores after the mapping step) is already determined using application profiling. It should be pointed out that, although a bandwidth annotated communication graph (also known as the core graph, the communication task graph or the application characterisation graph) is generally used as the entry point in many design methodologies [2, 3, 27], the application profiling task, which determines the communication volume between various tasks (even before the application and communication are mapped onto the platform), is still an open issue. In this context, the design space exploration tool from hartes could be useful for this purpose [28]. Another example is the task graph extraction (TGE) tool from Princeton [29].

The way in which communications are characterised in this work constitutes a simplification of the problem. In fact, a communication is characterised in terms of its maximum bandwidth requirement only, without considering other important communication attributes like burstiness. This simplified model of communication behaviour results in a pessimistic analysis, as we assume that a communication will demand the same bandwidth (the maximum bandwidth) for its entire lifetime. Further, the assumption that all communications are potentially concurrent results in an exaggerated communication traffic density which may never occur if communication dependencies are taken into consideration. This may reduce the actual benefit of our schemes when applied to real applications in which the degree of communication concurrency is lower. The second open issue in this work is related to the routing algorithm.
Although the routing algorithm we propose is multi-path, we do not take into consideration the mechanisms required for reordering packets at the destination. To cope with the out-of-order packet delivery problem which characterises any adaptive routing algorithm, one possibility is to use the re-ordering mechanism at network reconvergent nodes proposed by Murali et al. [30]. In this case the routing function needs to be restricted in such a way as to remove all the intersecting paths for each source/destination pair. However, this will strongly impact the effectiveness of the proposed routing algorithm, since one of its main benefits (high adaptivity) is reduced. In this work we distinguish application performance from network performance, although the former depends on the latter. Our focus is to improve network performance (network latency and throughput) and not application performance. That is, the proposed routing method, like other adaptive routing algorithms, is more useful to applications which can tolerate out-of-order delivery of packets.

5 Bandwidth-aware routing algorithm
In this section we present our proposal for designing highly adaptive deadlock-free and bandwidth-aware routing algorithms. The section is organised in three subsections. The first subsection presents the strategy used to select and remove dependencies in the ASCDG so as to minimise the amount of bandwidth that must be redistributed among the remaining routing paths. The second subsection deals with the problem of checking and recovering when the aggregated bandwidth on some network links exceeds the link capacity. Finally, the last subsection describes a new selection function aimed at exploiting the peculiarities of the proposed routing function.

5.1 Bandwidth-aware routing function
A cycle in the ASCDG is a succession of application-specific direct dependencies $D = \{d_1, d_2, \ldots, d_n\}$, where each $d \in D$ is a pair $(l_i, l_j)$ with $l_i, l_j \in L$.
Here the problem is the selection of the best dependency to be removed to break the cycle $D$. Removing a dependency means removing all the paths which use that dependency. As soon as a path is removed, the fraction of bandwidth it transports must be redistributed between the remaining paths. For instance, suppose that the direct dependency $d$ between channel $l_i$ and channel $l_j$ in Fig. 1 must be removed to break a cycle in the ASCDG. Removing $d$ means prohibiting path 3. As soon as path 3 is removed, the 25 MB/s it transports are redistributed between path 1 and path 2 as shown in Fig. 3a.

Figure 3 Bandwidth allocation
a After removing channel dependency from $l_i$ to $l_j$ in Fig. 1
b After removing path 2 from a

The idea we propose in this paper is to choose and remove the dependency $d$ which minimises the overhead of bandwidth that should be allocated to the remaining paths that do not use the dependency $d$. Formally, let us indicate with $PT_2(c, d)$ the pass through dependency set, that is the set of paths of $c$ which use the dependency $d = (l_1, l_2)$

$$PT_2(c, d) = PT(c, l_1) \cap PT(c, l_2)$$

Let $d$ be an application-specific direct dependency. To remove $d$, all the paths of any communication $c$ which use $d$ must be removed. For communication $c$ the aggregated bandwidth to be redistributed is $[B(c)/|P(c)|] \cdot |PT_2(c, d)|$. This bandwidth is redistributed between the $|P(c)| - |PT_2(c, d)|$ remaining paths which do not use the dependency $d$. Based on this, the dependency to be removed is the $d \in D$ such that the cost function

$$cost(d) = \sum_{c \in C} \frac{B(c)\,|PT_2(c, d)|}{|P(c)|\,\bigl(|P(c)| - |PT_2(c, d)|\bigr)} \qquad (2)$$

is minimised. This ensures that the dependency chosen for removal is the one for which redistributing the load of the removed paths results in the minimum increase in load on the alternative paths.

The cycle breaking algorithm is shown in Fig. 4. First, all the cycles of the ASCDG are detected by the function GetAllCycles and stored in the list cycles. Then, the so-called enumeration tree is built. The meaning of the enumeration tree is as follows. The order in which the cycles in the ASCDG are treated determines both the overall adaptivity of the generated routing algorithm and the routability of all the communications. More precisely, with regard to the second point, certain cycle removal sequences might make some communications unroutable. In our implementation we used a back-tracking mechanism in which removal sequences are generated by performing a depth-first search of the solution space. Fig. 5 shows the enumeration tree generated by four cycles $c_1$, $c_2$, $c_3$, $c_4$. If, for instance, the removal sequence $c_1 \to c_2$ causes reachability problems then the sub-tree under $c_1 \to c_2$ is not considered for further analysis. The back-tracking mechanism returns to $c_1$. If the removal sequence $c_1 \to c_3 \to c_2$ results in a reachability problem then the back-tracking mechanism returns to $c_3$. If the removal sequence $c_1 \to c_3 \to c_4 \to c_2$ is feasible (i.e. it does not result in reachability problems) the procedure terminates.

Figure 4 Break cycles algorithm

Figure 5 Enumeration of cycle sequences for four cycles

The steps to break all the cycles of the ASCDG start from line 6 in Fig. 4. First, a backup of ASCDG, C and P is performed. Then, a cycle sequence cseq is extracted from the enumeration tree. The steps from lines 10 to 22 remove all the cycles in the same sequence as defined by cseq. For each such cycle, only the channel dependencies that, if removed, do not cause reachability problems are considered. This check is performed by ensuring that no communication has all of its routing paths using that channel dependency (line 13). Thus, the channel dependency $d'$ which minimises the cost function (2) is selected and removed from the ASCDG (line 27). Then, all the routing paths which use $d'$ are removed from the set of admissible paths (line 28). In case of reachability problems (line 24), the ASCDG, C and P are restored and the sub-tree of the enumeration tree whose root is the cycle whose removal has caused reachability problems is pruned (line 25). In this case a new iteration is performed with a new sequence of cycles (line 6).

The overall time complexity of the algorithm is $O(2^n)$, where $n$ depends on the size of the rectangle containing the source and destination nodes (in the case of a mesh-based topology). The complexity of the proposed approach is due not to the heuristic itself but to the computation of the ASCDG. The construction of the ASCDG involves the annotation of each minimal path between any source/destination pair as defined in the communication graph. The basic assumption is that we start from a minimal fully adaptive routing algorithm. Thus, as the NoC size increases, the approach could become infeasible if some nodes located far from each other need to communicate. It should be pointed out, however, that this is the worst-case condition. In fact, any topological mapping algorithm tries to map the most frequent and most critical communications in such a way as to minimise the physical distance between the source and destination nodes. For long-distance communications, one can consider a subset of all the minimal paths. A detailed analysis of the complexity of building the ASCDG has been presented in our previous work [4].

5.2 Bandwidth reallocation
Using the procedure discussed in the previous subsection, we obtain a routing function which is deadlock free (as the ASCDG is acyclic) and which generates a set of routing paths by providing more adaptivity to communications characterised by higher communication bandwidth. However, it is possible that the aggregate bandwidth on some network links exceeds the capacity of these links [i.e. condition (1) is not satisfied for some $l \in L$].
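The cost function (2) of Section 5.1 and the check of condition (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data layout (a list of `(bandwidth, paths)` pairs, with each path a list of link labels), the function names, the link labels `a` to `y` and the 60 MB/s capacity are all hypothetical. Returning an infinite cost when a removal would leave a communication without paths is also an assumption; in the paper such dependencies are simply excluded from consideration.

```python
def pass_through(paths, link):
    """PT(c, l): indices of the paths of one communication that contain link."""
    return {i for i, path in enumerate(paths) if link in path}

def removal_cost(dep, comms):
    """Cost (2) of removing dependency dep = (l1, l2) over all communications."""
    l1, l2 = dep
    total = 0.0
    for bandwidth, paths in comms:
        # PT2(c, d) = PT(c, l1) intersected with PT(c, l2)
        pt2 = pass_through(paths, l1) & pass_through(paths, l2)
        if not pt2:
            continue
        remaining = len(paths) - len(pt2)
        if remaining == 0:
            return float("inf")  # assumption: removal would make c unroutable
        total += bandwidth * len(pt2) / (len(paths) * remaining)
    return total

def aggregate_bandwidth(comms):
    """AB(l): sum over communications of B(c) * |PT(c, l)| / |P(c)|."""
    load = {}
    for bandwidth, paths in comms:
        share = bandwidth / len(paths)
        for path in paths:
            for link in path:
                load[link] = load.get(link, 0.0) + share
    return load

def overloaded_links(comms, capacity):
    """Links violating condition (1) for a uniform link capacity."""
    return sorted(l for l, ab in aggregate_bandwidth(comms).items() if ab > capacity)

# One hypothetical 100 MB/s communication with four admissible paths.
comms = [(100, [["a", "b", "x"], ["a", "c", "x"], ["a", "c", "y"], ["d", "e", "y"]])]
cost_ab = removal_cost(("a", "b"), comms)  # one path removed, three remain
cost_ac = removal_cost(("a", "c"), comms)  # two paths removed, two remain
hot = overloaded_links(comms, capacity=60)
```

In this toy instance removing the dependency (a, b) is cheaper than removing (a, c), because fewer paths are lost and more paths remain to absorb the redistributed bandwidth, and link `a` is the only one whose aggregate bandwidth exceeds the assumed capacity.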
In this case some routing paths passing through that link must be removed to reduce the aggregate bandwidth on that link down to the link's capacity or, more generally, down to a user-defined value. For instance, looking again at Fig. 3a, if either the network link capacity is 50 MB/s or we want the link load not to exceed 50 MB/s, path 2 should be removed as shown in Fig. 3b.

The proposed bandwidth reallocation algorithm is shown in Fig. 6. The input parameters are the set of network links, the set of communications, the set of admissible paths derived from the procedure described in the previous subsection and a threshold which defines the maximum bandwidth which must not be exceeded in any network link. The output is the updated set of routing paths.

Figure 6 Bandwidth reallocation algorithm

The procedure starts by sorting the network links in descending order based on their aggregate bandwidth. For each link $l$ and for each communication $c$ which has at least one path using $l$, and more than one path, two lists named paths2rem and paths2enr are generated as follows. paths2rem contains all the paths for $c$ that should be removed as they use network links whose load exceeds the threshold. paths2enr contains those paths that can be used by other communications (i.e. can be enriched) as they use links whose load is below the threshold. Then, the list paths2rem is scanned and the routing paths belonging to it are removed from P. Of course, removing a path causes the redistribution of the bandwidth allocated on it to the other paths belonging to paths2enr (see, for example, Fig. 3). Thus, the path elimination stops when there is at least one path in paths2enr that contains a link whose load exceeds the threshold. The above steps are repeated until the load on each link does not exceed the threshold.
This procedure aborts if the path elimination step cannot be carried out because of reachability issues, which arise when the path to be removed is the only one left for a certain communication. Although the presented algorithm assumes that all the network links have the same capacity, it is simple to generalise by replacing the scalar input parameter threshold with a function $T: L \to \mathbb{R}$ which returns the bandwidth threshold associated with any channel $l \in L$. In this case, the condition $AB(l) > threshold$ in lines 5, 15 and 24 is replaced with $AB(l) > T(l)$.

5.3 Load balancing selection function
To be effective, a good routing function must be coupled with an intelligent selection function. In fact, selection schemes strongly affect the overall performance of any adaptive routing algorithm [15-17]. Generally, selection policies take decisions based on on-line measurement or estimation of traffic density. However, such estimation is a costly and difficult task. One of the ways to implement the selection function is to randomly distribute packets to admissible output ports. But this selection policy can lead to a large load imbalance on network links and in practice degrade network performance. Online information about traffic density and congestion on the paths leading to the packet destination can be useful in selecting the appropriate admissible port. Most of the current approaches use local information regarding the usage of the buffer associated with an output port in the router (or in the neighbouring router in that direction) as a measure of communication traffic in that direction [18]. Some approaches use more elaborate look-ahead strategies for this purpose [22]. These selection strategies give better latency performance, especially when communication volume is high.

Figure 7 Communication graph of the MMS

The idea behind the proposed selection policy can be summarised by means of an example. Let us consider again Fig. 1. Let us suppose that all the four minimal paths from node $n_s$ to node $n_d$ are allowed by the routing function. When $n_s$ receives a header flit destined to $n_d$, the routing function returns, as the set of admissible output channels, the set {east, south}. Now, let us suppose that the router in node $n_s$ is aware of the number of admissible paths to reach node $n_d$ starting from channel east and from channel south, respectively. In our example, there are three paths from east and one path from south. So, the selection policy should use the east output channel with higher probability than the south output channel (e.g. use the east port with probability 0.75 and the south port with probability 0.25).
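The east/south example above can be sketched in Python as follows. This is an illustrative software model only, not the router implementation: the function names are hypothetical, the probabilities are derived from path counts as in the example, and a half-open interval per port is assumed for the random draw.

```python
import random
from itertools import accumulate

def selection_probabilities(path_counts):
    """Probability of each admissible port, proportional to its path count."""
    total = sum(path_counts.values())
    return {port: n / total for port, n in path_counts.items()}

def select(probs, xi):
    """Return the port whose cumulative-probability interval contains xi."""
    ports = list(probs)
    upper_bounds = accumulate(probs[p] for p in ports)
    for port, upper in zip(ports, upper_bounds):
        if xi < upper:
            return port
    return ports[-1]  # guards against rounding when xi is very close to 1

# Path counts of the Fig. 1 example: three paths via east, one via south.
probs = selection_probabilities({"east": 3, "south": 1})

# Computed off-line once; at run time only a random draw is needed.
chosen = select(probs, random.random())
```

Because the probabilities depend only on the routing function, they can be tabulated ahead of time, which is why no on-line congestion measurement is required by this policy.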
Formally, let j be a uniformly distributed random variable in the interval [0, 1], and {l 1, l 2,..., l n } the set of admissible output channels 420 IET Comput. Digit. Tech., 2009, Vol. 3, Iss. 5, pp & The Institution of Engineering and Technology 2009

9 returned by the routing function, then the selection function is defined as " # S(l 1, l 2,..., l n ) ¼ l i, i:j [ Xi 1 Pr{l j }, Xi Pr{l k } where Prflg indicates the probability to select output channel l, which is proportional to the number of admissible paths starting from l and that can be used to reach the destination. Of course, these probabilities are computed off-line and stored into the router as discussed in Section 7.1. j¼1 6 Evaluation and results We evaluate the proposed approach on both synthetic and real traffic scenarios. As synthetic traffic scenarios, we consider uniform, transpose, bit-reversal, shuffle, butterfly and hot-spot [31]. For them the bandwidth for each communicating pair has been randomly generated between 10 and 100 MB/s. As a more realistic communication scenario we consider a generic multimedia system (MMS) which includes an H.263 video encoder, an H.263 video decoder, an mp3 audio encoder and an mp3 audio decoder [25]. The communication graph of MMS is depicted in Fig. 7. It has been partitioned into 40 distinct tasks which have been mapped on a 5 5 mesh-based NoC using the mapping technique proposed in [32]. In the following we refer as APSRA the approach proposed in [4], with APSRA-BW the variant of APSRA using the heuristic presented in Section 5.1, and with APSRA-BWL the augmented version of APSRA-BW with the bandwidth reallocation procedure discussed in Section 5.2. We organise this section in two subsections. In the first one, we perform a bandwidth analysis aimed to show how the proposed approach allows to (i) uniformly distribute the communication bandwidth over network links, and (ii) avoid that bandwidth allocated on network links exceed link capacity. In the second subsection, we perform a dynamic analysis using a flit-accurate NoC simulator to show the performance improvements both in terms of delay and throughput. 
6.1 Bandwidth analysis

Let us start by showing the effectiveness of the proposed approach in uniformly distributing the traffic over the network. To do this, we use as a metric the standard deviation of the aggregate bandwidth in the network links. Using this metric, we compare APSRA, APSRA-BW and APSRA-BWL on an 8 × 8 mesh-based NoC under different traffic scenarios. For APSRA-BWL, we fix the threshold to 90% of the maximum aggregate bandwidth when fully adaptive minimal routing is used. For each traffic scenario, Table 1 reports the percentage reduction of the standard deviation of the aggregated bandwidth in network links when APSRA-BW and APSRA-BWL are used.

Table 1 Percentage reduction of standard deviation of the aggregated bandwidth in network links

Traffic        APSRA-BW   APSRA-BWL
uniform
bit-reversal
butterfly      0          2
shuffle
transpose1     0          2
transpose2     0          2
hot-spot_c
hot-spot_tr    5          10
MMS            5          5
Average

As can be seen, the proposed heuristic to break cycles of the ASCDG distributes the bandwidth better across the network. In some situations there is no reduction in standard deviation. This is the case for transpose and butterfly traffic, in which the ASCDG is acyclic and the edge-cutting heuristic is therefore not applied. On average, the standard deviation of the aggregated bandwidth in network links decreases by 10%. An additional improvement of 2% is obtained when the bandwidth redistribution procedure is used.

On the other hand, as discussed in Section 5.2, the elimination of some routing paths by the bandwidth redistribution procedure negatively affects the adaptiveness [10] of the routing function, as shown in Fig. 8. It is interesting to observe that, for some traffic scenarios such as bit-reversal and shuffle, the adaptivity of APSRA-BW is higher than that of APSRA. This happens because, although the main objective of APSRA is the maximisation of adaptivity, the heuristic it uses to break cycles stops as soon as the first solution is found.
At any rate, as can be observed, the average adaptivity remains much higher than that of odd-even [10].

Fig. 9 shows the aggregate bandwidth of each link of a 9 × 9 mesh-based NoC under uniform traffic for the routing algorithms generated by APSRA and by APSRA-BWL. The threshold has been fixed to 550 MB/s. As can be observed, when APSRA is used, the aggregate bandwidth in several links exceeds the threshold. If this threshold represents the network link capacity, such excess bandwidth translates into local network congestion that, because of the back-pressure mechanism combined with wormhole switching, propagates to the entire network and causes a strong degradation of overall network performance.


Figure 8 Adaptivity exhibited by odd-even and by routing algorithms generated by APSRA, APSRA-BW and APSRA-BWL under different traffic scenarios

Figure 9 Aggregate bandwidth per link for a 9 × 9 mesh-based NoC under uniform traffic. Routing algorithm used is generated by APSRA (top) and APSRA-BWL (bottom)

Fig. 10 shows the absolute number of network links which exceed a given threshold when APSRA, APSRA-BW and APSRA-BWL are used. As can be observed, both APSRA-BW and APSRA-BWL reduce the number of bandwidth violations as compared to APSRA. On average, the number of links exceeding the threshold when APSRA-BWL is used is about half of that obtained when APSRA is used. In particular, APSRA-BWL makes it possible to meet bandwidth constraints which are almost 30% and 20% more stringent as compared to APSRA and APSRA-BW, respectively.
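The two metrics used in the bandwidth analysis (standard deviation of the aggregate bandwidth per link, and number of links above a threshold) can be illustrated with a minimal Python sketch. The function names, the toy four-link network and the representation of a flow as a bandwidth share bound to one path are assumptions made for illustration only; they are not taken from the APSRA tool flow.

```python
from statistics import pstdev


def aggregate_link_bandwidth(flows, all_links):
    """Sum, for every link, the bandwidth (MB/s) of the flows that cross it.

    flows: list of (bandwidth, path) pairs, where path is a list of link ids.
    all_links: every link of the network, so that idle links count as 0.
    (Hypothetical data model: one bandwidth share per path.)
    """
    load = {link: 0.0 for link in all_links}
    for bandwidth, path in flows:
        for link in path:
            load[link] += bandwidth
    return load


def links_over_threshold(load, threshold):
    """Return the links whose aggregate bandwidth exceeds the threshold."""
    return sorted(link for link, bw in load.items() if bw > threshold)


def percent_reduction(baseline, improved):
    """Percentage reduction of a metric with respect to a baseline value."""
    return 100.0 * (baseline - improved) / baseline


# Toy network with four links (illustrative values only).
LINKS = ["A-B", "B-C", "A-D", "D-C"]

# Both flows share the same path: two links are heavily loaded, two are idle.
concentrated = aggregate_link_bandwidth(
    [(100.0, ["A-B", "B-C"]), (80.0, ["A-B", "B-C"])], LINKS)

# The second flow is moved to the alternative path.
balanced = aggregate_link_bandwidth(
    [(100.0, ["A-B", "B-C"]), (80.0, ["A-D", "D-C"])], LINKS)

std_concentrated = pstdev(concentrated.values())
std_balanced = pstdev(balanced.values())
reduction = percent_reduction(std_concentrated, std_balanced)
```

In this toy case, spreading the second flow over the alternative path lowers the standard deviation and removes both threshold violations for a threshold of 150 MB/s, which is the qualitative effect reported for APSRA-BW and APSRA-BWL.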

Figure 10 Absolute number of network links which exceed the threshold when APSRA, APSRA-BW and APSRA-BWL are used

6.2 Performance analysis

We now evaluate the different routing algorithms in terms of average delay. Delay is defined as the time (in clock cycles) that elapses from the injection of a header flit into the network at the source node to the reception of the tail flit at the destination node. Noxim [33] is used as the NoC simulation platform. A Poisson packet injection distribution is used for synthetic traffic scenarios, whereas a self-similar packet injection distribution is used for the MMS scenario (self-similar traffic has been observed in the bursty traffic between on-chip modules in typical multimedia applications [34]).

Fig. 11 shows the average delay variation under uniform traffic for different ranges of communication bandwidth. That is, the bandwidth for each communicating pair has been randomly generated between the lower and upper bounds reported on the x-axis. In this experiment both the random selection policy (RND oblivious routing) and the load balancing selection policy (LB adaptive routing) have been used, to distinguish the effect of the selection policy. The graph also reports results for deterministic XY routing and adaptive odd-even routing. Once again, APSRA-BW and APSRA-BWL outperform APSRA. For a given average delay, APSRA-BW and APSRA-BWL are able to sustain higher-bandwidth communication traffic than APSRA. The performance improvement over XY and odd-even is even more evident.

Figure 11 Average delay variation under uniform traffic for different ranges of communication bandwidth

Fig. 12 shows average delay, throughput and energy for different packet injection rate (pir) factors under the MMS traffic scenario.
That is, starting from the communication graph of the application, we compute the pir of any communication $c$ as

\[ \mathrm{pir}(c) = \frac{\text{communication bandwidth of } c}{\text{packet size} \times \text{flit size} \times \text{clock frequency}} \]

Thus, a point in the graph at a given pir factor $p$ is computed by simulating the network using a pir value of $p \times \mathrm{pir}(c)$ for each communication $c$. As can be observed, both the oblivious routings (odd-even and APSRA with random selection function) and the adaptive routings (APSRA-BW and APSRA-BWL with LB selection function) outperform XY deterministic routing. For instance, looking at Figs. 12a and b, moving from XY to odd-even the pir factor which saturates the network (a network is said to start saturating when an increase in applied load does not result in a linear increase in throughput [35]) increases by 33%. An additional improvement of 25% is obtained when application-specific routing is used. Finally, the use of an effective selection function like the one proposed in this paper adds a further 10% and 40% of improvement when APSRA-BW and APSRA-BWL are considered, respectively.

Fig. 12c shows the average energy per cycle per flit for different pir factors. We used the high-level energy estimation feature provided by the Noxim simulator to compute energy numbers [22]. Please note that the values after the saturation pir factor do not carry useful information, as there the network is congested and flits spend much of their travel time waiting in router buffers. Thus, considering the range of pir factors where none of the algorithms is saturated, we observe that application-specific routing algorithms are more than 6% and 5% more energy efficient than XY and odd-even, respectively. If we restrict the analysis to APSRA, APSRA-BW and APSRA-BWL, we observe that the proposed approach reduces energy consumption by 6%.
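As a worked illustration of the pir computation described above, the following Python sketch converts a communication bandwidth into packets per clock cycle and then applies a pir factor. It assumes the bandwidth is expressed in bits per second, the packet size in flits, the flit size in bits and the clock frequency in Hz; the function names, the 8-flit packet and the 1 GHz clock are hypothetical values chosen for the example and do not come from the paper.

```python
def packet_injection_rate(bandwidth_bits_per_s, packet_size_flits,
                          flit_size_bits, clock_hz):
    """Packets injected per clock cycle for one communication.

    Assumed units: bits/s, flits per packet, bits per flit, Hz.
    """
    bits_per_packet = packet_size_flits * flit_size_bits
    packets_per_second = bandwidth_bits_per_s / bits_per_packet
    return packets_per_second / clock_hz


def scale_by_pir_factor(pir_by_communication, factor):
    """Apply the same pir factor p to every communication c: p * pir(c)."""
    return {c: factor * pir for c, pir in pir_by_communication.items()}


# Example: 100 MB/s (taken here as 100e6 bytes/s), 8-flit packets,
# 64-bit flits and a 1 GHz clock (all illustrative assumptions).
base = {
    "c0": packet_injection_rate(100e6 * 8, packet_size_flits=8,
                                flit_size_bits=64, clock_hz=1e9),
}
doubled = scale_by_pir_factor(base, 2.0)
```

Sweeping the factor passed to `scale_by_pir_factor` reproduces the kind of x-axis used in the pir-factor plots: every communication keeps its relative weight while the overall offered load grows.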
Taking APSRA as the baseline implementation, Fig. 13 summarises the improvements in terms of percentage increase in saturation pir factor and reduction of both average delay and energy consumption for different traffic scenarios. For all traffic scenarios but MMS, the bandwidth for each communicating pair has been randomly generated between 10 and 100 MB/s. As can be observed, on average APSRA-BWL improves the saturation point by 38%, reduces average delay by 43% and energy consumption by 4%.

Figure 12 Simulation results for MMS traffic
a Delay variation
b Throughput variation
c Energy variation

Figure 13 Summary of the results taking APSRA as baseline
a Percent increase in saturation pir
b Percent reduction in average delay
c Percent reduction in energy consumption

Finally, Fig. 14 shows the link utilisation under uniform traffic for APSRA and APSRA-BWL. Link utilisation values are discretised into three levels: low (white), medium (grey) and high (black). As can be observed, when APSRA-BWL is used link utilisation is more evenly distributed than with APSRA. For instance, when APSRA is used there are several highly utilised links (black) and many lowly utilised links (white). When APSRA-BWL is used, the traffic flows responsible for the high utilisation of some links are redistributed in favour of lowly utilised links. This is confirmed by the higher number of medium utilised links when APSRA-BWL is used.

7 Implications for router architecture

In this section we present a router architecture designed to support the proposed routing algorithm (routing function and selection function).

7.1 Router architecture

Fig. 15 shows an architecture of the proposed router for the case of a mesh network topology and minimal routing. The top part of the picture shows the high-level view of the router, whereas the bottom part shows the block diagrams of the modules which implement the routing function and the selection function associated with the west input port. The routing function is implemented by means of a routing table. The routing table is addressed by the destination id. An entry of the routing table contains two main fields: AOC and Pr. AOC encodes the set of admissible output channels that can be used to reach the current destination.
If we consider the west input port, AOC is a four-bit field whose bits indicate which of the output ports among north (N), east (E), south (S) and local (L) can be used to reach the current destination. Pr encodes the probability used by the selection function as discussed in Section 5.3. The number of bits used to encode Pr determines the precision of the selection function. For instance, using three bits, eight probability levels are possible (up to 1).
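The way a stored probability drives the selection function of (3) can be sketched in Python as follows. This is a behavioural illustration only, not the router's hardware: the helper names are hypothetical, and the quantisation rule (rounding to the nearest multiple of 1/8 and never below 1/8, so that an admissible channel keeps a non-zero probability) is an assumption, since the text only states that three bits give eight levels.

```python
import random


def path_probabilities(paths_per_channel):
    """Probability of each channel, proportional to its admissible paths."""
    total = sum(paths_per_channel.values())
    return {ch: n / total for ch, n in paths_per_channel.items()}


def quantise(p, bits=3):
    """Round p to one of 2**bits levels in (0, 1] (assumed encoding)."""
    levels = 2 ** bits
    return max(1, round(p * levels)) / levels


def select_channel(channels, probs, xi):
    """Return l_i such that xi falls in the i-th cumulative interval of (3)."""
    cumulative = 0.0
    for ch in channels:
        cumulative += probs[ch]
        if xi < cumulative:
            return ch
    return channels[-1]  # guard against rounding when xi is close to 1


def pick(channels, probs, rng=random):
    """Draw xi uniformly in [0, 1) and apply the selection function."""
    return select_channel(channels, probs, rng.random())


# Example: three admissible paths start from north, one from east.
probs = path_probabilities({"N": 3, "E": 1})
```

With two admissible channels, as in minimal routing on a mesh, it is enough to store the quantised probability of the first channel; the second channel implicitly receives the complement.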


Figure 14 Links utilisation under uniform traffic for APSRA and APSRA-BWL

A possible implementation of the selection function reported in (3) is shown in the bottom right corner of Fig. 15. The connector labelled with 1 is used in several parts of the circuit. It is set when the routing function returns more than one admissible output channel. If it is zero, only one admissible output channel can be used. In this case, the selection logic is bypassed and clock gating is used to prevent unnecessary activity in unused blocks. The DirEncS block converts the one-hot encoding used at the input to the encoding of the selected output channel. If more than one output channel can be used (at most two, because we are considering minimal routing), a selection must be made. The input pr is shifted left (multiplied) and compared with the current value stored in the linear feedback shift register (LFSR). If it is less, the first output channel is selected; otherwise the second one is selected. This selection is, of course, conditioned by the whrt word, which encodes the reservation status of the output channels under the wormhole switching technique. For example, suppose that the north and east output channels are admissible and north should be selected after the comparator. If the north output channel is reserved but east is not, east will be selected. This computation is performed by the DirEncM block, which returns the encoding of the selected output channel.

Figure 15 Block diagram of the router for a mesh network topology. Top view (top), routing function and selection function associated to the west input port (bottom)

7.2 Area, timing and power analysis

A router implementing the deterministic XY routing algorithm, a router implementing adaptive odd-even routing and two table-based routers, one implementing a random selection policy (TB-RND) and the other implementing the load balancing selection policy (TB-LB), have been designed in VHDL, synthesised using Synopsys Design Compiler and mapped on a 90 nm technology library from TSMC. We considered 8 × 8 mesh topology networks and four-flit FIFO input buffers with a flit size of 64 bits. The analysis is carried out at the granularity of the following main blocks.

Arbiter: It is a general arbiter which manages situations where several packets simultaneously want to use the same output. In this case, arbitration between them has to be performed. In this implementation a round-robin policy is used.
XBar: It is a general 5 × 5 crossbar block which routes non-conflicting packets simultaneously.
Input FIFOs: They are the FIFO buffers at the input of each router. For a mesh topology there are five FIFO buffers in total. We considered four-entry FIFO buffers with an entry size of 64 bits (flit size).
WHRT: This block implements the Wormhole Reservation Table, which stores the output port selected by the routing algorithm for a given input port.
Routing function: It is the block that gives the set of admissible outputs for the current node and a given destination. As we are considering mesh topologies and minimal routing, the maximum number of admissible outputs is two.
Selection function: This block probabilistically selects one of the outputs from the set of admissible outputs returned by the routing function.
Control: Control logic for sequencing the various activities in the router.

The effectiveness of the proposed selection policy depends on the number of bits used to encode the selection probabilities stored in the routing table (field Pr in the routing table shown in Fig. 15). We used three bits (i.e. eight probability levels up to 1) as no appreciable performance improvements have been observed by using more than three bits. For instance, Fig. 16 shows the average delay variation under MMS traffic when different discretisation levels are used to encode the selection probabilities.

Figure 16 Average delay variation under MMS traffic when different discretisation levels are used to encode selection probabilities

Area analysis: Fig. 17a shows the area breakdown for the considered routers. As expected, although a good percentage of the area is due to FIFO buffers, control logic and arbiter, the impact of the routing table is quite evident. The use of the LB selection function determines an area overhead on the routing function block (i.e. routing table) and on the selection function block of 56% and 73%, respectively. The overhead in the routing table is due to the additional field Pr, which stores the selection probabilities used by the selection function. However, as input FIFO buffers dominate the area, globally this overhead translates to only approximately 8% of overall router area.

Power analysis: Average power dissipation values of the main blocks composing the four routers are shown in Fig. 17b. Once again, the main contribution to power dissipation is due to FIFO buffers. The second highest contribution is due to the crossbar. Power dissipated by routing tables is 8% and 3% more than that dissipated by routing blocks implementing XY and odd-even, respectively. With regard to the selection function, power dissipated by the LB selection function is about 80% more than that dissipated by a random selection function. In terms of global router power dissipation, the routing table contributes about 12%, whereas the LB selection function contributes 6%. It should be pointed out that both the routing table and the selection function block are active only when a header flit is processed. In fact, the above analysis is very conservative, as it has been assumed that all the blocks in the router are characterised by the same utilisation factor (worst-case analysis).
In practical cases, the power contribution of the routing table and the LB selection function is likely to be lower than that reported above.

Timing analysis: Fig. 17c shows the delay of the different blocks composing the four routers. We considered a five-stage pipeline implementation of the router with the following stages: FIFO, routing, selection, arbitration and crossbar. In this case the clock frequency is determined by the FIFO stage, except for the router implementing odd-even routing, whose slowest stage is routing. The access to the routing table and the computation of the LB selection function do not affect the router clock frequency.


Figure 17 Comparison between routers implementing XY routing, odd-even routing and table-based routers with random selection policy and load balancing selection policy
a Breakdown of area
b Breakdown of power dissipation
c Breakdown of delay
d Area, timing and power values normalised with respect to a router implementing XY routing

7.3 Summary of the architectural implications

Fig. 17d compares the different routers in terms of area, delay and power. Values are normalised with respect to a router implementing XY routing. In terms of area, TB-LB is 8% bigger than a classic table-based router implementing a random selection function. This difference is mainly due to the increased width of the routing table, which needs to store the selection probabilities required by the LB selection function. In terms of timing, the larger routing table and the LB selection function do not impact the clock frequency of the router, as the slowest pipeline stage continues to be the FIFO stage. Average power dissipation of the TB-LB router is 14%, 7% and 3% higher than that of the XY, odd-even and TB-RND routers, respectively. However, as shown in the experimental section, the performance improvement obtained using the proposed routing and selection functions results in an overall saving in energy consumption. The average energy consumed in a period of time is the product of the average power dissipation and the duration of the period. Therefore, although a TB-LB router is more power hungry than the other routers, a network built with TB-LB routers requires fewer cycles to drain a given volume of traffic, with a consequent reduction in energy consumption.

8 Conclusions

An application-specific routing algorithm has the potential to provide substantially higher communication performance than general purpose routing algorithms.
In this paper we have presented an extension to the APSRA methodology to design highly adaptive, bandwidth-aware, application-specific, deadlock-free routing algorithms for NoC platforms. The basic idea behind the approach is the exploitation of communication bandwidth information to customise the routing algorithm for a given application. The approach is divided into two phases. In the first phase, information regarding the communication bandwidth required between a pair of cores is used in the heuristic that removes cycles in the ASCDG to ensure deadlock freedom, and in deciding the selection probabilities for the various available paths of a communication. This helps the resulting routing algorithm to achieve high adaptivity while spreading the traffic uniformly over the network links. In the second phase, the routing function is further restricted in an iterative manner to reduce the load on overloaded network links. The approach has been evaluated on both synthetic and real traffic scenarios. The results obtained show that the routing algorithm generated by the proposed approach (i) is highly adaptive, (ii) reduces the variation of load in the network links and (iii) ensures that the link
