SYSARC 768 No. of Pages 14, Model 5+ ARTICLE IN PRESS UNCORRECTED PROOF


Journal of Systems Architecture xxx (2007) xxx xxx

Deadlock free routing algorithms for irregular mesh topology NoC systems with rectangular regions

Rickard Holsmark a,*, Maurizio Palesi b, Shashi Kumar a

a School of Engineering, Jönköping University, Sweden
b DIIT, University of Catania, Italy

Received 15 December 2006; received in revised form 18 May 2007; accepted 17 July

Abstract

The simplicity of the regular mesh topology Network on Chip (NoC) architecture leads to reductions in design time and manufacturing cost. A weakness of the regular shaped architecture is its inability to efficiently support cores of different sizes. One way proposed in the literature to deal with this is to utilize the region concept, which helps to accommodate cores larger than the tile size in mesh topology NoC architectures. The region concept offers many new opportunities for NoC design, and also raises new design issues and challenges. One of the most important among these is the design of an efficient deadlock free routing algorithm. Available adaptive routing algorithms developed for regular mesh topology cannot ensure freedom from deadlocks. In this paper, we list and discuss many new design issues which need to be handled when designing NoC systems incorporating cores larger than the tile size. We also present and compare two deadlock free routing algorithms for mesh topology NoC with regions. The idea of the first algorithm is borrowed from the area of fault tolerant networks, where a network topology is rendered irregular due to faults in routers or links, and is adapted for the new context. We compare this with an algorithm designed using a methodology for the design of application specific routing algorithms for communication networks. The application specific routing algorithm tries to maximize adaptivity by using static and dynamic communication requirements of the application.
Our study shows that the application specific routing algorithm not only provides much higher adaptivity, but also superior performance as compared to the other algorithm in all traffic cases. But this higher performance for the second algorithm comes at a higher area cost for implementing network routers.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Networks on Chip; Mesh topology; Routing algorithms; Wormhole switching; Deadlock; Application specific routing

This paper is an extended version of the paper presented at DSD 2006 [26].
* Corresponding author. E-mail address: hori@ing.hj.se (R. Holsmark).
doi: /j.sysarc

1. Introduction

Network on Chip (NoC) is slowly being accepted as an important paradigm for implementing communication among various cores in a SoC. Network topology and routing algorithms are two of the most important aspects which distinguish various proposed NoC architectures [1-5]. Fixed tile size based two dimensional mesh topology is favored by many research groups because of its layout efficiency, good electrical properties and simplicity in addressing on-chip resources. Such a physically homogeneous network is not efficient for incorporating cores of different sizes in the network. In such a network, the tile size should be able to accommodate the physically largest core, such as a shared memory. It will also be hard to reuse earlier designed multi-core sub-systems within a fixed tile size based NoC. To overcome these problems the concept of a region was proposed in [1]. This concept allows a rectangular area in the mesh, larger than a tile, to be declared as a region. The region is isolated from the outside network using a wrapper as shown in Fig. 1. There are many advantages of using a modified mesh topology NoC to handle cores larger than the tile size rather than developing a

new topology for each new SoC. The modified topology automatically inherits the scalability property of the underlying mesh topology. Due to the known and uniform length of wires for the links it is possible to guarantee good electrical properties of the signals. The design of the routers can also be reused across designs.

Fig. 1. Region within a mesh topology NoC.

In a NoC system with regions, routing of packets becomes more complex. Some network routers are removed from the mesh network to accommodate a large region. In effect, a region acts as an obstacle to the network traffic. This not only results in higher packet latency, but also means that deadlock free routing algorithms designed for regular mesh networks are no longer usable.

Routing in networks has been classified in several ways in literature [6]. Routing schemes can be classified as source routing or distributed routing. In source routing the source node decides the entire path for a packet and appends it as a field in the packet. In this scheme there is no possibility of adapting the route after the packet leaves the source. In distributed routing schemes a router, on receiving the packet, decides whether it should be delivered to the local resource or forwarded to a neighboring router. Routing algorithms are also classified as deterministic or adaptive. In deterministic routing the routing path is decided only from the source and destination addresses. In an adaptive routing scheme multiple paths from the source node to the destination node are possible. A particular path can be selected to optimize certain performance parameters.

Two properties which are necessary in all usable routing algorithms are deadlock and livelock freedom.
These two properties respectively ensure that packets are not blocked in the network forever and do not wander across the network indefinitely [7]. A large number of algorithms exist for regular topology networks which ensure these properties. But only a few algorithms exist for irregular topologies which are both efficient and allow deadlock free routing. Another desirable property of a routing algorithm is that it gives fair and uniform performance to all equal priority traffic in the network. Achievement of this property is harder in a network with irregular topology than in a regular topology network. This paper resulted from an effort to search for an algorithm which can handle irregularity, induced in a regular mesh topology by multiple rectangular regions of various sizes, and can provide these properties.

The cost of a routing scheme is reflected in the implementation cost of the router. Generally, there is a tradeoff between cost and performance, implying that routing schemes providing higher performance are costlier than routing schemes with lower performance. Recently, power consumption is also being considered as a cost parameter in the design of network on chip architectures [8]. This paper focuses on evaluating and comparing the performance of two distinct types of adaptive deadlock free routing algorithms for irregular topology mesh networks.

The rest of the paper is organized as follows. In Section 2 we review related work. Section 3 presents the region concept and lists its applications and design issues in SoC design. In Section 4, we discuss the important issue of deadlock free routing and describe two different types of routing algorithms that can be used for NoC platforms with regions. We also briefly discuss the hardware implications of the algorithms.
In Section 5 we present an evaluation of these two routing algorithms and present results comparing their performance for synthetic communication traffic as well as traffic in a realistic multi-media application. Section 6 concludes the paper and lists some research problems for the future.

2. Related work

Many factors affect the overall performance of a NoC. Network topology, flow control mechanism, switching technique and routing algorithm represent just a short list. In this paper we focus on routing algorithms in which the underlying switching technique is based on the wormhole concept [9]. Wormhole switching, used in communication networks, is proposed by several researchers (e.g., [4]) as most suitable for on-chip communication. It is preferred for two main reasons. First, it requires smaller router buffers as compared to the store-and-forward switching scheme. Second, network latency becomes relatively insensitive to path length due to the pipelined nature of the flow of flits. Unfortunately, wormhole routing is very susceptible to deadlocks because messages are allowed to hold many resources while requesting others. To solve the problem of deadlock, many algorithms have been proposed for mesh topology networks in literature. For example, the simple X-Y routing algorithm and Turn-model based [10] algorithms like west-first are deadlock free in mesh networks. However, none of these can be used for meshes with regions, since circumventing a region is impossible because of the restrictions on the allowed turns.

Neeb et al. [11] have proposed a methodology called INoC in which a customized topology and a customized deadlock free routing algorithm are designed for an application. They show that for irregular traffic loads, the performance of the INoC approach is better than that of regular topologies like Mesh, Tori and Spidergon. As the first step, the INoC approach starts with a floor-plan of the required hardware resources and a bidirectional chain topology in which all pairs have a path between them. Additional shortcut channels are added to increase the bandwidth as required to satisfy the application's traffic. A change in traffic patterns requires re-computation of the required shortcuts. A table based router design is assumed for implementation. The approach does not consider the physical sizes of various cores and their effect on physical layout.

Bolotin et al. [4] have proposed a non-homogeneous mesh topology NoC architecture allowing rectangular cores larger than the mesh tile. Their solution to deadlock free routing is to extend X-Y routing with hard coded paths which are computed off-line. This solution is difficult to reuse across applications and cannot take care of modifications in the communication topology of an application.

A problem similar to regions can be recognized when designing fault-tolerant routing algorithms for mesh networks. Several of these algorithms consider faults to be contained in rectangular blocks similar to regions. In this category, virtual channels [12] have been used to facilitate the design of such algorithms. In [13] Boppana and Chalasani show that, using just one extra virtual channel per physical channel, the well-known e-cube algorithm can be used to provide deadlock free routing in networks with non-overlapping fault rings. In the same paper the authors prove that at most four additional virtual channels are sufficient to make fully adaptive algorithms tolerant to multiple faulty blocks in n-dimensional meshes.
A deterministic fault-tolerant wormhole routing algorithm for mesh networks is presented by Zhou and Lau in [14]. The proposed algorithm can tolerate convex fault-connected regions but requires three virtual channels. Nevertheless, the use of virtual channels adds resources and increases design complexity. Some researchers have proposed fault tolerant algorithms without the use of virtual channels. These are based on non-adaptive routing algorithms that are modified to work in the presence of faults or regions. In [15], Wu proposes modifications to the X-Y routing algorithm to route around faulty blocks, but also imposes some restrictions. In [16] an algorithm that is less restricted was proposed by Chen and Chiu. Based on [16] a non-minimal deadlock free routing algorithm is also described for irregular topology NoC with regions in [17,39]. Mejia et al. in [18] propose a deterministic routing methodology for tori and meshes which achieves high performance without the use of virtual channels. Furthermore, it is topology-agnostic in nature, meaning it can handle any topology derived from any combination of faults. Unfortunately all the aforementioned routing algorithms are deterministic, i.e. they do not allow adaptivity to communication traffic.

Adaptivity is the ability of a routing algorithm to adapt to changing situations. Therefore, the number of alternative paths provided by a routing algorithm for routing a message from a source node to a destination node can be used as a measure of its adaptivity. A routing algorithm with high adaptivity also has the potential of providing high performance (low latency, low packet drop and high throughput), fault tolerance and uniform utilization of network resources.
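To make the path-count measure of adaptivity concrete, the following minimal sketch counts the minimal paths between two nodes of a regular 2D mesh without regions. It assumes nodes are addressed by (x, y) coordinates; the function name `minimal_path_count` is hypothetical and not taken from the paper. The count is the upper bound available to a minimal fully adaptive algorithm, whereas a deterministic algorithm such as X-Y always uses exactly one of these paths.

```python
from math import comb

def minimal_path_count(src, dst):
    """Number of distinct minimal paths between two nodes of a regular 2D mesh.

    Assumption: nodes are (x, y) tuples and every minimal path consists of
    dx horizontal hops and dy vertical hops taken in any order.
    """
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return comb(dx + dy, dx)

# The count grows quickly with the distance between source and destination.
for src, dst in [((0, 0), (1, 1)), ((0, 0), (3, 3)), ((0, 0), (7, 7))]:
    print(src, dst, minimal_path_count(src, dst))
```

The rapid growth of this count with distance also indicates why enumerating every minimal path can become expensive for communications between distant nodes.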
Of course adaptivity has some drawbacks, such as the problem that packets can reach the destination out of order due to the difference in congestion levels on the multiple paths. However, different approaches have been proposed in literature to cope with this problem, like the use of a simple re-ordering mechanism at network reconvergent nodes proposed by Murali et al. in [19].

One of the most important steps in the development of a theoretical framework for the design of adaptive deadlock free routing algorithms is due to Duato. In [20] he proposed a general theory to develop highly adaptive deadlock free routing algorithms for a general communication network which uses the wormhole switching technique. Duato's theory is based on the idea of channel dependency graphs [21]. These graphs are used to identify a set of consecutive communication channels in the network which, if used concurrently, can cause a deadlock situation. If no cycles exist in such a graph, the analyzed routing algorithm is deadlock free. Duato's theory does not exploit the possible knowledge of the communication traffic characteristics since it has been designed in a general-purpose domain where virtually each network node can communicate with any other node of the network. In [22] we focused on the embedded system domain where, often, the knowledge of communication traffic characteristics is available at design time. We took advantage of this additional knowledge to extend Duato's theory in such a way as to generate highly adaptive and deadlock free application specific routing algorithms. The approach, named APSRA (Application Specific Routing Algorithm), has been evaluated on homogeneous 2D mesh NoC architectures and compared with turn model based routing algorithms. However, the approach is general and can be applied to any network topology, such as a non-homogeneous 2D mesh with regions.
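The acyclicity condition on the channel dependency graph mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the construction used in [20] or [22]: the mesh is regular (no regions), the routing function depends only on the current node and the destination, and all names (`xy_routing`, `minimal_adaptive_routing`, `build_cdg`, `has_cycle`) are hypothetical.

```python
from itertools import product

def xy_routing(cur, dst):
    """Deterministic X-Y routing: correct the x offset first, then the y offset."""
    if cur[0] != dst[0]:
        return {(cur[0] + (1 if dst[0] > cur[0] else -1), cur[1])}
    if cur[1] != dst[1]:
        return {(cur[0], cur[1] + (1 if dst[1] > cur[1] else -1))}
    return set()

def minimal_adaptive_routing(cur, dst):
    """Minimal fully adaptive routing: any hop that reduces the distance."""
    hops = set()
    if cur[0] != dst[0]:
        hops.add((cur[0] + (1 if dst[0] > cur[0] else -1), cur[1]))
    if cur[1] != dst[1]:
        hops.add((cur[0], cur[1] + (1 if dst[1] > cur[1] else -1)))
    return hops

def build_cdg(width, height, routing):
    """Channel (a, b) depends on channel (b, c) if some destination allows both in sequence."""
    nodes = list(product(range(width), range(height)))
    cdg = {}
    for a, dst in product(nodes, nodes):
        for b in routing(a, dst):
            cdg.setdefault((a, b), set())
            for c in routing(b, dst):
                cdg[(a, b)].add((b, c))
                cdg.setdefault((b, c), set())
    return cdg

def has_cycle(graph):
    """Depth-first search with three colours; a grey successor closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}

    def visit(v):
        colour[v] = GREY
        for n in graph[v]:
            if colour[n] == GREY or (colour[n] == WHITE and visit(n)):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)

print("X-Y cyclic:", has_cycle(build_cdg(3, 3, xy_routing)))
print("fully adaptive cyclic:", has_cycle(build_cdg(3, 3, minimal_adaptive_routing)))
```

On a 3x3 mesh the X-Y graph is acyclic, while the minimal fully adaptive graph contains cycles, which is why unrestricted adaptive routing needs either restrictions or additional resources to be deadlock free.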
In this work we apply the APSRA methodology to develop routing algorithms for mesh topologies in which irregularity has been introduced by incorporating regions.

This paper makes the following main contributions. We list the issues and problems that arise when designing a mesh topology NoC system using cores larger than the tile size. We propose and compare the performance and cost of two distinct approaches for designing deadlock free routing algorithms for this special type of irregular topology network. The simulation based performance analysis clearly demonstrates that the APSRA approach is distinctly better.

3. Region concept and new design issues

The region concept presented in [1] was intended for the use of larger resources which do not fit in the fixed sized slot of a regular mesh architecture layout. Region concept could

in addition be useful for encapsulating a group of resources which have very high and special communication requirements which cannot be supported by the general NoC communication infrastructure. Within such a region, there could be specialized interconnections as well as communication protocols for achieving the required performance. The concept also allows encapsulation of a group of resources as a region for special requirements such as power consumption or data security.

The above applications of regions imply that the region structure is physically different in design from its surroundings. This is however not necessary; it is possible that the region is defined as a logical structure. In this case the internal hardware design of the region is identical to the outside NoC structure but is logically isolated from the surrounding network. This assumes that there are configurable routers in the NoC that can be used for defining and maintaining a region. These routers on the region boundary isolate the computation and communication within the region from external traffic. Another application of the region concept is to support different configurations of power/performance modes of resources inside a region by control of operating voltage, clock frequency etc.

We argue that reuse of multi-core subsystems will become a very important application of the region concept in the near future. The region concept can, for example, be applied for the reuse of subsystems which have been developed for efficient processing of multi-media applications. These solutions are currently available as separate SoCs.
Hence, the concept of region offers the possibility of raising the level of reuse from a core to a level where specially designed multi-core subsystems can be reused. It is unlikely that these subsystems will physically fit in the general slot for a core in the mesh NoC. Without the region concept the subsystem will have to be redesigned keeping in view the NoC constraints. The effort required to redesign may be too high, or the redesigned subsystem may not be able to achieve the required performance in the NoC context. Fig. 2 illustrates the possibility of reusing a multi-core SoC, presented in [23], as a NoC region. The region concept presented in [1] suggested a convex shape of a region. This is easier to handle in terms of routing but may not be optimal with respect to placement and shape of the region.

Routing in NoC with regions

Efficient routing of messages within the network is essential in order to fully exploit the power of the computing resources and achieve good performance for applications running on them. A good routing algorithm should not only provide low latency for messages but should also be deadlock free when the network is concurrently routing multiple messages. However, incorporating regions in mesh networks results in a major change of the communication infrastructure, and the existing mesh routing algorithms cannot be directly reused. In addition to creating problems of deadlock freedom, regions also affect the traffic distribution in the network. Traffic flows which get obstructed by the region have to circumvent it in order to make progress. This could make the border links of the region more heavily used as compared to other links. Adaptive routing is one solution that can reduce the problem of local congestion. Normally, the term adaptive refers to the possibility to sense congestion and take action to divert from it. In this sense it is reactive.
When regions are used in a NoC, information about them can be incorporated in the routing algorithm so that the occurrence of congestion is reduced or avoided.

Fig. 2. Multi-core subsystem [23] as a NoC region.

Accessing and addressing regions

How the region is accessed and how it can access other resources is an important issue while designing with regions. Since a region occupies a larger area than a standard resource, it may be useful to consider several addresses and several access points to it. A large region may internally provide different types of access mechanisms to its internal resources. The external access points have to be properly connected to the internal access mechanisms. The purpose for which the region is used can also affect how the region is designed. A large shared memory is likely to require several access points distributed around the entire border, whereas a system with many processing elements may be accessed only by a few resources outside the region. The number of access points determines the communication bandwidth between a region and the rest of the network; the position of access points on the region boundary affects the communication latency of data. When using a region, the access points and addresses of the region must be defined. The three major options, in

order of increased routing complexity and accessing power, are:

(1) Use the corner router which originally had a resource connected to it as a single access point.
(2) Use the routers on the border that originally had connections to resources within the region, as multiple access points.
(3) Use all the possible routers on the boundary as multiple access points to the region. In this case some routers are connected to both a standard resource and a region.

Fig. 2 illustrates how a region can be accessed using multiple access points. The routers through which access is possible are shaded gray.

Design issues

As described in the previous section, the concept of region extends the possibilities for NoC paradigm based SoC design in many interesting and useful ways. Many new design space exploration activities can be performed in NoC systems with regions. Here we list a few new degrees of freedom available for exploration.

- Placement of the region.
- Shape or aspect ratio of the region.
- Number of access points to the region.
- Position of access points on the region boundary.

These parameters span a huge design space which poses new challenges in the research area of design space exploration strategies.

4. Deadlock free routing algorithms for NoC systems

The deadlock free algorithms developed for homogeneous mesh networks, like the Odd-Even routing algorithm [24], cannot be directly used in NoC with regions. To be able to reach all destinations the routing algorithm has to decide upon turns to get around the region. This will in many situations violate rules that were used to secure the deadlock freedom property in the case of a homogeneous topology NoC. Breaking these rules in order to reach a destination may result in a deadlock situation.
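The following minimal sketch shows why a routing algorithm designed for a regular mesh cannot simply be reused once a region is present. It uses X-Y routing as the simplest case and assumes, purely for illustration, that every router whose grid position falls inside the region rectangle has been removed; the helper names `xy_path` and `first_blocked` and the rectangle encoding `(x0, y0, x1, y1)` are hypothetical.

```python
def xy_path(src, dst):
    """Routers visited by X-Y routing: travel along x first, then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def first_blocked(path, region):
    """Return the first router on the path that lies inside the region, or None.

    Assumption: region = (x0, y0, x1, y1) is an inclusive rectangle of grid
    positions whose routers were removed to make room for a large core.
    """
    x0, y0, x1, y1 = region
    for x, y in path:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return (x, y)
    return None

region = (2, 2, 3, 3)  # hypothetical 2x2 region in a 6x6 mesh

# This route runs along row y = 2 and hits the region at (2, 2).
print(first_blocked(xy_path((0, 2), (5, 3)), region))

# This route stays in rows 0 and 1 and is unaffected by the region.
print(first_blocked(xy_path((0, 0), (5, 1)), region))
```

To deliver the first message, the route has to leave the row before reaching the region and return to it afterwards, which requires turns that X-Y routing does not permit.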
In the following subsections we describe two routing algorithms that we have used in our evaluation of routing performance of NoC in the presence of regions. They represent two distinct approaches that can be applied to guarantee deadlock free routing in a NoC both with and without regions. Due to cost considerations of on-chip resources, we only present algorithms that do not require virtual channels. However, it is possible to include this feature to increase network performance even further. The first approach is adopted from the area of fault tolerant routing. It is a general routing algorithm in the sense that it works for any traffic scenario and region placement in a NoC. This results in good scalability and it supports dynamic changes of both architecture and communication patterns. The second approach has evolved from knowledge of the design optimization of embedded systems. It relies on the assumption that communication among tasks in an embedded application is known in advance. This information about the communication is incorporated when designing the routing algorithm. As we need not consider all possible communication patterns, fewer restrictions have to be applied to the routes of the actual communications to avoid deadlocks. Thus, an application specific routing algorithm can have more adaptivity as compared to a general algorithm. However, any change in architecture or communication pattern requires a re-analysis and possibly a re-design of the complete routing algorithm.

Routing algorithm adapted from fault tolerance area

Chen and Chiu [16] presented a fault tolerant algorithm that can be used for routing in the presence of regions. However, the published algorithm had some errors which have later been corrected. The new version of this algorithm can be used in the presence of regions for reaching all destinations in a deadlock free manner [17].
We describe the basic ideas of the original algorithm here; for a full description of the algorithm, see [16] and [17]. For our purpose a faulty block described in the original algorithm is equivalent to a region. Chen and Chiu [16] borrow the idea of rings and chains from [13] to isolate the faulty nodes from the rest of the network. For messages which do not encounter any ring or chain, they allow non-adaptive routes which use at most one turn from source to destination. For messages encountering faulty blocks it becomes necessary to allow some extra turns which are forbidden during normal routing. Only a few combinations of forbidden turns are allowed, chosen such that these turns can never combine with each other (or with the allowed normal turns) to form a cycle.

Fig. 3. Message types and corresponding allowed routes in algorithm.

When routing on paths not affected by faults, messages are forwarded in the network according to their type, as illustrated in Fig. 3. A message is of type row first (RF) if it has the destination to its west. If the destination is to its north or south it is a column first (CF) message. A message of type RF can thus change to CF when it reaches the

column of destination. If it has its destination to its east it is of type column first (CF), except when the destination is in the same row, in which case it is row only (RO). A CF message can also change to RO if the destination is in the same row to its east. However, an RO message never changes its type. If a message hits the border of a faulty block, special rules apply depending on the type of the message and whether the border resides on a fault ring or a fault chain. There are different rules for routing around these depending on whether the faults are surrounded by an s-chain (a chain that touches the south border only), a non s-chain (a chain that touches only the west or the west and south border) or a ring (all other positions of rings and chains).

Fig. 4 illustrates routes for some messages when traveling in the presence of faulty blocks (regions). In this figure, messages are denoted by their source (Sn) and destination (Dn).

Application specific routing algorithms

Typical routing algorithms for NoC systems are designed for a specific network topology and are independent of the application which will be mapped on the NoC. If a small variation of the topology should occur (e.g., due to the merging of tiles of a mesh based network to form a region) the routers need to be redesigned. The use of routing tables helps to overcome this problem and makes the router general and configurable. Routing tables are filled with information which enables communication between every pair of network nodes. The constraint to be satisfied is that the channel dependency graph (CDG) [21] should not contain any cycle, to be sure that the routing is deadlock free [20].
To do this, some possible paths that allow two nodes to communicate must be prohibited, causing a degradation of routing adaptiveness. This is, however, a strong limitation in an embedded system scenario, since the designer cannot exploit knowledge of the application that will be mapped on the NoC.

Often the designer is aware of which cores communicate and which do not. To overcome this limitation a methodology to generate application specific routing functions has been proposed in [22]. The basic idea of this methodology, known as APSRA (APplication Specific Routing Algorithm), is to extend Duato's theory in such a way as to exploit the designer's knowledge about communication characteristics of the application being implemented.

Fig. 5 shows the APSRA design methodology. The inputs of the methodology are: (1) the application modelled by means of task graphs, (2) the network topology modelled by means of a topology graph, and (3) a mapping function which maps each task of the task graph to a node of the topology graph. In addition, concurrency information, available after the task scheduling phase, can also be considered [29]. Using this information an application specific channel dependency graph (ASCDG) is built. In [22] it is proved that if the ASCDG is acyclic then the routing is deadlock free. Since the ASCDG is a sub-graph of the CDG, it is more likely to be acyclic. This probability is quite high since, in practical cases, each node of the network communicates with a small subset of other nodes. The result is that a number of dependencies that are present in the CDG (which is built by conservatively assuming that all the network nodes will communicate) are not present in the ASCDG (which is built by knowing the actual communicating pairs).
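The relation between the CDG and the ASCDG can be sketched as follows. This is an illustration only, not the APSRA implementation from [22]: it assumes a 2x3 mesh without regions, minimal fully adaptive routing, and a hypothetical set of three communicating pairs; the names `minimal_paths`, `dependencies` and `app_pairs` are likewise hypothetical.

```python
from itertools import product

def minimal_paths(src, dst):
    """Enumerate all minimal paths (as node lists) between two mesh nodes."""
    if src == dst:
        return [[src]]
    paths = []
    x, y = src
    if x != dst[0]:
        step = (x + (1 if dst[0] > x else -1), y)
        paths += [[src] + p for p in minimal_paths(step, dst)]
    if y != dst[1]:
        step = (x, y + (1 if dst[1] > y else -1))
        paths += [[src] + p for p in minimal_paths(step, dst)]
    return paths

def dependencies(pairs):
    """Channel dependencies induced by the minimal paths of the given pairs."""
    deps = set()
    for src, dst in pairs:
        for path in minimal_paths(src, dst):
            channels = list(zip(path, path[1:]))    # consecutive nodes form a channel
            deps.update(zip(channels, channels[1:]))  # consecutive channels form a dependency
    return deps

# 2x3 mesh: three columns (x = 0..2) and two rows (y = 0..1).
nodes = list(product(range(3), range(2)))

# CDG: conservatively assume every node may send to every other node.
all_pairs = [(s, d) for s in nodes for d in nodes if s != d]
cdg = dependencies(all_pairs)

# ASCDG: only the pairs the application actually uses (hypothetical example).
app_pairs = [((0, 1), (1, 0)), ((0, 0), (1, 1)), ((1, 0), (0, 1))]
ascdg = dependencies(app_pairs)

print(len(cdg), len(ascdg), ascdg <= cdg)
```

Because every dependency of the ASCDG is also a dependency of the CDG, any cycle of the ASCDG is also a cycle of the CDG, but not the other way round.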
However, if the ASCDG is not acyclic, a heuristic has been proposed in [22] to break all the cycles with the objective of minimising the impact on the degree of adaptiveness, under the constraint of guaranteeing destination reachability. The output of the methodology is a set of routing tables (one for each router of the NoC) which not only guarantees the reachability and the deadlock freeness of communication among tasks but also tries to maximise routing adaptivity. Finally, a compression technique can be used to compress the generated routing tables [27].

Fig. 4. Message routes when encountering fault rings and chains.

APSRA: A practical example

For the sake of example, let us consider the communication graph and the topology graph depicted in Fig. 6a and b respectively. Although for this example the topology is mesh-based, the approach is general and can be applied to any network topology without modification. As mapping function, let us consider $M(T_i) = P_i$, $i = 1, 2, 3, 4, 5$. The CDG for a minimal fully adaptive routing algorithm is shown in Fig. 6c. Since it contains six cycles, Duato's theorem cannot assure deadlock freeness of the minimal fully adaptive routing for this topology. The number of cycles is reduced to two for the ASCDG, as shown in Fig. 6d. We observe that some dependencies in the CDG are not present in the ASCDG. For instance, the edge corresponding to dependency $l_{1,2} \to l_{2,3}$ in the CDG does not appear in the ASCDG. In fact, channels $l_{1,2}$ and $l_{2,3}$ can be used in sequence only for the communications $T_1 \to T_3$, $T_1 \to T_6$, and $T_4 \to T_3$, which are not present in the CG. Although also in this case we cannot assure deadlock freeness, we can simply break the cycle as follows. The application specific channel dependency $l_{4,1} \to l_{1,2}$ is due to the communication

$T_4 \to T_2$. Such communication can be realized by both paths $P_4 \to P_5 \to P_2$ and $P_4 \to P_1 \to P_2$. If the routing function is restricted in such a way that the latter path is prohibited, the application specific channel dependency $l_{4,1} \to l_{1,2}$ does not exist any longer. In a similar way it is possible to break the second cycle by removing, for instance, the dependency $l_{1,4} \to l_{4,5}$ due to the communication $T_1 \to T_5$. However, this restriction reduces the degree of adaptiveness of the routing. Now suppose that we have some knowledge about communication concurrency and suppose that communication $T_1 \to T_5$ and communication $T_2 \to T_4$ do not overlap in time. Fig. 6e highlights the dependencies due to such communications. Since these communications are not concurrent, the associated dependencies are not concurrently active either. The result is that the two cycles are actually false cycles. In conclusion, for this latter case a minimal fully adaptive routing is deadlock free.

Fig. 5. Overview of APSRA design methodology.

Fig. 6. Comparison of cyclic dependencies without and with APSRA methodology: (a) communication graph, (b) topology graph, (c) channel dependency graph, (d) application specific channel dependency graph.
Some notes about APSRA's complexity

The construction of the ASCDG involves the annotation of each minimal path between every source/destination pair defined in the communication graph. The basic assumption is that we start from a minimal fully adaptive routing algorithm. If we consider a mesh-based topology, the complexity of annotating all the minimal paths for a given source-destination pair is $O(2^n)$, where $n$ is the dimension of the quadrant containing the source and destination nodes. This means that, as the NoC size increases, the approach could become infeasible if some nodes located far from each other need to communicate. It should be pointed out, however, that this is a worst-case condition and that it can be managed efficiently by considering the following. First, any topological mapping algorithm (like [26,32-35]) tries to map the most frequent and most critical communications in such a way as to minimise the physical distance between the source and destination nodes. This leads to mapped architectures which seem to mimic a kind of small-world phenomenon [36], in which there are many communications following short paths and few communications which require long paths. Second, the long distance communications, which determine the complexity of building the ASCDG, can be treated in a more practical way. That is, for these long distance communications, one can consider a subset of all the minimal paths. To limit the number of minimal paths to be annotated, an idea could be to fix a budget of minimal paths that can be used for any communication. In this way, the complexity can be tuned by simply modifying the budget, which can be considered a user defined parameter.

Hardware implications of APSRA

There are two main ways to implement a routing algorithm, depending on the way the underlying routing function is implemented.
The first way is to implement the routing function in hardware logic. In this case an FSM can be used to compute the set of admissible output ports based on the current node address, the destination address and some status information stored in the router. For simple routing functions, this results in small and fast routers. This method has been used by several NoC proposals [4,30].

The second way to implement the routing function is to use a routing table [31]. A schematic diagram of the architecture of a table-based router is shown in Fig. 7. The destination address is used to compute the address of the table entry, which encodes the set of admissible output ports on which the message can be forwarded. The main advantages of table-based routers are their flexibility and configurability, and the possibility of implementing any complex routing function without any variation in cost, since the data stored in the table defines the routing function. The drawback is that, in general, table-based implementations are costly, both in terms of silicon area and power dissipation, compared to implementations that use custom logic for the routing function. To cope with this drawback several techniques have been proposed [27,37,38]. All these techniques strive for the same objective, namely the compression of the routing table and the design of new router architectures which are able to work with the compressed tables. In [29] we showed that the cost overhead of a routing table implementation based on the compression technique and architecture presented in [27] represents only a small fraction of the overall router cost. In particular, for a lossless compression, we found that the overhead over an XY router is about 10% (this overhead can be reduced much further whenever a small degradation in routing adaptivity is admitted).
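The table-based scheme can be pictured with a small software model. This is only a sketch under stated assumptions: the class name `TableRouter`, the port names and the least-occupied selection policy are illustrative and are not taken from the router architecture of Fig. 7 or from [27].

```python
from typing import Dict, FrozenSet


class TableRouter:
    """Toy model of a table-based router: the table alone defines the
    routing function, so changing the algorithm means rewriting the table."""

    def __init__(self, table: Dict[int, FrozenSet[str]]):
        # table[destination address] -> set of admissible output ports
        self.table = table

    def admissible_ports(self, destination: int) -> FrozenSet[str]:
        """Routing function: a plain lookup indexed by destination address."""
        return self.table[destination]

    def select_port(self, destination: int, occupancy: Dict[str, int]) -> str:
        """Selection function (assumed policy): among the admissible ports,
        pick the one whose output buffer currently holds the fewest flits."""
        ports = sorted(self.admissible_ports(destination))
        return min(ports, key=lambda port: occupancy.get(port, 0))


# Hypothetical table for one router: two admissible ports towards node 5,
# local delivery for node 1.
router = TableRouter({
    5: frozenset({"east", "south"}),
    1: frozenset({"local"}),
})
print(router.select_port(5, {"east": 3, "south": 1}))  # prints: south
```

The point of the model is the cost argument made above: the lookup is the same whatever the routing function is, while the storage grows with the number of destinations, which is what the compression techniques try to reduce.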
As regards energy cost, we determined the energy dissipated in a router by running Synopsys Design Power on the gate-level netlist of the router (including the FIFO buffers) when it is stimulated by different random input data streams. The average energy dissipated by a flit for one hop in the network was estimated to be ... nJ, ... nJ, and ... nJ for the XY-based, Chen and Chiu-based, and table-based router respectively. Although the energy consumption of table-based routers is higher than that of the other routers, this does not mean that the overall NoC energy consumption is higher as well. It should be pointed out that flit switching is not the only source of power dissipation in NoCs. That is, even if accessing compressed routing tables requires additional energy, this may be balanced by a reduced usage of FIFO buffers due to better avoidance of congestion.

Fig. 7. Schematic diagram of a table based router architecture.

5. Evaluation and comparison of algorithms

5.1. Adaptivity analysis

One metric to characterize an adaptive routing algorithm is the degree of adaptiveness [7], which is essentially a measure of the number of paths the algorithm allows from the source to the destination. More precisely, it is defined as the average of the degree of adaptiveness of all communicating pairs. For a given source-destination pair, the degree of adaptiveness is defined as the ratio between the number of admissible paths and the total number of paths connecting the source node to the destination node. From a practical point of view, the degree of adaptiveness for a given routing algorithm R has been obtained by averaging the degree of adaptiveness for each communication of the communication graph. The degree of adaptiveness of a communication c is computed as follows:

    P_s = M(src(c));
    P_d = M(dst(c));
    n = NumberOfAdmissiblePaths(R, P_s, P_d);
    m = NumberOfPaths(P_s, P_d);
    return n/m;

where src(c) and dst(c) return the source and destination task of the communication c; M is the mapping function; NumberOfAdmissiblePaths(R, P_s, P_d) returns the number of paths R allows from P_s to P_d; and NumberOfPaths(P_s, P_d) returns the total number of paths from P_s to P_d.

Fig. 8 shows the degree of adaptiveness of both APSRA and Chen and Chiu's routing algorithm for a 7 × 7 NoC with a 2 × 2 region placed at the center of the NoC with four access points (Center, 4AP) and one access point (Center, 1AP), and at the bottom left corner of the NoC with three access points (BL, 3AP) and one access point (BL, 1AP). On average the degree of adaptiveness exhibited by APSRA exceeds the 80% mark, which demonstrates the effectiveness of the approach. To compare the algorithms for different region sizes, we define a new adaptivity measure called relative adaptivity. It represents the ratio between the total number of admitted paths when the region is present and the number of paths without the region. Fig. 9a shows the relative adaptivity for a region of varying size located at the center of the NoC, whereas Fig. 9b shows this variation for regions located at the bottom left corner of the NoC. In both cases and for each region the access point is located at the top right corner. As expected, the relative adaptivity in general decreases as the size of the region increases.
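The pseudocode for the degree of adaptiveness can be turned into a small runnable sketch. Several things here are assumptions made for illustration: only minimal paths in a regular mesh without regions are counted, tiles are addressed by (x, y) coordinates, the routing algorithm R is modelled as a predicate `admits(path)`, and `xy_only` is a stand-in restriction rather than either of the algorithms compared in this paper.

```python
from math import comb


def minimal_paths(src, dst):
    """Yield every minimal path between two (x, y) tiles of a regular mesh."""
    if src == dst:
        yield [src]
        return
    (x, y), (dx, dy) = src, dst
    if x != dx:
        for rest in minimal_paths((x + (1 if dx > x else -1), y), dst):
            yield [src] + rest
    if y != dy:
        for rest in minimal_paths((x, y + (1 if dy > y else -1)), dst):
            yield [src] + rest


def number_of_paths(src, dst):
    """NumberOfPaths: count of minimal paths, i.e. C(|dx| + |dy|, |dx|)."""
    hops_x, hops_y = abs(dst[0] - src[0]), abs(dst[1] - src[1])
    return comb(hops_x + hops_y, hops_x)


def degree_of_adaptiveness(communications, mapping, admits):
    """Average, over all communications, of admissible / total paths."""
    ratios = []
    for src_task, dst_task in communications:
        p_s, p_d = mapping[src_task], mapping[dst_task]
        n = sum(1 for path in minimal_paths(p_s, p_d) if admits(path))
        m = number_of_paths(p_s, p_d)
        ratios.append(n / m)
    return sum(ratios) / len(ratios)


def xy_only(path):
    """Stand-in for R: admit a path only if all X hops precede all Y hops."""
    moved_in_y = False
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        if y0 != y1:
            moved_in_y = True
        elif moved_in_y:
            return False
    return True


# Hypothetical mapping and communication graph.
mapping = {"T1": (0, 0), "T2": (2, 2), "T3": (3, 0)}
communications = [("T1", "T2"), ("T1", "T3")]
fully_adaptive = degree_of_adaptiveness(communications, mapping, lambda p: True)
deterministic = degree_of_adaptiveness(communications, mapping, xy_only)
```

With these inputs the fully adaptive case yields 1.0, while the XY restriction admits one of six paths for T1 → T2 and the single path for T1 → T3, giving (1/6 + 1)/2 = 7/12. The closed form in `number_of_paths` also shows how quickly the path count grows with distance, which is the source of the complexity discussed earlier for long distance communications.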
For regions located at the corner of the NoC there is a minimum in relative adaptivity when the region size is ... (or half the dimension of the mesh NoC). If the region size increases further, the relative adaptivity increases. This effect is caused by the fact that a region located at the bottom left corner of the NoC obstructs only communications between nodes located in the north quadrant and the east quadrant of the region. The number of these nodes is equal for regions 3 × 3, 4 × 3, and 4 × 4. For this reason, whilst the number of paths without region decreases on average (because the access point moves in the direction of the center of the NoC), the number of paths with region remains nearly the same when the region size increases from 3 × 3 to 4 × 3 and further to 4 × 4.

Fig. 8. Adaptiveness vs. access points and placement of regions.

Fig. 9. Relative adaptiveness vs. size of region: (a) region in centre and (b) region in bottom left corner.

Simulation based evaluation

In addition to the analysis of adaptivity, we evaluated the two algorithms by using simulation models. For this purpose we developed a NoC simulator in SDL (Specification and Description Language). The simulator supports both regular and irregular mesh topologies. To understand the basic behaviour of the algorithms, our first simulations are performed with synthetic communication patterns, where a single region is placed at different positions in a 7 × 7 mesh. In a second set-up we use the communication pattern of a real multimedia application. The simulation model is in this case an 8 × 8 NoC with a total of 5 regions.

The simulator implements wormhole switching with a packet size of 10 flits. Every router has two-flit input buffers and one-flit output buffers. The router simultaneously routes packets destined to non-conflicting output ports. The minimal link delay is three cycles/flit and the maximum link bandwidth is 0.5 flits/cycle (1 packet/20 cycles). Cores are modeled as traffic generators, and the resource network interface has an output buffer large enough to keep packet generation unaffected by network conditions. The flits in a packet are sent in burst mode at the maximum link bandwidth, and the gap between the packets has a Poisson distribution with λ = 10. Simulations were carried out using the Telelogic SDL simulation tool (Tau).

The following parameters were used to study the performance of a NoC platform. Performance values were collected over 60,000 packets, after a warm-up session of ...,000 packets.

Average Latency: The average transmission delay of a packet from the source (when the header leaves) to the destination (when the tail has arrived).

Blocked Routing Cycles/Router: The total number of routing cycles during which packets were blocked in a router.

Latency is measured to get an overall view of how the performance of the network is affected by changes in network configuration and packet injection rate. Blocked Routing Cycles/Router can show where the network is most congested.

Results with synthetic traffic patterns

Destinations for generated packets are randomly selected, with a hot-spot probability of 60% for region access points. We compare APSRA and Chen and Chiu's algorithm with the region either in the bottom left corner with three access points (bl_ap3) or in the centre of the network with four access points (c_ap4). Latency values are in this case averaged over five random traffic scenarios to reduce the risk of exceptional cases.

Communication traffic is classified into three types: traffic to region, where the region is the destination; other traffic, where a resource other than the region is the destination; and all communications, which is the aggregate of the first two types.

The first result shows the average latency for all communications in the network, as depicted in Fig. 10.
The lowest latency values are obtained for APSRA with a central region (apsra_c_ap4). The second lowest latency values are obtained with Chen and Chiu's algorithm and a central region (chiu_c_ap4). Although APSRA clearly displays lower latency for the identical case, this indicates that the position of the region is of higher importance than the choice of routing algorithm. When the region is placed at the bottom left corner, APSRA (apsra_bl_ap3) again shows lower latency than Chen and Chiu's algorithm (chiu_bl_ap3). The difference is not as large as in the central region set-up, but seems to grow with increased load.

Fig. 10. Average latency for all communications with region placed in bottom left (bl) and centre (c), vs. packet injection rate in % of link bandwidth (LBW).

In Fig. 11 we give the average latency for traffic with destinations other than the region. The worst combination from a latency point of view, up to an injection rate of 5%, is Chen and Chiu's algorithm with the region in the centre (chiu_c_ap4). All the other combinations provide similar latency values in this range. However, when the injection rate is increased above 5%, Chen and Chiu's algorithm with the region in the corner position (chiu_bl_ap3) rapidly saturates. Next to saturate is APSRA with the region in the corner (apsra_bl_ap3). The best result from a saturation point of view is obtained with APSRA and the region in the centre (apsra_c_ap4), although it has slightly higher latency at lower injection rates. In any case, placing a region in the centre seems to have less tendency to create severe congestion.

Fig. 11. Average latency for communications destined outside region, with region in bottom left (bl) and centre (c), vs. injection rate in % of link bandwidth (LBW).

We also give results for traffic destined only to the region (see Fig. 12). In this case too, APSRA with a central region shows the best performance in terms of low latency. Here, however, Chen and Chiu's algorithm with a central region clearly gives better results than both algorithms with the region at the bottom left position. Compared with the traffic to region results, this is the case for all injection rates. The worst performance in this measurement is again shown by Chen and Chiu's algorithm with the region in the bottom left corner.

Fig. 12. Average latency for communications destined to region in bottom left (bl) and centre (c), vs. injection rate in % of link bandwidth (LBW).

Fig. 13 gives more detail about what causes the difference in latency values. The diagrams show for how many routing cycles the packets were blocked in different routers. These results are from one of the simulations with 10% packet injection rate, where the difference in latency was very large. Note that the scale of blocked routing cycles is not the same in the two diagrams.

Fig. 13a and b reveal that the APSRA algorithm does not cause as much blockage as Chen and Chiu's algorithm. It can be noted that the APSRA algorithm, in Fig. 13a, also shows a large number of blocked packets around the border of the region. This increase results from packet routes which have to circumvent the region to reach their destination. Still, the distribution is fairly even and the number of blocked cycles is much smaller than for Chen and Chiu's algorithm, in Fig. 13b.

Note that Chen and Chiu's algorithm results in more blockages close to the north and west border of the region. The reason is that this path is highly utilized by the algorithm when routing around the region border. As a result these paths easily become congested, which results in more situations where packets get blocked. APSRA, on the other hand, is not biased towards specific routes, and thus spreads the traffic more evenly around the border.
As APSRA in many situations has several paths to select from, it is also possible to avoid congested routes, which further decreases the blockage.

Fig. 13. Blocked routing cycles/router with (a) APSRA algorithm and (b) Chen and Chiu's algorithm.

Multimedia application

As a real case study, we consider a multimedia application which implements an H.263 video decoder and an MP3 audio decoder [25]. Fig. 14 shows the communication graph and the mapping of the tasks onto the NoC. The mapping has been obtained by using a modified version of the approach presented in [26]. A total of five regions are used in this case. Three big regions are used to host two memories and a buffer. Two small regions host the motion compensation (MC) block and the ADD block. We consider one access point for each region. The location of the access point is represented with a black dot. The remaining gray tiles of the NoC are supposed to communicate in a random fashion.

Fig. 14. Multimedia application, (a) communication graph and (b) topology mapping.

The degree of adaptiveness exhibited by the routing algorithm generated by APSRA is ... . In particular, the communications belonging to the audio/video decoder can be routed using a minimal fully adaptive routing algorithm. Only a few restrictions on routing are applied to the random traffic.

Fig. 15 shows the average latency for different packet injection rates exhibited by APSRA and Chen and Chiu's algorithm. As can be seen, the APSRA algorithm has a performance advantage. The latency at lower loads is clearly lower, and for higher packet injection rates APSRA manages to keep the communication below saturation up to a packet injection rate of 45%. Chen and Chiu's algorithm instead saturates at 30%.

Fig. 15. Average latency for multimedia application, vs. injection rate in % of link bandwidth (LBW).

Discussion on results

The simulation results show that APSRA has an overall advantage in communication latency for identical traffic scenarios. This is probably an effect of its unbiased behavior, as compared to Chen and Chiu's algorithm, which has a tendency to create highly congested routes. In addition, the higher adaptivity of the algorithm makes it possible to avoid congested routes. This is especially visible in the results for the traffic not destined to the region. In this case, a large difference is shown between APSRA and Chen and Chiu's algorithm for the region in the centre. Even though the average distance for APSRA is slightly longer for a region in the centre, as indicated by somewhat higher latency at lower loads, APSRA manages to keep communication below saturation up to approximately 8%. For the same scenario, Chen and Chiu's algorithm has significantly higher latency.

Considering traffic to the region, the latency is dominated more by the distance from sources to destinations, which in this case is shorter with a centrally placed region. Since traffic to the region has a probability of 60%, this also dominates the average latency when we consider all communications. A large difference can be identified when comparing injection rates and saturation between the synthetic and multimedia simulations. This can be explained by two different properties of the models. First, the number of communications is larger in the synthetic simulations; on average every node generates traffic to one other node. Second, the hot-spot traffic increases the risk of high local traffic rates, which further increases the risk of congestion.

Conclusions

In this paper we have highlighted the importance of the region concept in mesh topology NoC architecture. We have also listed new issues which a designer will encounter while designing a heterogeneous mesh topology NoC system using multi-port or multi-access point cores. We presented and compared two deadlock free routing algorithms for mesh NoC with regions. Our analysis and simulation based evaluation demonstrate that minimal distance deadlock free algorithms designed using the APSRA methodology outperform the other algorithm, borrowed from the fault tolerant area, in terms of adaptivity and latency. The area of a NoC router required by the APSRA based algorithm is expected to be larger than that of the router for the other algorithm. This is because APSRA requires tables (memory) within each router to store routing information, whereas the other algorithm can be implemented as an optimized FSM. However, routing table compression techniques can be used to improve the cost/performance tradeoff in table-based routers [27,38]. In [29], we have shown that re-configurability of routing tables can be used to enhance communication performance for applications in which communication patterns change during their execution. Future developments will mainly address the definition of design space exploration strategies to optimally determine region placement, shape, and number of access points.

Acknowledgements

We thank Prof. Petru Eles for valuable discussions and suggestions during the development of this research.
The work reported in this paper was supported by the project Specialization and Evaluation of Network on Chip Architectures for multi-media applications, funded by the Swedish K.K. Foundation. We are also thankful to the anonymous reviewers for their constructive comments, which helped us to improve the manuscript.

References

[1] S. Kumar, A. Jantsch, J-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä, A. Hemani, A network on chip architecture and design methodology, in: Proceedings IEEE Annual Symposium on VLSI, Pittsburgh, PA, USA, April 2002.
[2] W.J. Dally, B. Towles, Route packets, not wires: On-chip interconnection networks, in: Proceedings Design Automation Conference, Las Vegas, NV, June 2001.
[3] P. Guerrier, A. Greiner, A generic architecture for on-chip packet-switched interconnections, in: Proceedings Design and Test in Europe, March 2000.
[4] E. Bolotin, A. Morgenshtein, I. Cidon, R. Ginosar, A. Kolodny, Automatic hardware-efficient SoC integration by QoS network on chip, in: Proceedings IEEE International Conference on Electronics, Circuits and Systems, December 2004.
[5] P.P. Pande, C. Grecu, A. Ivanov, R. Saleh, Design of a switch for network on chip applications, in: Proceedings International Symposium on Circuits and Systems (ISCAS), vol. 5, May 2003.
[6] E. Fleury, P. Fraigniaud, A general theory for deadlock avoidance in wormhole-routed networks, IEEE Transactions on Parallel and Distributed Systems 9 (7) (1998).

[7] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, 2002.
[8] P. Vellanki, N. Banerjee, K.S. Chatha, Quality-of-service and error control techniques for mesh-based network-on-chip architectures, Integration VLSI Journal 38 (3) (2005).
[9] L.M. Ni, P.K. McKinley, A survey of wormhole routing techniques in direct networks, IEEE Computer 26 (1993) 62-76.
[10] C.J. Glass, L.M. Ni, The turn model for adaptive routing, Journal of the Association for Computing Machinery 41 (5) (1994) 874-902.
[11] C. Neeb, N. Wehn, Designing efficient irregular networks for heterogeneous systems-on-chip, in: EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, August.
[12] W.J. Dally, H. Aoki, Deadlock-free adaptive routing in multicomputer networks using virtual channels, IEEE Transactions on Parallel and Distributed Systems 4 (4) (1993).
[13] R.V. Boppana, S. Chalasani, Fault-tolerant wormhole routing algorithms for mesh networks, IEEE Transactions on Computers (7) (1995).
[14] J. Zhou, C.M. Lau, Fault-tolerant wormhole routing in 2D meshes, in: Proceedings International Symposium on Parallel Architectures, Algorithms and Networks, December 2000.
[15] J. Wu, A fault-tolerant and deadlock-free routing protocol in 2D meshes based on odd-even turn model, IEEE Transactions on Computers 52 (9) (2003).
[16] K.-H. Chen, G.-M. Chiu, Fault-tolerant routing algorithm for meshes without using virtual channels, Journal of Information Science and Engineering 14 (4) (1998).
[17] R. Holsmark, S. Kumar, Design issues and performance evaluation of mesh NoC with regions, in: Proceedings Norchip Conference, Oulu, Finland, November 2005.
[18] A. Mejia, J. Flich, J. Duato, S.-A. Reinemo, T. Skeie, Segment-based routing: An efficient fault-tolerant routing algorithm for meshes and tori, in: Parallel and Distributed Processing Symposium, April.
[19] S. Murali, D. Atienza, L. Benini, G. De Micheli, A multi-path routing strategy with guaranteed in-order packet delivery and fault tolerance for networks on chip, in: Proceedings Design Automation Conference, San Francisco, California, USA, July 2006.
[20] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Transactions on Parallel and Distributed Systems 4 (12) (1993).
[21] W.J. Dally, C. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Transactions on Computers 36 (5) (1987).
[22] M. Palesi, R. Holsmark, S. Kumar, V. Catania, A methodology for design of application specific deadlock-free routing algorithms for NoC systems, in: Proceedings International Conference on Hardware-software Codesign and System Synthesis, Seoul, Korea, October 2007.
[23] S. Ishiwata et al., A single-chip MPEG-2 codec based on customizable media embedded processor, IEEE Journal of Solid-State Circuits 38 (3) (2003).
[24] G.-M. Chiu, The odd-even turn model for adaptive routing, IEEE Transactions on Parallel and Distributed Systems 11 (7) (2000).
[25] K. Srinivasan, K.S. Chatha, G. Konjevod, Linear-programming-based techniques for synthesis of network-on-chip architectures, IEEE Transactions on Very Large Scale Integration Systems 14 (4) (2006).
[26] G. Ascia, V. Catania, M. Palesi, A multi-objective genetic approach to mapping problem on network-on-chip, Journal of Universal Computer Science 12 (4) (2006).
[27] M. Palesi, S. Kumar, R. Holsmark, A method for router table compression for application specific routing in mesh topology NoC architectures, in: SAMOS VI: Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July.
[28] R. Holsmark, M. Palesi, S. Kumar, Deadlock free routing algorithms for mesh topology NoC systems with regions, in: EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, August 2006.
[29] M. Palesi, S. Kumar, R. Holsmark, V. Catania, Exploiting communication concurrency for efficient deadlock free routing in reconfigurable NoC platforms, in: 14th Reconfigurable Architectures Workshop, March 27-28, 2007, Long Beach, California, USA.
[30] X. Wang, D.S.-Tortosa, T. Ahonen, J. Nurmi, Asynchronous network node design for network-on-chip, in: International Symposium on Signals, Circuits and Systems, July 2005.
[31] A.S. Vaidya, A. Sivasubramaniam, C.R. Das, LAPSES: A recipe for high performance adaptive router design, in: 5th International Symposium on High-Performance Computer Architecture, January 1999.
[32] J.-M. Chang, M. Pedram, Codex-dp: Co-design of communicating systems using dynamic programming, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19 (7) (2000).
[33] T. Lei, S. Kumar, A two-step genetic algorithm for mapping task graphs to a network on chip architecture, in: EUROMICRO Symposium on Digital Systems Design, September.
[34] S. Murali, G. De Micheli, Bandwidth-constrained mapping of cores onto NoC architectures, in: Design, Automation, and Test in Europe, February 2004.
[35] J. Hu, R. Marculescu, Energy- and performance-aware mapping for regular NoC architectures, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24 (4) (2005).
[36] U.Y. Ogras, R. Marculescu, It's a small world after all: NoC performance optimization via long-range link insertion, IEEE Transactions on Very Large Scale Integration Systems 14 (7) (2006).
[37] J. Flich, A. Mejia, P. Lopez, J. Duato, Region-based routing: An efficient routing mechanism to tackle unreliable hardware in network on chips, in: First IEEE/ACM International Symposium on Networks-on-Chip, May.
[38] E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, Routing table minimization for irregular mesh NoCs, in: Design Automation and Test in Europe, March.
[39] R. Holsmark, S. Kumar, Corrections to Chen and Chiu's fault tolerant routing algorithm for mesh networks, Journal of Information Science and Engineering 23 (6) (2007).

Rickard Holsmark is a Ph.D. student with the Embedded System Group at the School of Engineering, Jönköping University, Sweden. His research is focused on specialized architectures and routing algorithms for Networks on Chip. Other areas of interest are embedded systems in general, system level design and processor architectures. He received a Bachelor of Science degree (2001) in electronics, with specialization in microcontroller systems. After this he completed a Master of Science degree (2003) in electronics, with specialization in embedded systems. Both of these degrees were received at Jönköping University.

Maurizio Palesi received the Dr. Eng. degree and the Ph.D. degree in computer engineering from Università di Catania, Italy, in 1999 and 2003 respectively. Since December 2003, he has held a research contract as Assistant Professor at the Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Facoltà di Ingegneria, Università di Catania. Since January 2007 he has been Associate Editor of VLSI Design Journal, Hindawi Publishing Corporation. His research focuses on platform based system design, design space exploration, low-power techniques for embedded systems, and Network-on-Chip architectures.

Shashi Kumar is a professor of Embedded Systems at the School of Engineering, Jönköping University. His research interests include system-level modeling and synthesis, parallel architectures and algorithms, reconfigurable computing and heuristic search algorithms. He was a member of the team which was the first to propose the idea of packet switched communication for on-chip communication and coined the term Network on Chip (NoC) in ... . Prof. Kumar has an interest in various aspects of NoC design, including NoC topologies, QoS issues in NoC communication, NoC architectural modeling and evaluation, application specific NoC architecture design, mapping applications to NoC platforms and testing of NoC. He received B.Tech, M.Tech and PhD degrees from the Indian Institute of Technology Delhi in 1974, 1976 and 1985 respectively.
