Efficient Lookahead Routing and Header Compression for Multicasting in Networks-On-Chip


Lei Wang, Poornachandran Kumar, Rahul Boyapati, Ki Hwan Yum, Eun Jung Kim
Department of Computer Science and Engineering
Texas A&M University
College Station, TX 7784
{wanglei, poorna, rahul, yum,

ABSTRACT

As technology has advanced, Chip Multi-processor (CMP) architectures have emerged as a viable solution for designing processors. Networks-on-Chip (NOCs) provide a scalable communication method for CMP architectures as the number of cores increases. Although there has been significant research on NOC designs for unicast traffic, research on multicast router design is still in its infancy. Considering that one-to-many (multicast) and one-to-all (broadcast) traffic are more common in CMP applications, it is important to design a router that provides efficient multicasting. In this paper, we propose an efficient lookahead routing design with limited area overhead for a recently proposed multicast routing algorithm, Recursive Partitioning Multicast (RPM) [17]. We also present a novel compression scheme for the multicast packet header, which becomes a large overhead in large networks. Comprehensive simulation results show that, with our route computation logic design, providing lookahead routing in the multicast router costs less than 2% area overhead, and this percentage keeps decreasing with larger network sizes. Compared with the basic lookahead routing design, our design saves over 5% area. With header compression and lookahead multicast routing, network performance improves by 22% on average in a (16 × 16) network.

Categories and Subject Descriptors
C.1.2 [Computer Systems Organization]: Multiprocessors - Interconnection architectures; C.1.4 [Parallel Architectures]: Distributed architectures

General Terms
Design, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ANCS'10, October 25-26, 2010, La Jolla, CA, USA. Copyright (c) 2010 ACM.

1 INTRODUCTION

With the growing number of cores, providing efficient communication in a single die is becoming critical for CMPs and System-on-Chips (SoCs) [2] [12]. Traditional interconnects such as shared buses and dedicated wires do not give a scalable solution, and they increase the complexity of chip designs. Networks-on-chip (NOCs) have been widely accepted as a promising architecture to orchestrate chip-wide communication in CMPs such as Intel Teraflop 80-core [8], Tilera 64-core [18], RAW [15] and TRIPS [7]. As the number of processing cores increases, the size of the NOC also increases to provide connectivity to all processor cores. Thus, high communication latency will be unavoidable in the near future. To overcome it, a low-latency router that fits within a limited area budget and power constraint is urgently needed. There have been a handful of studies on low-latency router designs using speculation, pre-computation, or aggressive flow control. However, in all previous work, NOCs have been designed for unicast traffic and are very inefficient at handling multicast and broadcast traffic.

Various CMP applications and programming models require one-to-many communication such as broadcast and multicast in NOCs. In cache-coherent shared memory systems with a large number of cores, it is essential to provide efficient multicasting to maximize performance. Cache coherence protocols are known to rely heavily on multicast or broadcast communication to maintain ordering amongst requests [13] or to invalidate shared data spread across different caches using a directory.

Recent proposals for multicast routing in NOCs are Virtual Circuit Tree Multicasting (VCTM) [9], bLBDR [14] and RPM [17]. VCTM [9] proposes an efficient multicast and broadcast mechanism. Before sending multicast packets, VCTM needs to send a setup packet to build a tree, and it requires extra storage to maintain the tree information for multicast. bLBDR [14] enables the concept of virtualization at the NOC level and isolates the traffic into different domains. Multicasting in bLBDR is based on broadcasting in a small domain. RPM [17], a recently proposed multicast routing algorithm, is scalable and deadlock free. It provides multicast routing based on recursive partitioning of the whole network. In a 2-D mesh topology, the whole network is divided into eight parts according to the position of the current node (some parts can be empty if the current node is on the network edge). When a multicast packet is generated, according to the distribution of destination nodes in the eight parts, the source router will decide how many copies it needs and in which
direction each replica should be sent. After each replica arrives at the next hop, that node becomes a new source and partitions the network based on its own position. This partitioning procedure occurs recursively until all the destination nodes receive a copy of the packet. Compared with VCTM and bLBDR, RPM is more bandwidth-efficient and scalable. However, it is challenging to design a low-latency router supporting RPM due to the complexity of the routing computation logic and the overhead of a big packet header.

In a wormhole-switched network, the router latency plays a dominant role in the packet latency. Reducing the router latency with fewer pipeline stages is critical to improving packet latency. Lookahead routing [5] in unicast is used to remove route computation from the critical path by calculating a packet's route one hop ahead of the current router. However, providing lookahead routing for RPM is not easy, since some intermediate routers need to make replicas of one packet, which means a router can have more than one downstream router. How to efficiently calculate routing information for multiple downstream routers is an open problem.

Another problem in RPM is the overhead of the packet header. The packet header normally carries the whole destination list, and RPM uses bit string encoding to implement the destination list. As the network size grows, the number of bits increases. Taking a (16 × 16) mesh network as an example, we need to define a 256-bit header. Considering that the link width is 128 bits and the flit size is the same as the link width, we need two flits to carry the destination list. It takes two cycles for the downstream router to receive the whole header, which means routing calculation cannot be done in one cycle. This is equivalent to increasing the number of router pipeline stages, which eventually degrades the network performance.

Motivated by these problems of RPM, we propose an efficient lookahead routing design and introduce a novel compression scheme.
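The header-size argument above (a 256-bit destination list sent over 128-bit links) can be checked with a short calculation. This is only a sketch under stated assumptions: the function name `header_flits` is hypothetical, the destination list uses one bit per node, and all other header fields are ignored.

```python
import math


def header_flits(mesh_dim: int, link_width_bits: int = 128) -> int:
    """Flits needed to carry a bit-string destination list in a mesh_dim x mesh_dim mesh.

    Assumption for illustration: one bit per node, flit size equal to the
    link width, and no other header fields counted.
    """
    if mesh_dim <= 0 or link_width_bits <= 0:
        raise ValueError("mesh_dim and link_width_bits must be positive")
    destination_bits = mesh_dim * mesh_dim
    return math.ceil(destination_bits / link_width_bits)


if __name__ == "__main__":
    for dim in (4, 8, 16, 32):
        print(f"({dim} x {dim}) mesh: {dim * dim} bits, {header_flits(dim)} flit(s)")
```

Under these assumptions, a (16 × 16) mesh needs two flits for the uncompressed destination list, which matches the two-cycle header reception described in the text.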
Our route computation logic can support lookahead routing with limited area and power overhead. The proposed compression scheme can greatly reduce the overhead of a head flit in multicast traffic. Our compression stage can be overlapped with the lookahead routing stage, which means the compression stage is not on the critical path and has no effect on router latency.

Our main contributions are summarized as follows:

- We explore the detailed design of a multicast router, especially how to provide lookahead routing with a recent multicast routing algorithm.
- We analyze the overhead of a multicast header and propose a compression scheme. As far as we know, this is the first work to introduce header compression into the NOC router design.
- We evaluate our multicast router design using the most recent multicast routing algorithm by varying the traffic patterns and network sizes.

Detailed simulation results show that, with our route computation logic design, providing lookahead routing in a multicast router costs less than 2% area overhead, and this percentage keeps decreasing with larger network sizes. Compared with the basic lookahead routing design, our design saves over 5% area. With header compression and lookahead multicast routing, the network performance can be improved by over 22% in a large network.

The rest of this paper is organized as follows. We briefly present the unicast and multicast router architectures and RPM routing in Section 2. We propose the multicast lookahead routing design in Section 3. The header compression scheme is discussed in Section 4. In Section 5, we describe the evaluation methodology and summarize the simulation results. Finally, we draw conclusions in Section 6.

2 BACKGROUND

In this section, we present a unicast router architecture and its pipeline stages. Then we briefly describe a recently proposed multicast routing algorithm, Recursive Partitioning Multicast (RPM).

2.1 Unicast Pipelined Router Architecture

Figure 1 shows a virtual channel (VC) unicast
router architecture used in NOCs [4]. The main building blocks are the input buffers, route computation logic, VC allocator, switch allocator, and crossbar. To achieve high performance, unicast routers process packets in four pipeline stages: routing computation (RC), VC allocation (VA), switch allocation (SA), and switch traversal (ST). First, the RC stage directs a packet to the proper output port of the router. Next, the VA stage allocates one available VC of the downstream router determined by RC. The SA stage arbitrates the input and output ports of the crossbar, and then successfully granted flits traverse the crossbar in the ST stage.

Figure 1: Unicast Pipelined Router Architecture

Considering that only the head flit needs routing computation and middle flits always have to stall at the RC stage, recent router designs use techniques such as lookahead routing [5] to reduce the number of pipeline stages. The functionality of lookahead routing is the same as that of the normal RC stage, which is to calculate the output ports for packets. However, instead of calculating routing information for the current router, lookahead routing does the calculation for the next hop and stores the routing information in the head flit. For the current router, the routing information has already been obtained in the upstream router. The RC stage and the VA stage can therefore be overlapped, since the VC allocator does not need to wait for the output of the RC logic. In this way, lookahead routing removes route computation from the critical path. Recent work [6] [11] uses lookahead signals or advanced bundles to implement lookahead routing for unicast routers.

2.2 Multicast Router Architecture

Multicast (one to many) and broadcast (one to all) refer to traffic patterns in which the same message is sent from one source node to a set of destination nodes. Compared with the unicast router design, the multicast router has its own characteristics. In multicast traffic, a packet that has multiple destinations needs to be replicated into several copies in intermediate routers. To support a multicast function, routers need a replication component. To avoid the storage overhead of replica management, replication normally takes place at the ST stage, and the basic unit is a flit rather than a packet. The RC stage in multicast not only calculates the output directions but also decides how many replicas the current router should make. At the VA stage, the virtual channel allocator should consider multiple requests from the same packet, which may traverse to different output ports at the same time. The same situation occurs at the SA stage. Here, we can see that the key point in multicast router design is the RC logic.

2.3 RPM Routing

Recently Wang et al. [17] proposed a deadlock-free multicast routing algorithm for NOCs, Recursive Partitioning Multicast (RPM). Unlike previous work such as VCTM [9], RPM does not maintain any lookup table for multicast in each router. RPM directly sends out multicast packets without sending a unicast+setup packet to each destination first. In other words, RPM does not need to build a tree structure in each intermediate router before sending the real multicast data packet. However, in each multicast packet header or head flit, RPM needs one field to indicate the positions of all the destination nodes. This overhead of the head flit can be one drawback of RPM, which makes it hard to scale to large networks. Head flits are discussed in detail in Section 4.

The core idea of RPM is that the routing decision is made based on the current network partitioning. A current router is a router which receives one copy of the original multicast packet, even if it is not one of the destinations. The current router divides the whole network into at most eight parts according to its position. Each destination node in a multicast packet belongs to one of these parts. In the general case, if the source node is in the center of the network, all eight parts have at least one node. However, if the source node is located in a corner or on an edge of the network, some parts may be empty. Taking a (4 × 4) mesh network as an example, partitioning is defined as in Figure 2. The network is recursively partitioned according to the position of the current router until all the destinations get the packet. Special rules are provided to avoid redundant replication of packets in each intermediate router. Virtual networks (VN), which separate a physical network into multiple networks, are introduced to make RPM a deadlock-free algorithm.

Figure 2: Network Partitioning Based on Source Node Positions. Normally the whole network will be divided into 8 parts. However, for the nodes located in the corner or edge, some parts are empty. [Panels: Eight Parts; Three Parts (5, 6, 7); Three Parts (0, 1, 7); Three Parts (3, 4, 5); Three Parts (1, 2, 3)]

Figure 3 shows a multicast traffic example in a (4 × 4) mesh network, where the source is Node 9 and its six destinations are 0, 1, 2, 3, 12 and 14. The RPM partition logic divides the network according to the source, Node 9. Destinations 2 and 3 lie in Part 0, destination 1 lies in Part 1, destination 0 lies in Part 2, destination 12 lies in Part 4 and destination 14 lies in Part 6. According to the RPM routing rules, the current Node 9 needs to make two replicas, one for North and the other for South. To avoid redundant replication, destinations 0, 1, 2 and 3 only stay in the North packet header, while destinations 12 and 14 only stay in the South packet header. When the North replica arrives at Node 5 and the South replica gets to Node 13, Node 5 and Node 13 become the new source nodes and go through the same procedure as Node 9. This procedure occurs recursively until all the destinations receive one copy of the original data.

Figure 3: Packet Header Patterns in a Multicast Traffic. After making two replicas (North and South), there are a lot of zeros in each packet header which can be potentially compressed.
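The eight-way partitioning in the Figure 3 example can be sketched in a few lines. This is an illustrative sketch, not the RPM hardware logic: the function names are hypothetical, nodes are assumed to be numbered row by row with row 0 at the north edge, and the part numbering (0 = north-east, then counter-clockwise through 1 = north, 2 = north-west, 3 = west, 4 = south-west, 5 = south, 6 = south-east, 7 = east) is inferred from the example in the text. The replication rules that decide which output direction serves each part are not modelled.

```python
from collections import defaultdict


def part_of(current: int, dest: int, n: int) -> int:
    """Part index (0-7) of node dest relative to node current in an n x n mesh.

    Assumed numbering (inferred from the Node 9 example): 0 = NE, 1 = N,
    2 = NW, 3 = W, 4 = SW, 5 = S, 6 = SE, 7 = E; row 0 is the north edge.
    """
    if dest == current:
        raise ValueError("the current node is not in any part")
    cur_row, cur_col = divmod(current, n)
    dst_row, dst_col = divmod(dest, n)
    north, south = dst_row < cur_row, dst_row > cur_row
    east, west = dst_col > cur_col, dst_col < cur_col
    if north and east:
        return 0
    if north and not west:
        return 1  # same column, above the current node
    if north:
        return 2
    if west and not south:
        return 3  # same row, left of the current node
    if south and west:
        return 4
    if south and not east:
        return 5  # same column, below the current node
    if south:
        return 6
    return 7  # same row, right of the current node


def partition(current: int, dests, n: int) -> dict:
    """Group the destinations by part, as the partition logic does with OR gates."""
    parts = defaultdict(list)
    for dest in dests:
        parts[part_of(current, dest, n)].append(dest)
    return {part: sorted(nodes) for part, nodes in sorted(parts.items())}


if __name__ == "__main__":
    # Source Node 9 in a (4 x 4) mesh with destinations 0, 1, 2, 3, 12 and 14.
    print(partition(9, [0, 1, 2, 3, 12, 14], 4))
```

For the example packet, the sketch groups destinations 2 and 3 into Part 0, destination 1 into Part 1, destination 0 into Part 2, destination 12 into Part 4 and destination 14 into Part 6, in agreement with the description above.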
3 MULTICAST LOOKAHEAD ROUTER DESIGN

In unicast traffic, the packet at the head of each VC needs to use the route computation logic and proceed to one output port. This means that, for each packet, the current router has only one downstream router. To support lookahead routing in unicast, each router can have one route computation logic block for each input port or each VC. However, in multicast traffic, one packet may go to several output ports in an intermediate router if the packet needs to be replicated in this router. Then, for this packet, the current router will have more than one downstream router. To provide multicast lookahead routing, each input port or VC should have more than one route computation logic block in order not to increase the latency of the RC stage; otherwise, routers must obtain all the next-hop route information over multiple cycles without adding any logic. Taking a 2D mesh topology as an example, each router has at most four neighboring routers. In the worst case, to support multicast lookahead routing, each router should have three [1] route computation logic blocks to calculate route information for all the downstream routers in one single cycle. Without adding more RC logic blocks, performing lookahead routing for multicast takes more than one cycle, and the benefit of lookahead routing is lost.

[1] We assume minimal routing. Packets cannot be routed back.

By analyzing the RPM routing logic design, we find that multicast lookahead routing can be efficiently supported by redesigning its partitioning logic. The RPM partition logic occupies an area that scales linearly with the network size. In an $(n \times n)$ network, the RPM partition logic block consumes an area on the order of $n^2$ times the area of an OR gate. An example of the partition logic is shown in Figure 4. It is implemented for Router 9 in a (4 × 4) network as shown in Figure 10. The destination bits of a multicast packet generate part signals for eight parts using OR gates. Although there are four next-hop routers, their partitions have a considerable overlap. This overlap can be exploited to remove redundant logic, to avoid providing three separate RPM routing logic blocks, and finally to reduce the area.

Figure 4: A Partitioning Logic for RPM. The partitioning logic is simply implemented using OR gates with inputs from the destination list. Each router has its own logic based on its position in the network.

Considering the (8 × 8) network in Figure 5, the current router and two next-hop routers are marked in the figure. The eight network parts for the two next-hop routers are represented by the solid and the dashed lines. It can be clearly seen that the parts of the two next-hop routers share a lot of area. The overlapping portions are depicted with the routers in gray. Hence, there is a lot of redundancy in the partition logic if three RPM logic blocks are implemented separately for the three next-hop routers.

Figure 5: Basic Network Partitioning for Two Next-Hop Routers. The grey nodes are the overlapping portions of the two partitionings.

The new partitioning scheme is illustrated in Figure 6. Instead of directly dividing the network into 8 parts, the network is first divided into 24 parts, and the eight part signals for each of the three next-hop routers are computed from these 24 part signals. The 24 parts comprise 8 immediate surrounding neighbor routers and 16 parts that may contain more than one router. The part signal for each of the 24 parts is generated by OR-ing the destination bits in the header. The eight part signals for each of the next-hop routers are then computed using the 24 part signals. Table 1 shows how the Part 0-7 signals are calculated for the two next-hop routers (Node 1 and Node 7) in Figure 6. From Table 1, it can be observed that $P_0$ is used to compute the Part 0 signal for both the North and the East next-hop routers. Similarly, $P_3$ and $P_4$ are used to compute the Part 2 signal. In this way, we can reuse the common logic and avert redundancy.

The area can be roughly obtained from the number of OR gates used to generate the part signals. The number of OR gates required for $n$ bits representing $n$ routers is $n - 1$. Therefore, each part which contains $a \times b$ routers needs $ab - 1$ OR gates to generate its part signal. The original 8-part partition scheme for an $(n \times n)$ network is shown in Figure 7. The number of OR gates can be calculated for each of the 8 parts and then summed up to find the total number of gates for each RPM RC logic block. The total number of OR gates is $n^2 - 9$ for an $(n \times n)$ network, since the $n^2 - 1$ destination bits other than the current router are spread over 8 parts. For lookahead routing without area optimization, the number of gates is three times that of a single logic block, which is $3n^2 - 27$. The partitioning scheme for an $(n \times n)$ network with area optimization is shown in Figure 8. The number of OR gates needed to generate the 24 part signals is $n^2 - 25$. The 8 part signals for the next-hop routers can be generated as described in Table 1.
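The counting rule used here (a part with $k$ routers needs $k - 1$ two-input OR gates) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the part geometry (four rectangular corner parts plus four straight strips along the current router's row and column) is an assumption inferred from the partitioning examples in Section 2.3. It reproduces the $n^2 - 9$ count for a router with eight non-empty parts and evaluates the two closed-form counts quoted in the text.

```python
def part_sizes(n: int, row: int, col: int) -> list:
    """Number of routers in each of the eight parts around (row, col).

    Assumed geometry: Parts 0, 2, 4, 6 are the NE, NW, SW, SE rectangles and
    Parts 1, 3, 5, 7 are the N, W, S, E strips in the same column or row.
    """
    north, south = row, n - 1 - row
    west, east = col, n - 1 - col
    return [north * east, north, north * west, west,
            south * west, south, south * east, east]


def or_gates_single(n: int, row: int, col: int) -> int:
    """OR gates for one RPM partition logic block: (size - 1) per non-empty part."""
    return sum(size - 1 for size in part_sizes(n, row, col) if size > 0)


def or_gates_basic_lookahead(n: int) -> int:
    """Three separate partition blocks, as in the unoptimized design: 3n^2 - 27."""
    return 3 * (n * n - 9)


def or_gates_24_parts(n: int) -> int:
    """Gates that generate the 24 shared part signals: n^2 - 25."""
    return n * n - 25


if __name__ == "__main__":
    n = 8
    print(or_gates_single(n, 3, 3), n * n - 9)   # both counts agree for an interior router
    print(or_gates_basic_lookahead(n))           # unoptimized lookahead
    print(or_gates_24_parts(n))                  # shared 24-part signals
```

For routers on an edge or in a corner, some parts are empty, so the count produced by `or_gates_single` differs from $n^2 - 9$; the closed forms in the text describe the general case in which all parts are populated.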

The number of OR gates needed to compute the eight part signals for the three next-hop routers can be counted to be 45 (3 × 15). This is a constant even as the network size increases. The total number of gates for lookahead routing with area optimization is $n^2 - 25 + 45 = n^2 + 20$. Therefore, the extra area overhead of lookahead routing is reduced from $2n^2 - 18$ to 29 gates with the help of the area optimization technique. Detailed area and power analysis of our new RC logic design is in Section 5.

Table 1: Calculation of the 8 Part Signals from the Newly Defined 24 Parts (N: Node, P: Part, OR: OR gate)

Part Num   Direction   Computation logic of next hop
0          N           P0 OR P1
           E           P0 OR P15
1          N           P2
           E           N0 OR P1
2          N           P3 OR P4
           E           N1 OR N2 OR P2 OR P3 OR P4 OR P5
3          N           N2 OR P5
           E           N3 OR P6
4          N           N3 OR N4 OR P6 OR P7 OR P8 OR P9
           E           N4 OR N5 OR P7 OR P8 OR P9 OR P10
5          N           N5 OR P10
           E           N6 OR P11
6          N           N6 OR N7 OR P11 OR P12 OR P13 OR P14
           E           P12 OR P13
7          N           N0 OR P15
           E           P14

Figure 6: A New Network Partitioning Scheme, which divides the network into 24 parts instead of 8 parts.

Figure 7: Gates Calculation with Original RPM Logic. [Total for an $(n \times n)$ network: $n^2 - 9$ OR gates]

4 HEADER COMPRESSION

In RPM routing, the packet header normally carries the whole destination list. When the network size becomes large, such as (16 × 16), the overhead of the header becomes so critical that the downstream router may not receive the whole destination list in one cycle.

Let us go back to the previous multicast example in Figure 3. Since we have 16 nodes in total in a (4 × 4) mesh
network, using bit string encoding needs 16 bits to carry the destination list in the worst case. However, when we analyze this traffic in detail, we find some interesting properties of the bit-encoded destination list. First, in Node 9, the packet is replicated into two copies, one for North and the other for South. According to RPM [17], to avoid redundant replication, the North packet header only carries destinations 0, 1, 2 and 3, while the South packet header carries only destinations 12 and 14. In these two packet headers, at least half of the bits are zeros, since the North packet header never carries destinations below the source row and the South packet header never carries destinations above the source row. Hence, there are twelve continuous zeros in the North packet header, and fourteen zeros in the South packet header, of which twelve are continuous. Second, as a multicast packet traverses the network, each time it arrives at a destination the bit for that position becomes zero, which means that the number of zeros in the header increases as time elapses.

Instead of transmitting all these redundant zeros, the packet header can be compressed. There has been plenty of work on data compression, such as frequent value based compression [19], significance based compression [3] and frequent pattern compression [1]. Frequent value based compression is based on the temporal locality of values accessed in memory, which can therefore be compressed while being transmitted or stored. Significance based compression is built on the fact that most of the significant bits carry redundant information, so an index can be transmitted along with the lower order bits instead of the whole word. In frequent pattern compression, each 32-bit word is encoded as a 3-bit prefix plus data if it matches a pattern in a table of patterns selected for their high frequency in commercial and integer benchmarks. Considering the recursive partitioning performed at each node in RPM, however, we propose a new compression scheme which can efficiently reduce the overhead of a packet header.

Figure 8: Gates Calculation of Routing Logic after Logic Reuse

Figure 9: RPM Header Format and Proposed Header Compression Format

The original packet header format of RPM is shown in Figure 9(a), while the new format is shown in Figure 9(b). The original header contains destinations in serial order from 0 to $n - 1$. The new header format is designed to exploit the features provided by the RPM partition logic. It includes a compression bit to indicate whether the header is compressed or not. The compression bit is followed by three relevant part signal bits $P_i$, $P_j$ and $P_k$, which are obtained from the RPM partition logic. The bits following these part signal bits represent the destinations that belong to the corresponding parts $i$, $j$ and $k$. Parts $i$, $j$ and $k$ are selected based on the direction in which the packet will be sent. For instance, if the packet is sent eastward, the parts $i$, $j$ and $k$ are respectively Parts 0, 6 and 7 of the current router. Destinations in the other parts will not receive this east-direction packet. Hence the bits corresponding to these destinations are always set to 0 in the header. Therefore, the bits in the header for these destinations are removed, as depicted in Figure 9(b). According to the policy of avoiding redundant packet replications in RPM, each direction can have destinations in at most three parts, which means that three bits of part information in the header are always enough. The destination bits for the three parts follow row-wise ordering. The part signals $P_i$, $P_j$ and $P_k$ are set to 1 if at least one destination exists in their respective parts. A part signal of zero indicates that no destination exists in that part, which means that all the destination bits in that part are zero and can be dropped, since we store all the destination bits of a part contiguously.

Figure 10: In a (4 × 4) network, Node 9 sends a multicast packet and Node 10 is the only destination. The whole network is divided into eight parts according to the position of Node 9.

Figure 11: Comparison of Different Headers: (a) RPM format, (b) Proposed Header Compression format and (c) the compressed header for the example in Figure 10.

An example of this compressed packet header in a (4 × 4) network is shown in Figure 10. It shows the current router and the direction in which the packet is traveling. The original packet header and new packet header formats are illustrated in Figure 11(a) and (b). With 16 routers, the original header format requires 16 bits. The size of the new header varies depending on the position of the current router and the destination list. In this example, Router 9 sends a packet to Router 10. Parts $i$, $j$ and $k$ are 0, 6 and 7 respectively. The destination bits are Routers 2, 3, 6 and 7 for Part 0, Routers 14 and 15 for Part 6, and Routers 10 and 11 for Part 7. Since the only destination is Router 10, $P_0$ and $P_6$ are 0 while $P_7$ is 1. The destination bits for $P_0$ (2, 3, 6, 7) and $P_6$ (14, 15) are omitted, because the part bit $P_0$ set to 0 already carries the information that the destination bits for Routers 2, 3, 6 and 7 are zeros, and similarly for $P_6$. The final header after compression is shown in Figure 11(c). The first bit, for compression, is set to 1, which means that this header is compressed. The second, third and fourth bits are the part signals for Parts 0, 6 and 7, which are 0, 0 and 1 respectively. The destination bits for Part 7, whose part signal is 1, are set to 1 and 0 for Routers 10 and 11 respectively. The number of bits is thus reduced from 16 to just 6 in this case. The 6 bits include a compression overhead of 4 bits and just 2 bits for the actual destination bits. The compression overhead of 4 bits is a constant and does not increase with the size of the network. Furthermore, the partition logic can be used to implement the compression, eliminating the need for new compression logic.

Since partitioning in RPM is based on the position of the current router, routers at different positions decode the same header in different ways. Decoding also depends on the traversal direction of the packet. However, the header decoding for a particular router in a given direction is fixed and can be implemented by synthesized hardware logic for each router. Also, the size of a compressed header can vary from packet to packet, depending on whether each of the part signals is set to 0 or 1. For instance, in Figure 10, if Routers 6 and 7 were packet destinations, then $P_0$ would be set to 1 by the RPM partition logic, and hence the destination bits for Routers 2, 3, 6 and 7 would be included in the header. The location of the bits for destination nodes 10 and 11 in the header would therefore change. In this way, the decoding logic of the packet header needs to multiplex among the different possible locations for each destination bit, using $P_i$, $P_j$ and $P_k$ as the selection bits.
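The header format described above can be sketched in software to make the bit layout concrete. This is a minimal sketch under stated assumptions, not the paper's hardware: the function names and the `EAST_OF_NODE_9` constant are hypothetical, the header is modelled as a Python list of bits, only the compressed form (compression bit set to 1) is handled, and decoding is done sequentially rather than with the multiplexers used in the router.

```python
# Parts 0, 6 and 7 of Router 9 in a (4 x 4) mesh, in row-wise order,
# as listed in the text for a packet sent eastward.
EAST_OF_NODE_9 = [[2, 3, 6, 7], [14, 15], [10, 11]]


def compress(dests, part_nodes):
    """Build [compression bit, P_i, P_j, P_k, destination bits of non-empty parts]."""
    dests = set(dests)
    flags = [int(any(node in dests for node in nodes)) for nodes in part_nodes]
    bits = [1] + flags  # 4 bits of constant overhead
    for flag, nodes in zip(flags, part_nodes):
        if flag:  # parts whose signal is 0 contribute no destination bits
            bits += [int(node in dests) for node in nodes]
    return bits


def decompress(bits, part_nodes):
    """Recover the destination set; bit positions shift with the part signals."""
    if bits[0] != 1:
        raise ValueError("this sketch only handles compressed headers")
    flags, position = bits[1:4], 4
    dests = set()
    for flag, nodes in zip(flags, part_nodes):
        if flag:
            for node in nodes:
                if bits[position]:
                    dests.add(node)
                position += 1
    return dests


if __name__ == "__main__":
    header = compress({10}, EAST_OF_NODE_9)
    print(header, len(header))
    print(decompress(header, EAST_OF_NODE_9))
    # With Routers 6 and 7 added, P_0 becomes 1 and the bit for Router 10 moves.
    print(compress({6, 7, 10}, EAST_OF_NODE_9))
```

For the Figure 10 example, the sketch produces the 6-bit header 1, 0, 0, 1, 1, 0. Adding Routers 6 and 7 as destinations grows the header to 10 bits and moves the bit for Router 10 from position 4 to position 8, which is the position shift that the decoding logic has to resolve.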
Figure 12 shows the hardware of the decompression stage in Router 10 in a (4 × 4) network for a packet sent from Router 9. The decompression logic needs to decode the destination bits from the header and forward them to the RPM logic for route computation. The destination bits for Routers 2, 3, 6 and 7 are located at bit numbers 4, 5, 6 and 7 if $P_0$ is 1; otherwise they are not present in the header. These four destination bits can be extracted from the header with four multiplexers. The destination bits for Routers 14 and 15 are located (if $P_6$ is set to 1) at bit numbers 8 and 9 or at bit numbers 4 and 5, depending on whether $P_0$ is set to 1 or 0 respectively. Hence, the extraction of these destination bits needs an extra stage of multiplexing. Similarly, the destination bits of Routers 10 and 11 can have four different locations and can be multiplexed using the combination of $P_0$ and $P_6$ as the selection signals. In the same way, the logic can be customized for every input port of each router.

Figure 12: An Implementation of Decode Component in Router 10 for Packets from Router 9

In some special cases, one packet header needs to carry a huge destination list. For example, suppose we have 1K nodes and only a 128-bit header. Even with compression, it is hard to put the whole destination list into one flit. At the source node, we can then use more than one flit to carry the destination list. However, as the packet proceeds to the next hop, according to RPM, it is replicated in different directions, and the original destination list is split into several parts to make a new header for each replica. In this way, the header of each replica does not contain a huge number of destinations and hence has many zeros, which makes it easier to compress. This procedure occurs recursively along the whole path until every destination receives a copy of the original packet. Detailed analysis of our compression scheme is in Section 5.

5 EXPERIMENTAL EVALUATION

We evaluate the area overhead of our
design, especially for the RC logic We then analyze the effectiveness of our compression scheme We also evaluate the performance of our multicast router design with different synthetic multicast workloads 51 Methodology We use a cycle-accurate network simulator that models all router pipeline delays and wire latencies We use Orion 2 [1] for area estimation Orion 2 simulator uses a recent model [16] and estimates the area of transistors and gates using the analysis in [2] The area depends on the technology-level and process-level input parameters Orion 2 simulator provides value estimates for inverters and 2- input AND and NOR gates The simulator also adds an additional 1% to the total area to account for global white space We model a link as 128 parallel wires, which takes advantage of abundant metal resources provided by future multi-layer interconnects On top of Uniform Random (UR), Bit Complement (BC) and Transpose (TP) unicast packets, our synthetic workloads have multicast packets For multicast packets, the destination numbers and positions are uniformly distributed, while unicast packets destinations are determined by three patterns (UR, BC, and TP) We also control the percentage of multicast packets Table 2 summarizes the simulated configurations 52 Compression Analysis In this part, the efficiency of the proposed header compression scheme is evaluated with different network sizes Here we define the header as the set of bits used to indicate the multicast destinations The number of destinations is randomly selected and varies from 1, which implies a unicast packet, to the total number of nodes which refers to broadcasting The original number of bits in the header is equal to the number of nodes in the network For instance, the number for a (4 4) network is 16 and for a 16x16 network

8 Table 2: Network Configuration!"## $ $ # % &' #( $ % $ "$ Characteristic Configuration Topology 8 8 Mesh or Mesh Routing RPM Virtual Channels/Port 4 Virtual Channel Depth(flits) 4 Packet Length(flits) 4 or 5 Traffic Pattern UR, BC and TP Multicast Packet Portion 1% Multicast Destination Number 2-16 (uniformly distributed) Simulation Warmup Cycles 1, Total Simulation Cycles 2, úöõõ úõõõ ùõõ øõõ õõ öõõ õ û ù û ù úö û úö úø û úø öü û öü ýö û ýö þ ÿ ÿ is 256 Without compression, the number remains fixed as the packet is replicated and forwarded along the network path Hence the average size of the uncompressed header is n 2 for a (n n) network On the other hand, the size of a compressed header can be different depending on the extent of compression Even at the source node different multicast packets can have different header sizes Furthermore, as one packet traverses, the destination list may be split into many parts, which means the replicated packet header will have a better chance to get compressed In our simulation, we consider this effect As clearly illustrated in Figure 13, the header size without any compression is bigger than the other two Compression at the source node yields around 45% reduction in the header size for all the network sizes except the (4 4) network which yields approximately 25% reduction The reduction from compression is predominantly due to the elimination of the bits for destinations in certain parts which are not included in some direction The (4 4) network is an exception because the 4-bit compression overhead becomes significant when the network is small This overhead is negligible in large networks Looking at Figure 13, compression at all routers yields more significant header reductions Comparing the results between the header size at the source and the average header size at all intermediate routers, it can be inferred that as the packet traverses to the downstream routers, there is more and more opportunity for compression due to 
the characteristics of RPM As a packet is forwarded downstream, RPM divides the destinations among the replicated packets in different directions, leading to the removal of certain other destination bits from the packet header Thus the number of destinations in each packet decreases Therefore, a lot of bits in the header tend to be zeros which provides a high chance for compression The compression rate ranges from 78% for a (8 8) network to 96% for a (32 32) network Figure 14 shows another interesting result of our compression scheme In this experiment, we fix the network size as (16 16) and keep changing the number of destinations The position of each destination is randomly selected As Figure 14 illustrates, even with large number of destinations, the average size of the header is not increasing One explanation can be from the number of replications We can see that the larger the number of destinations, the more the number of replications Each replication splits the header into more parts with less destinations in each of them thus providing more chance to do compression Figure 13: Comparison of Original Header and the Header after Compression for Different Network Sizes I GD H F DC E? 
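The header layout of the (4 × 4) example (a packet sent from Router 9 to Router 10) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the hardware design: the names PARTS, compress and decompress are hypothetical, the part-to-router mapping is copied from the worked example for this one link, and the loop stands in for what the decode component does with multiplexers. Uncompressed headers are not modeled.

```python
# Hypothetical layout for the link Router 9 -> Router 10 in a (4 x 4) mesh,
# copied from the worked example: (part id, routers covered by that part).
# Destinations outside these parts travel in other replicas of the packet.
PARTS = [(0, [2, 3, 6, 7]), (6, [14, 15]), (7, [10, 11])]


def compress(destinations):
    """Build a compressed header as a list of bits.

    Bit 0 is the compression bit, bits 1-3 are the part signals, and the
    destination bits of every part whose signal is 1 follow in part order.
    """
    dests = set(destinations)
    part_signals = [int(any(r in dests for r in routers)) for _, routers in PARTS]
    header = [1] + part_signals
    for signal, (_, routers) in zip(part_signals, PARTS):
        if signal:  # parts with signal 0 contribute no destination bits
            header += [int(r in dests) for r in routers]
    return header


def decompress(header):
    """Recover the destination set from a compressed header."""
    assert header[0] == 1, "this sketch only handles compressed headers"
    part_signals = header[1:1 + len(PARTS)]
    pos = 1 + len(PARTS)  # position of the first destination bit
    dests = set()
    for signal, (_, routers) in zip(part_signals, PARTS):
        if signal:
            for offset, router in enumerate(routers):
                if header[pos + offset]:
                    dests.add(router)
            pos += len(routers)  # bits of later parts shift accordingly
    return dests


example = compress({10})
print(example, len(example))                      # [1, 0, 0, 1, 1, 0] 6
print(sorted(decompress(compress({6, 7, 10}))))   # [6, 7, 10]
```

With Router 10 as the only destination the sketch reproduces the 6-bit header of the example. Adding Routers 6 and 7 sets P0 to 1, which moves the bits for Routers 10 and 11 from positions 4, 5 to positions 8, 9; this shift is the reason the decoder has to select among several candidate positions.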
Figure 14: Header Compression with Different Number of Destinations in a (16 × 16) Network

5.3 Area and Power Analysis

In our area analysis, we focus on the RC component since it scales with the number of routers and increases dramatically in size with lookahead routing. We evaluate dynamic and static power consumption for the RC component with 5% switching activity and 1V supply voltage in 65 nm technology. Table 3 shows the area of the partitioning logic without lookahead routing for different network sizes. From the table, we can see that the area of the RC logic increases linearly with network size. The RC logic is implemented using NOR, inverter and AND gates. The number of inverter and NOR gates is almost the same as the total number of routers in the network. The RPM direction calculation logic adds a constant to the total number of gates.

Table 3: Number of Gates for Basic Multicast RC (columns: Network Size, NOR gates, INV gates, AND gates)

With lookahead routing enabled, if the area optimization is not used, the RC logic needs to be replicated three times and hence the area triples. However, in our new RC design, we divide the network more finely into 24 parts instead of 8. We can generate the original 8 part signals for each next-hop router from the same 24 parts and achieve logic reuse. The area comparison between the two designs above is given in Table 4. The table clearly shows the area reduction with our RC design. The area reduction is more prominent as the size of the mesh increases.

Table 4: Area of Different RC Designs (columns: Network Size, Area of basic RC (µm²), Area of basic RC with lookahead routing (µm²), Area of RC with lookahead routing and optimization (µm²))

Figure 15 illustrates the power consumption of the RC component. Since there are fewer logic gates in the optimized lookahead RC design, both static and dynamic power are saved.

Figure 15: RC Power Analysis (static and dynamic power in µW for the basic and optimized designs)

Figure 16 shows the comparison of power consumption per packet between headers with no compression and with compression. In this part, we evaluate the power for the Uniform Random traffic pattern in a (16 × 16) mesh network. The power value is obtained before the network is saturated. The power of the router with header compression is normalized to that of the router without header compression. We can see that header compression saves router power by 2% since it reduces the packet header size.

Figure 16: Power Analysis of Header Compression (normalized power in mW/pkt versus workload in flits/cycle/node)

5.4 Performance

Figure 17 (a), (b) and (c) summarize the performance results of lookahead routing in an (8 × 8) network for the three synthetic traffic patterns. In an (8 × 8) network, there are 64 nodes. Considering that the link width is 128 bits, even without header compression, the entire destination list fits into one header flit. Therefore, in an (8 × 8) network, lookahead routing is independent of header compression. Lookahead routing removes one stage from the router pipeline, improving the packet latency but having little effect on network throughput.

To study the effect of header compression, we choose a (16 × 16) mesh network, where there are 256 nodes. Since the flit size is 128 bits, we need two flits to carry the whole destination list. Moreover, routing computation cannot be done in one cycle because it takes at least two cycles to receive all the destination information. However, with header compression, most of the time the destination list fits into one flit. Even if the destination list cannot be compressed into one flit at the source node, as the packet proceeds to the next hop, the original destination list is normally split among multiple replicas. Each replica carries only a part of the original set of destinations and is therefore easier to compress. Figure 17 (d), (e) and (f) show that the design with header compression and lookahead routing has the best performance for all three traffic patterns. Without compression, lookahead routing is inefficient because the downstream router needs to wait more than one cycle to get the whole destination list, which negates the benefit of lookahead routing.

Figure 17: Performance of Multicast Routing with Header Compression for Three Synthetic Traffic Patterns. Panels: (a) UR (8 × 8), (b) BC (8 × 8), (c) TP (8 × 8), (d) UR (16 × 16), (e) BC (16 × 16), (f) TP (16 × 16); vertical axis: Latency (cycle); series in (d) to (f): non-compress_non-lookahead, compress_non-lookahead, compress_lookahead.

6. CONCLUSIONS

The prevalent use of NOCs in current CMP architectures indicates that it is important for NOCs to support multicast traffic. In this paper, we explore the details of the multicast router design, especially the RC logic. We propose an efficient RC design that eliminates redundant gates while implementing lookahead routing. We also design a novel header compression scheme for multicast packets. With this header compression, lookahead multicast routing in a large network becomes efficient. Simulation results show that with the RC logic design, providing lookahead routing in multicast routers costs less than 2% area overhead over non-lookahead routing, and this percentage keeps decreasing with larger network sizes. Compared with the basic lookahead routing design, our design saves area by over 5%. With header compression and lookahead multicast routing, the network performance can be improved by over 22% in a large network.

7. REFERENCES

[1] A. R. Alameldeen and D. A. Wood, Adaptive Cache Compression for High-Performance Processors, in ISCA, 2004.
[2] L. Benini and G. D. Micheli, Networks on Chips: A New SoC Paradigm, IEEE Computer, vol. 35, no. 1, pp. 70-78, 2002.
[3] D. Citron and L. Rudolph, Creating a Wider Bus Using Caching Techniques, in HPCA, 1995, pp. 90-99.
[4] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[5] M. Galles, Scalable pipelined interconnect for distributed endpoint routing: The SGI SPIDER chip, in Proceedings of Hot Interconnect 4, 2009.
[6] P. Gratz, C. Kim, R. G. McDonald, S. W. Keckler, and D. Burger, Implementation and Evaluation of On-Chip Network Architectures, in ICCD, 2006.
[7] P. Gratz, K. Sankaralingam, H. Hanson, P. Shivakumar, R. G. McDonald, S. W. Keckler, and D. Burger, Implementation and Evaluation of a Dynamically Routed Processor Operand Network, in Proceedings of NOCS, 2007, pp. 7-17.
[8] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, A 5-GHz Mesh Interconnect for a Teraflops Processor, IEEE Micro, vol. 27, no. 5, pp. 51-61, 2007.
[9] N. E. Jerger, L.-S. Peh, and M. H. Lipasti, Virtual Circuit Tree Multicasting: A Case for On-chip Hardware Multicast Support, in Proceedings of ISCA, 2007.
[10] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration, in DATE, 2009.
[11] A. Kumar, P. Kundu, A. P. Singh, L.-S. Peh, and N. K. Jha, A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS, in ICCD, 2007, pp. 63-70.
[12] R. Marculescu, Ü. Y. Ogras, L.-S. Peh, N. D. E. Jerger, and Y. V. Hoskote, Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives, IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 28, no. 1, pp. 3-21, 2009.
[13] M. M. K. Martin, M. D. Hill, and D. A. Wood, Token Coherence: Decoupling Performance and Correctness, in ISCA, 2003.
[14] S. Rodrigo, J. Flich, J. Duato, and M. Hummel, Efficient Unicast and Multicast Support for CMPs, in Proceedings of MICRO-41, 2008.
[15] M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal, Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architecture, in Proceedings of HPCA, 2003.
[16] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, CACTI 5.1, HP Laboratories, Tech. Rep., 2008.
[17] L. Wang, Y. Jin, H. Kim, and E. J. Kim, Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip, in Proceedings of the 3rd International Symposium on Networks-on-Chip, 2009.
[18] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. B. III, and A. Agarwal, On-Chip Interconnection Architecture of the Tile Processor, IEEE Micro, vol. 27, no. 5, pp. 15-31, 2007.
[19] J. Yang, Y. Zhang, and R. Gupta, Frequent Value Compression in Data Caches, in MICRO, 2000.
[20] H. Yoshida, K. De, and V. Boppana, Accurate Pre-layout Estimation of Standard Cell Characteristics, in DAC, 2004.


More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

Packet Switch Architecture

Packet Switch Architecture Packet Switch Architecture 3. Output Queueing Architectures 4. Input Queueing Architectures 5. Switching Fabrics 6. Flow and Congestion Control in Sw. Fabrics 7. Output Scheduling for QoS Guarantees 8.

More information

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design Zhi-Liang Qian and Chi-Ying Tsui VLSI Research Laboratory Department of Electronic and Computer Engineering The Hong Kong

More information

Efficient Multicast Communication using 3d Router

Efficient Multicast Communication using 3d Router International Journal of Emerging Engineering Research and Technology Volume 3, Issue 12, December 2015, PP 38-49 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Efficient Multicast Communication using

More information

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER A Thesis by SUNGHO PARK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements

More information

Interconnection Networks

Interconnection Networks Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact

More information

Dynamic Stress Wormhole Routing for Spidergon NoC with effective fault tolerance and load distribution

Dynamic Stress Wormhole Routing for Spidergon NoC with effective fault tolerance and load distribution Dynamic Stress Wormhole Routing for Spidergon NoC with effective fault tolerance and load distribution Nishant Satya Lakshmikanth sailtosatya@gmail.com Krishna Kumaar N.I. nikrishnaa@gmail.com Sudha S

More information

Adaptive Data Compression for High-Performance Low-Power On-Chip Networks

Adaptive Data Compression for High-Performance Low-Power On-Chip Networks Adaptive Data Compression for High-Performance Low-Power On-Chip Networks Yuho Jin Ki Hwan Yum Eun Jung Kim Department of Computer Science Department of Computer Science Texas A&M University University

More information

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on on-chip Architecture Avinash Kodi, Ashwini Sarathy * and Ahmed Louri * Department of Electrical Engineering and

More information

A Novel 3D Layer-Multiplexed On-Chip Network

A Novel 3D Layer-Multiplexed On-Chip Network A Novel D Layer-Multiplexed On-Chip Network Rohit Sunkam Ramanujam University of California, San Diego rsunkamr@ucsdedu Bill Lin University of California, San Diego billlin@eceucsdedu ABSTRACT Recently,

More information

FT-Z-OE: A Fault Tolerant and Low Overhead Routing Algorithm on TSV-based 3D Network on Chip Links

FT-Z-OE: A Fault Tolerant and Low Overhead Routing Algorithm on TSV-based 3D Network on Chip Links FT-Z-OE: A Fault Tolerant and Low Overhead Routing Algorithm on TSV-based 3D Network on Chip Links Hoda Naghibi Jouybari College of Electrical Engineering, Iran University of Science and Technology, Tehran,

More information

Design and Implementation of Buffer Loan Algorithm for BiNoC Router

Design and Implementation of Buffer Loan Algorithm for BiNoC Router Design and Implementation of Buffer Loan Algorithm for BiNoC Router Deepa S Dev Student, Department of Electronics and Communication, Sree Buddha College of Engineering, University of Kerala, Kerala, India

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers Stavros Volos, Ciprian Seiculescu, Boris Grot, Naser Khosro Pour, Babak Falsafi, and Giovanni De Micheli Toward

More information

An Energy-Efficient Reconfigurable NoC Architecture with RF-Interconnects

An Energy-Efficient Reconfigurable NoC Architecture with RF-Interconnects 2013 16th Euromicro Conference on Digital System Design An Energy-Efficient Reconfigurable NoC Architecture with RF-Interconnects Majed ValadBeigi, Farshad Safaei and Bahareh Pourshirazi Faculty of ECE,

More information

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel

More information

Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC

Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC BWCCA 2010 Fukuoka, Japan November 4-6 2010 Architecture and Design of Efficient 3D Network-on-Chip for Custom Multi-Core SoC Akram Ben Ahmed, Abderazek Ben Abdallah, Kenichi Kuroda The University of Aizu

More information

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology http://dx.doi.org/10.5573/jsts.014.14.6.760 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.14, NO.6, DECEMBER, 014 A 56-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology Sung-Joon Lee

More information

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University *CMU

More information

Energy Efficient and Congestion-Aware Router Design for Future NoCs

Energy Efficient and Congestion-Aware Router Design for Future NoCs 216 29th International Conference on VLSI Design and 216 15th International Conference on Embedded Systems Energy Efficient and Congestion-Aware Router Design for Future NoCs Wazir Singh and Sujay Deb

More information

Buffer Pocketing and Pre-Checking on Buffer Utilization

Buffer Pocketing and Pre-Checking on Buffer Utilization Journal of Computer Science 8 (6): 987-993, 2012 ISSN 1549-3636 2012 Science Publications Buffer Pocketing and Pre-Checking on Buffer Utilization 1 Saravanan, K. and 2 R.M. Suresh 1 Department of Information

More information

MinRoot and CMesh: Interconnection Architectures for Network-on-Chip Systems

MinRoot and CMesh: Interconnection Architectures for Network-on-Chip Systems MinRoot and CMesh: Interconnection Architectures for Network-on-Chip Systems Mohammad Ali Jabraeil Jamali, Ahmad Khademzadeh Abstract The success of an electronic system in a System-on- Chip is highly

More information

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger Interconnection Networks: Flow Control Prof. Natalie Enright Jerger Switching/Flow Control Overview Topology: determines connectivity of network Routing: determines paths through network Flow Control:

More information

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics Lecture 14: Large Cache Design III Topics: Replacement policies, associativity, cache networks, networking basics 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that simultaneously

More information

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC QoS Aware BiNoC Architecture Shih-Hsin Lo, Ying-Cherng Lan, Hsin-Hsien Hsien Yeh, Wen-Chung Tsai, Yu-Hen Hu, and Sao-Jie Chen Ying-Cherng Lan CAD System Lab Graduate Institute of Electronics Engineering

More information

JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS

JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS 1 JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS Shabnam Badri THESIS WORK 2011 ELECTRONICS JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS

More information

Connection-oriented Multicasting in Wormhole-switched Networks on Chip

Connection-oriented Multicasting in Wormhole-switched Networks on Chip Connection-oriented Multicasting in Wormhole-switched Networks on Chip Zhonghai Lu, Bei Yin and Axel Jantsch Laboratory of Electronics and Computer Systems Royal Institute of Technology, Sweden fzhonghai,axelg@imit.kth.se,

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

SigNet: Network-on-Chip Filtering for Coarse Vector Directories

SigNet: Network-on-Chip Filtering for Coarse Vector Directories SigNet: Network-on-Chip Filtering for Coarse Vector Directories Natalie Enright Jerger Department of Electrical and Computer Engineering University of Toronto Toronto, ON enright@eecg.toronto.edu Abstract

More information

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3 IJSRD - International Journal for Scientific Research & Development Vol. 2, Issue 02, 2014 ISSN (online): 2321-0613 Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek

More information

LOW POWER REDUCED ROUTER NOC ARCHITECTURE DESIGN WITH CLASSICAL BUS BASED SYSTEM

LOW POWER REDUCED ROUTER NOC ARCHITECTURE DESIGN WITH CLASSICAL BUS BASED SYSTEM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 5, May 2015, pg.705

More information

A Domain-Specific On-Chip Network Design for Large Scale Cache Systems

A Domain-Specific On-Chip Network Design for Large Scale Cache Systems A Domain-Specific On-Chip Network Design for Large Scale Cache Systems Yuho Jin Eun Jung Kim Department of Computer Science Texas A&M University {yuho,ejkim}@cs.tamu.edu Ki Hwan Yum Department of Computer

More information

Lecture 23: Router Design

Lecture 23: Router Design Lecture 23: Router Design Papers: A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, ISCA 06, Penn-State ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

HANDSHAKE AND CIRCULATION FLOW CONTROL IN NANOPHOTONIC INTERCONNECTS

HANDSHAKE AND CIRCULATION FLOW CONTROL IN NANOPHOTONIC INTERCONNECTS HANDSHAKE AND CIRCULATION FLOW CONTROL IN NANOPHOTONIC INTERCONNECTS A Thesis by JAGADISH CHANDAR JAYABALAN Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of

More information

Bandwidth Aware Routing Algorithms for Networks-on-Chip

Bandwidth Aware Routing Algorithms for Networks-on-Chip 1 Bandwidth Aware Routing Algorithms for Networks-on-Chip G. Longo a, S. Signorino a, M. Palesi a,, R. Holsmark b, S. Kumar b, and V. Catania a a Department of Computer Science and Telecommunications Engineering

More information

Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support

Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti Electrical and Computer Engineering Department, University of Wisconsin-Madison

More information

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.1, FEBRUARY, 2015 http://dx.doi.org/10.5573/jsts.2015.15.1.077 Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network

More information

Pointers & Arrays. CS2023 Winter 2004

Pointers & Arrays. CS2023 Winter 2004 Pointers & Arrays CS2023 Winter 2004 Outcomes: Pointers & Arrays C for Java Programmers, Chapter 8, section 8.12, and Chapter 10, section 10.2 Other textbooks on C on reserve After the conclusion of this

More information

MULTIPATH ROUTER ARCHITECTURES TO REDUCE LATENCY IN NETWORK-ON-CHIPS. A Thesis HRISHIKESH NANDKISHOR DESHPANDE

MULTIPATH ROUTER ARCHITECTURES TO REDUCE LATENCY IN NETWORK-ON-CHIPS. A Thesis HRISHIKESH NANDKISHOR DESHPANDE MULTIPATH ROUTER ARCHITECTURES TO REDUCE LATENCY IN NETWORK-ON-CHIPS A Thesis by HRISHIKESH NANDKISHOR DESHPANDE Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment

More information

Design and Implementation of Multistage Interconnection Networks for SoC Networks

Design and Implementation of Multistage Interconnection Networks for SoC Networks International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.2, No.5, October 212 Design and Implementation of Multistage Interconnection Networks for SoC Networks Mahsa

More information

On-Die Interconnects for next generation CMPs

On-Die Interconnects for next generation CMPs On-Die Interconnects for next generation CMPs Partha Kundu Corporate Technology Group (MTL) Intel Corporation OCIN Workshop, Stanford University December 6, 2006 1 Multi- Transition Accelerating We notified

More information

Lecture 7: Flow Control - I

Lecture 7: Flow Control - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 7: Flow Control - I Tushar Krishna Assistant Professor School of Electrical

More information

Latency Criticality Aware On-Chip Communication

Latency Criticality Aware On-Chip Communication Latency Criticality Aware On-Chip Communication Zheng Li, Jie Wu, Li Shang, Robert P. Dick, and Yihe Sun Tsinghua National Laboratory for Information Science and Technology, Inst. of Microelectronics,

More information

Destination-Based Adaptive Routing on 2D Mesh Networks

Destination-Based Adaptive Routing on 2D Mesh Networks Destination-Based Adaptive Routing on 2D Mesh Networks Rohit Sunkam Ramanujam University of California, San Diego rsunkamr@ucsdedu Bill Lin University of California, San Diego billlin@eceucsdedu ABSTRACT

More information

342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH /$ IEEE

342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH /$ IEEE 342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH 2009 Custom Networks-on-Chip Architectures With Multicast Routing Shan Yan, Student Member, IEEE, and Bill Lin,

More information