
High-Performance Crossbar Designs for Network-on-Chips (NoCs)

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Yixuan Zhang

August 2010

© 2010 Yixuan Zhang. All Rights Reserved.

This thesis titled High-Performance Crossbar Designs for Network-on-Chips (NoCs) by YIXUAN ZHANG has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Avinash Kodi
Assistant Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean of the Russ College of Engineering and Technology

Abstract

ZHANG, YIXUAN, M.S., August 2010, Electrical Engineering
High-Performance Crossbar Designs for Network-on-Chips (NoCs) (60 pp.)
Director of Thesis: Avinash Kodi

The packet-switched Network-on-Chip (NoC) architecture is considered an attractive approach for overcoming bottlenecks such as wire delay and scalability seen in Chip-Multiprocessors (CMPs). However, current NoC designs rely heavily on in-router buffers, which are a major source of power consumption. Eliminating all router input buffers would result in dropped or deflected packets when conflicts are frequent, thereby increasing power consumption. In this thesis, with both enhancing performance and decreasing power consumption as our goals, we propose an innovative dual-crossbar design called DXbar, which combines the advantages of bufferless networks and traditional buffered networks to enable low-latency routing at low network load, with limited buffering capability to handle excess packets at high network load. The DXbar design not only has superior performance compared to state-of-the-art designs, but also achieves significant power savings. Simulation results for the proposed methodology show that DXbar achieves at least a 15% performance improvement and power saving over the baseline designs. Additionally, DXbar outperforms currently proposed bufferless designs, with a large decrease in power consumption when network utilization is high. To overcome the area overhead, an innovative crossbar design named the dual-directional crossbar is also proposed in this thesis. This crossbar provides the same connectivity as the dual crossbar while occupying only half the area.

Approved: Avinash Kodi, Assistant Professor of Electrical Engineering and Computer Science

Acknowledgements

I would like to sincerely thank my parents, who provided me the chance to obtain my bachelor's degree at one of the most prestigious universities in my home country, which in turn helped my dream of studying in the United States come true. I would also like to thank my advisor, Dr. Avinash Kodi, an undoubtedly responsible and helpful professor, who has provided me countless guidance throughout my years in graduate school. Additionally, I would like to thank Randy W. Morris, Jr., my colleague and a senior student in the lab, who acquainted me with the working environment and helped me start my project. Special thanks to my beloved fiancée, Yu Song, an MPA student at Ohio State University; her care and love played an important role in helping me adapt to life in this country. Finally, I would like to thank Ohio University for giving me the opportunity to study for my master's degree, and the National Science Foundation for generously sponsoring my scholarship.

Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
  Chip-Multiprocessors (CMPs)
  Network-on-Chips (NoCs)
  Related Work
  Proposed Approach
  Organization of Thesis
2 Background on Network-on-Chips
  Terminology
  Topology
    Ring
    Mesh
    Torus
    Cmesh
    Flattened Butterfly
  Router Architecture
  Pipeline Stages
  Flow Control
  Routing Algorithm
3 DXbar: A Power-Efficient and High-Performance Crossbar Design
  Introduction of DXbar Architecture
  Router Implementation
  Walkthrough Examples
  Fairness Maintenance
  Fault Tolerance
  Dual-Directional Crossbar
4 Performance Evaluation
  Methodology
  Area and Power Estimation
  Synthetic Traffic
  4.4 Real-Application Traces
  Fault Tolerance
5 Conclusion and Future Work
References

List of Tables

Processor parameters used for Splash-2 suite simulations
Cache and memory parameters used for Splash-2 suite simulations
Area and power estimation for different designs in an 8×8 2D mesh network. The area of one 2×2 crossbar is 9.8 µm. The crossbar power is mW, the 2×2 crossbar power is mW, and the link power is mW. Power values are for one flit traversal.

List of Figures

The trend of uniprocessor performance improvement since 1978 [1]
The trend of Intel uniprocessor power of several generations [2]
A network using ring topology
A network using (a) mesh (b) torus topology
A network using (a) Cmesh (b) Flattened Butterfly topology
The architecture of a generic router
(a) A 5-stage router pipeline and timing for different flits in a packet. (b) A 4-stage router pipeline with look-ahead routing. (c) A 4-stage router pipeline with speculative SA. (d) A 3-stage router pipeline with both look-ahead routing and speculative SA
An illustration of credit-based flow control. After sending flit 0, router 1 has to halt the sending of flit 1 and wait for a credit from router 2. When flit 0 departs from router 2, a credit is sent back from router 2 to router 1. The sending of flit 1 from router 1 to router 2 resumes after router 1 receives the credit [3]
(a) The path from source node (0, 0) to destination node (2, 2) using the dimension-order routing algorithm (b) One possible path from source node (0, 0) to destination node (2, 2) using the west-first adaptive routing algorithm
The architecture of a DXbar router
The pipeline of a DXbar router
The illustration of the dual xbar: (a) No conflicts appear; the network operates as a bufferless network. (b) A conflict appears; one flit is sent to the secondary crossbar and buffered first. (c) The flit coming after the buffered flit sees no delay and can proceed without being buffered, ensuring that no instant back pressure occurs. The injection port can send a flit whenever the desired output port is not occupied. (d) Even though another flit is coming in through the same input port, the buffered flit can proceed when the desired output port is not occupied. The incoming flit can proceed simultaneously, encountering no delay
The architecture of a DXbar router with fault tolerance capability
(a) The structure of a dual-directional crossbar. (b) An example showing how a signal can enter from both directions. Buffers colored grey are turned off
(a) The architecture of a DXbar router with dual-directional crossbar. (b) An example showing a small crossbar altering the two flits. Without the small crossbars, the routes of the displayed flits would overlap on the same input line of the crossbar, so the flits could not progress together
Throughput of Uniform Random traffic pattern

4.2 Latency of (a) Uniform Random (b) Non-Uniform Random (c) Complement (d) Perfect Shuffle traffic patterns
Power of (a) Uniform Random (b) Non-Uniform Random (c) Complement (d) Perfect Shuffle traffic patterns
Throughput at an offered load of 0.5 for all synthetic traces
Power at an offered load of 0.5 for all synthetic traces
Normalized simulation time of all SPLASH-2 traces
Power of all SPLASH-2 traces
Throughput of fault-tolerant DXbar with (a) deterministic (b) adaptive routing algorithms
Latency of fault-tolerant DXbar with (a) deterministic (b) adaptive routing algorithms
Power of fault-tolerant DXbar with (a) deterministic (b) adaptive routing algorithms

1 Introduction

1.1 Chip-Multiprocessors (CMPs)

During the past several decades, the semiconductor industry has developed and grown at an exponential pace [1]. Ever-shrinking clock cycles coupled with transistor scaling have resulted in high-performance, complex processing units [4]. High clock rates are achieved by technology scaling and by the use of pipelining [2]. Pipelining divides the execution of an instruction into smaller parts which can run faster [5]. Another method for increasing processor performance is instruction-level parallelism (ILP), which overlaps the execution of several instructions in the same clock cycle by fully utilizing all the functional units [5]. Taken together, pipelining, faster clock rates, and ILP provide significant performance improvements in uniprocessor designs.

In recent years, the pace of increasing processor performance through traditional approaches has slowed [1, 4]. Figure 1.1 shows the performance of an individual processor over the period from 1978 to 2006 [1]. It is clear that the annual increase in processor performance has declined from 52% between 1986 and 2002 to its present value of 25%. This is partially caused by the fact that shrinking transistor sizes have increased leakage current, crosstalk, and clock skew, preventing processors from working at higher operating frequencies [1, 4, 6, 7]. Power dissipation has become another critical factor contributing to the slower improvement rate in processor speeds [2]. This increase in power dissipation is attributed to the growing number of transistors due to technology scaling, combined with disproportionate scaling of transistor power due to increased leakage current [2]. The maximum clock rate is also affected by power dissipation, as the overall power consumption of a digital circuit is directly proportional to frequency. The dynamic power

Figure 1.1: The trend of uniprocessor performance improvement since 1978 [1].

consumption of a digital circuit is given by [5]:

P ∝ C V² f  (1.1)

where P is the total power, C is the capacitance, V is the operating voltage, and f is the frequency. Another problem with increased power dissipation is the generation of heat, as a silicon chip can dissipate a maximum power of about 200 Watts [8]. As a result, the performance of a processor depends on how well it manages and dissipates power. Two good examples illustrating the importance of power dissipation are the Intel Tejas processor [9] and the 4-GHz Pentium 4 processor [10], both of which were canceled due to their tremendous power consumption. It should be noted that previous generations of Intel uniprocessors achieved better performance at the cost of higher power dissipation, as shown in Figure 1.2 [2].

The limitations seen in uniprocessor designs led to a new era in computer architecture, in which engineers switched from designing more sophisticated uniprocessors to exploring

Figure 1.2: The trend of Intel uniprocessor power over several generations [2].

the benefits of placing multiple simpler cores on the same chip. Chip-Multiprocessors (CMPs) provide a power-efficient and cost-efficient method by placing multiple cores on the same chip [11]. In addition, replicating cores on a processor is a known method for increasing processor performance within the allowed power budget of modern processors [5, 12]. Moreover, CMPs have been thoroughly studied by researchers for high-performance computing, and their benefits are well known. There are several implementations of high-performance CMPs designed by leading institutions and companies. RAW, developed by MIT, is a 16-core CMP implemented in the IBM SA-27E ASIC process in 150 nm technology [13-15]. Intel TeraFlops is an 80-

core CMP designed in 65 nm technology [16]. Sun Microsystems designed the 16-core Niagara and Rock CMPs, which are mainly used for data processing [17, 18].

1.2 Network-on-Chips (NoCs)

As the number of cores increases, the interconnection network between cores plays a leading role in a CMP's overall performance [19-22]. Traditionally, shared buses have been used as the communication fabric for CMPs with a small number of cores. As transistors continue to scale, resulting in higher clock frequencies, the number of clock cycles required to traverse from one end of the chip to the other increases because the length of the chip remains the same [23]. This results in a bus speed slower than the potential clock frequency of the cores, as the delay of the cores depends on local wires, which scale with transistors [24]. Because simple shared buses are unsuitable for future CMPs with hundreds to thousands of cores [25], Network-on-Chips (NoCs) have emerged as a promising solution for designing the interconnection fabric of CMPs [19-21, 26-29]. NoCs provide a low-latency, low-power, high-bandwidth communication medium for CMPs by exploiting the benefits of a modular packet-switched network. The key components of a NoC are routers and links. A router is responsible for receiving and forwarding communication traffic, and a link provides a point-to-point connection between two routers. Each node in a NoC is directly connected to a limited number of neighboring nodes, so the wires in a link are short and can achieve high speed.

In today's NoC field, researchers continuously confront several major challenges: reducing power dissipation in the network, improving performance, and reducing area overhead. The power consumed by modern NoCs constitutes a large portion of total chip power. For example, 28% of the total chip power of the Intel TeraFLOPS processor is spent on communication, while the expectation was only 10%

[16]. Moreover, the interconnection network of MIT's RAW processor consumes 36% of the total chip power [29, 30]. Power dissipation in modern NoC architectures is mainly characterized by the power consumed in the links, crossbars, and input buffers. Amongst these components, input buffers alone can consume up to 35% of the total power of the whole interconnection network [29]. As a result, reducing the size of input buffers or completely eliminating them is a natural approach to designing low-power NoCs. On the other hand, increasing the size of an input buffer allows more upstream packets to be transmitted and buffered, thus increasing throughput [3]. Therefore, simply reducing the size of input buffers in each router may degrade performance, as the performance of an interconnection network is primarily affected by buffer size.

1.3 Related Work

Different types of techniques have been proposed to reduce the size of input buffers or to eliminate them outright. Recently, iDEAL (inter-router Dual-function Energy and Area-efficient Links) proposed to reduce the size of the input buffers and utilize repeaters along inter-router channels as storage units when needed, by appropriately controlling the dual-functional repeaters [31]. As a result, the size of the input buffers can be reduced without significantly reducing performance, because packets can be buffered along the links. The design maintains similar performance with only half the number of input buffers, yielding large area savings as well as power savings. The cost is increased latency due to Head-of-Line (HoL) blocking, and added complexity due to the extra control units. Another approach utilizing channel buffering is Elastic Channel Buffers (ECB), which replaces all the repeaters with flip-flops and eliminates the input buffers [32, 33]. In this design, the flip-flops along the links function as single-buffer queues, such that an ECB router operates similarly to a generic router. All packets are transmitted from one channel buffer

to the next by a handshaking protocol, which means that packets waste a clock cycle each time they traverse a flip-flop. This is because in edge-triggered flip-flops, signals need a full cycle to go from input to output. This clock-based nature of flip-flops results in excessive link delay even when all the buffers along the channels are unoccupied.

Bufferless routing is another novel approach, which eliminates all input buffers without utilizing channel buffering. Flit-BLESS proposed a routing scheme that sends all incoming packets to output ports, irrespective of whether those output ports are productive [34]. The age-based priority for arbitration means that the oldest incoming packet is guaranteed to be routed to its most productive output port, while younger packets may be deflected to non-productive output ports and take a non-minimal number of hops before reaching their destinations. Due to the full adaptivity of Flit-BLESS, each packet carries several desired output ports that compete in switch allocation. The excessive computation required to find the final allocation limits the performance of the switch allocator, so routers in Flit-BLESS can only operate at a lower frequency than other NoCs. Another design, which improves upon bufferless routing, is SCARAB (Single Cycle Adaptive Routing and Bufferless Network) [35]. In this design, packets are routed minimally and adaptively in order to ease the workload of switch allocation. If none of the productive output ports are available, the packet is dropped, and a NACK signal is transmitted through a dedicated circuit-switched NACK network to trigger a retransmission. Without the dedicated NACK network, a portion of the bandwidth of the data network would be consumed by NACK packets, which could lead to data packets being blocked or retransmissions being delayed. Both designs achieve substantial power savings and latency reduction at low network load.
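The age-based deflection arbitration used by such bufferless routers can be sketched as follows. This is a simplified illustrative model, not the Flit-BLESS implementation; the `Flit` fields and the tie-breaking rule (deflect to the lowest-numbered free port) are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flit:
    id: int
    age: int                  # cycles since injection; larger = older
    productive_ports: tuple   # output ports that move the flit toward its destination

def deflection_route(flits, num_ports=4):
    """Assign every incoming flit to some output port, with no buffering.

    Oldest flit first: it gets a productive port if one is free; younger
    flits losing the arbitration are deflected to any remaining free port
    rather than being stored (there are no input buffers to wait in)."""
    assignments = {}
    free = set(range(num_ports))
    for flit in sorted(flits, key=lambda f: f.age, reverse=True):
        productive = [p for p in flit.productive_ports if p in free]
        port = productive[0] if productive else min(free)  # deflection case
        assignments[flit.id] = port
        free.remove(port)
    return assignments
```

For example, if two flits both want port 0, the older one gets it and the younger one is deflected to another free port, taking a non-minimal path exactly as described above.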
At higher network load, however, conflicts are more frequent and the rate of deflections or retransmissions increases drastically. Lacking buffers in which to wait temporarily, packets in both designs must keep progressing in undesired directions or be dropped, thus experiencing many more link and crossbar

traversals. This leads to significantly higher power consumption for bufferless networks at high network load, and lower saturation throughput than a network of generic routers.

Another approach to saving power is to modify the crossbar. The segmented crossbar is a power-driven design in which each input line of a crossbar is divided into several segments by tri-state buffers [29]. When not required, part of an input line can be turned off by shutting down the tri-state buffers. Because the input lines become shorter, less energy is needed to drive them, thereby saving crossbar power. The interconnection network of the Intel TeraFlops processor uses a double-pumped crossbar [16, 36]. Consequently, the crossbar area is reduced by half, which in turn saves crossbar power by requiring less energy to drive. Nonetheless, the performance of these two designs is not improved.

Beyond power and area savings, other designs target performance enhancement through different approaches. A dynamic buffer-allocation design named ViChaR (dynamic Virtual Channel Regulator) focuses on efficiently allocating buffers to all virtual channels by deploying a unified buffering unit instead of a series of separate buffers [37]. Because HoL blocking is eliminated and buffers can be shared by all packets, the required size of the input buffers is minimized. The design achieves good performance while saving area, at the cost of significantly increased complexity in buffer management. Express Virtual Channels (EVCs) deploy special VCs which enable packets to bypass buffer operations in intermediate nodes [38]; that is, packets can traverse intermediate nodes with minimum delay in each router. Because buffer operations are less frequent, power is saved and latency is reduced. The problem is that routing becomes complicated, since packets must choose between taking an express VC and experiencing more hops, or taking a regular VC and encountering more delay. A low-cost router microarchitecture has been proposed that partitions the crossbar into two decoupled low-radix subcrossbars for the two dimensions and connects them with an intermediate buffer queue [39]. Packets traveling within their own dimension are prioritized and encounter minimal delay,

while turning packets must wait in the intermediate buffer queues until no more in-flight packets occupy the desired output ports. Similar to bufferless designs, this design yields satisfactory performance at low network load, while the intermediate buffer queues become the prominent bottleneck as more packets turn when the network load increases. Another design which utilizes a dual crossbar is the Distributed Shared-Buffer router [40]. In this design, buffer slots are sandwiched between two crossbars in order to rearrange packet sequences, so that conflicts can be avoided and packets can be switched more quickly. This dual-crossbar design targets improved throughput, yet increases power consumption considerably.

1.4 Proposed Approach

In this thesis, we propose DXbar, a high-performance and power-efficient dual-crossbar design targeting both performance enhancement and power reduction. The major contributions of this work are as follows:

We implement a primary crossbar with no input buffers. The bufferless network provides a simple and efficient switching function for packets, without any buffer operations. Packets traversing the primary crossbar need only one cycle to be switched. The elimination of buffers also results in decreased power consumption and latency. Compared to the baseline designs, the proposed approach reduces power by 15% and latency by 50% for the Uniform Random traffic pattern.

We also implement a secondary crossbar with buffers. Because of the secondary crossbar, packets do not need to take extra hops to resolve conflicts, so the power consumption at both low and high network load is reduced. Compared to bufferless networks at high network load, the proposed approach reduces power by at least 30%.

We utilize a dual-crossbar design to improve performance. Cycle-accurate simulation on an 8×8 2D mesh topology shows that DXbar has at least a 15% performance improvement over the baseline designs with synthetic traces and real-application traces from SPLASH-2. The proposed design is also able to outperform current bufferless networks.

We further evaluate the performance of hardware-level fault tolerance with fault-detecting units and control units deployed. The failure of either crossbar does not disable the router, and all packets can be routed through the working crossbar. The performance degradation is at most 10% when every node has one permanent crossbar failure.

1.5 Organization of Thesis

The rest of the thesis is organized as follows. Chapter 2 introduces the background of NoCs, including commonly used terms, router architecture, and router implementation, to provide the information necessary for understanding the proposed DXbar design. Chapter 3 illustrates the architecture of DXbar and how it works in detail, and covers several key features which expand the functionality of DXbar. Chapter 4 starts with an introduction of the methodology used in the performance analysis, and presents different types of evaluation results to illustrate the performance of the proposed DXbar design compared to other related designs. Chapter 5 concludes the thesis and states the future work which could lead to further performance improvement.
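The core conflict-resolution idea behind DXbar described in Section 1.4 can be sketched as follows. This is a deliberately simplified single-router model; the function and field names are illustrative assumptions, and the actual control logic is detailed in Chapter 3.

```python
def switch_flit(flit, primary_free_ports, secondary_buffer):
    """Decide how one flit crosses a DXbar-style router in a given cycle.

    If the flit's desired output port is free, it takes the bufferless
    primary crossbar and is switched in a single cycle. On a conflict it
    is diverted to the buffered secondary crossbar instead of being
    dropped or deflected, which avoids the extra link and crossbar
    traversals that make bufferless designs costly at high load."""
    port = flit["output_port"]
    if port in primary_free_ports:
        primary_free_ports.remove(port)   # claim the port for this cycle
        return "primary"
    secondary_buffer.append(flit)         # wait on the secondary crossbar
    return "secondary"
```

For example, two flits contending for the same output port in one cycle end up on different crossbars: the first takes the one-cycle primary path, and the second is buffered rather than dropped.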

2 Background on Network-on-Chips

This chapter introduces NoCs by covering fundamental background that is used to explain the proposed design in the next chapter. First, several commonly used terms in NoC design are listed and explained. Second, topology basics are introduced, and common topologies used in various NoCs are presented and compared. Finally, the architecture of a NoC router is illustrated, along with different flow control techniques and routing algorithms.

2.1 Terminology

Several terms are widely used in the NoC research field.

Node: A node is the combination of a processing element (PE) and a router.

Router: A router is the network interface block of a node. It receives information on its input ports and forwards it to the correct output ports.

Bandwidth: Bandwidth is the data rate between two adjacent nodes, and is one of the major parameters of a NoC affecting overall performance.

Flit: A flit is the basic unit of information which NoCs use to allocate channels and storage. The data width of a flit is the width of the channel, and a flit requires one cycle to transmit.

Packet: A packet is the unit of information transmitted among nodes. Packets are composed of one or multiple flits, depending on the packet type.

Latency: Latency is the time for a packet to traverse the network from its source node to its destination node, and is calculated as [3]

T = T_h + L/b  (2.1)

where T_h is the time for the head flit to traverse the network, L is the length of the packet, and b is the bandwidth. It is measured from the time at which the head flit is injected to the time at which the tail flit is ejected from the network.

Throughput: Throughput is the data rate that the network accepts per input port [3]. Saturation throughput is reached when throughput no longer increases with offered load.

Capacity: The capacity of the network is its ideal throughput on uniform traffic.

Power: Power is the amount of energy dissipated in both the routers and the channels when a packet traverses the network from its source core to its destination core. The power for one flit to traverse the network is calculated as [3]

P_flit = P_router × H + P_channel × H  (2.2)

where P_router is the power consumed by a router, P_channel is the power consumed by a channel, and H is the number of hops from the source core to the destination core. The total power to transmit a packet equals P_flit multiplied by the number of flits in the packet.

Livelock: Livelock means a packet never reaches its destination core, even though it is always progressing forward in the network.

Deadlock: Deadlock means a cyclic dependency in which every packet in the cycle is waiting for the packet in front of it to release its resource, and none of them can progress.

Deterministic Routing: A deterministic routing algorithm always routes packets along the same path between two nodes.
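As a small worked example of the latency and power definitions in Equations 2.1 and 2.2 (the numbers below are illustrative only, not measurements from this thesis):

```python
def packet_latency(t_head, packet_length, bandwidth):
    """Eq. 2.1: T = T_h + L/b, head-flit time plus serialization time."""
    return t_head + packet_length / bandwidth

def flit_power(p_router, p_channel, hops):
    """Eq. 2.2: P_flit = P_router*H + P_channel*H."""
    return p_router * hops + p_channel * hops

def packet_power(p_router, p_channel, hops, num_flits):
    """Total power for a packet: P_flit times the number of flits."""
    return flit_power(p_router, p_channel, hops) * num_flits

# With a 64-bit channel (b = 64 bits/cycle), a 256-bit packet, and a
# head flit taking 10 cycles: T = 10 + 256/64 = 14 cycles.
```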

Adaptive Routing: An adaptive routing algorithm can provide multiple paths for a packet to reach its destination. The choice among the possible paths is made based on the load of the network.

2.2 Topology

The topology of a NoC refers to the pattern in which nodes and channels are arranged. Different topologies are used in building NoCs, providing a variety of networks with different characteristics such as diameter, radix, and bisection bandwidth [3]. Diameter is the length of the longest route from any source node to any destination node. A larger diameter can mean more hops for a packet before reaching its destination node, and usually leads to higher latency. Radix is the number of input/output ports of each router, which indicates the complexity of the router architecture. A larger radix increases the connectivity of individual nodes and can reduce the diameter, but also increases the area overhead and power dissipation of the entire network. Bisection bandwidth is the aggregate bandwidth of the minimum number of wires that must be cut to partition the network in half. Larger bisection bandwidth can indicate better worst-case performance. The commonly used topologies are ring, mesh, torus, Cmesh, and flattened butterfly.

2.2.1 Ring

Figure 2.1 shows a ring topology with 8 nodes [3]. The routers in a ring network have a very low radix of only two. Moreover, the network is simple, and adding more nodes is easy. The major drawback of the ring topology is that when the network size becomes large, the diameter increases and packets must travel more hops. Furthermore, the bandwidth of a ring topology is limited, as each node has only two channels with which to connect to other nodes. As a result, the ring topology is suitable for small network sizes. An example of a ring topology is the NoC of the IBM Cell processor [41].
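The diameter metric defined above can be made concrete with small hop-count helpers for the topologies discussed in this section. This is a sketch; the closed-form expressions follow from the structure of each topology, not from the thesis text.

```python
def ring_diameter(n):
    """n-node ring: worst case is halfway around (either direction works)."""
    return n // 2

def mesh_diameter(k):
    """k x k mesh: opposite corners are (k - 1) hops in each dimension."""
    return 2 * (k - 1)

def torus_diameter(k):
    """k x k torus: wrap-around links halve the distance per dimension."""
    return 2 * (k // 2)
```

For the 16-node examples in this chapter, mesh_diameter(4) gives 6 and torus_diameter(4) gives 4, matching the diameters quoted for the mesh and torus in the text below.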

Figure 2.1: A network using the ring topology.

2.2.2 Mesh

Figure 2.2(a) shows a mesh topology with 16 nodes [3]. Each router is identified as (x, y), and is connected to four neighboring routers in the cardinal directions x+, x−, y+, and y−. The topology provides relatively high bandwidth with a moderate radix of four. The major drawback of the mesh topology is its high diameter. For example, the diameter of the network shown is six, from node (0, 0) to node (3, 3). Another drawback is that nodes at the center of the network are busier than nodes on the fringes, as all side-to-side traffic has to traverse the center of the network [3, 20]. The mesh topology is widely used in different NoCs; good examples are the Intel TeraFlops processor [16] and MIT's RAW processor [13-15].

2.2.3 Torus

Figure 2.2(b) shows a torus topology with 16 nodes [3]. The topology has a grid shape similar to the mesh topology, with additional wrap-around channels for nodes on the fringes. It yields a smaller diameter than a mesh topology of the same network size. For example, a torus with 16 nodes has a diameter of four. The extra channels also provide better path diversity, and alleviate the burden on the

Figure 2.2: A network using (a) mesh (b) torus topology.

center nodes [3, 20]. The major drawback is the length of the long wrap-around channels, which leads to higher power dissipation and latency. Folding the network can reduce this negative impact, but increases the difficulty of the manufacturing process [42]. The torus has been used extensively for off-chip networks, such as the SeaStar interconnect found in the Cray XT5m [43].

2.2.4 Cmesh

Figure 2.3(a) shows the Cmesh topology with 64 cores [20]. Every four cores are concentrated at one high-radix router to form a tile. The diameter is reduced, so packets need fewer hops to traverse the network, thereby decreasing latency. The major drawback is that a node is shared by multiple cores: when packets from different cores need to go to other nodes, the possibility of conflict increases, which reduces throughput.

2.2.5 Flattened Butterfly

Figure 2.3(b) shows a flattened butterfly topology with 64 nodes. Similar to the Cmesh topology, every four nodes are concentrated to form a tile. The difference is that a tile is connected not only to its neighbors, but also to all the other tiles in its row and column. Therefore, the diameter of a flattened butterfly network is reduced to two, and the capacity

Figure 2.3: A network using (a) Cmesh (b) Flattened Butterfly topology.

of the network is largely increased [44]. The major drawback is the high radix of the nodes, which grows quickly and becomes costly as the network size increases. In addition, the long channels which connect non-adjacent tiles may require several cycles for data to travel, and can consume a significant amount of power.

2.3 Router Architecture

A router in a NoC consists of the buffers, switches, and control units required to store and forward flits from the input ports to the desired output ports. The architecture is similar to that of modern Ethernet routers, but with smaller area and buffer size. Figure 2.4 shows a generic 5×5 NoC router of a mesh network with 16 buffer slots per input port. The buffer slots are divided into four queues, and each queue is called a virtual channel (VC) [3, 45]. There are four cardinal input and output ports connected from and to the x+, x−, y+, and y− directions. The last pair of input/output ports connects from and to the processing element (PE).

2.4 Pipeline Stages

NoC routers are pipelined at the flit level to better utilize all the control units and improve throughput. Figure 2.5(a) shows the pipeline of a generic 5-stage router and the timing of different flits in a packet [3]. The stages are: Routing Computation (RC), VC

Figure 2.4: The architecture of a generic router.

Allocation (VA), Switch Allocation (SA), Switch Traversal (ST), and Link Traversal (LT). RC, VA, and SA are performed by the routing computation unit, the VC allocator, and the switch allocator, respectively.

RC: When the head flit of a packet is stored in a virtual-channel, the routing information carried by the head flit is input to the router to determine the output port of the packet. Once the result is calculated, all of the flits in the same packet must use the same output port.

VA: When the output port is determined, the result is input to the virtual-channel allocator to assign a single output virtual-channel on the corresponding output port. If the

Figure 2.5: (a) A 5-stage router pipeline and timing for different flits in a packet. (b) A 4-stage router pipeline with look-ahead routing. (c) A 4-stage router pipeline with speculative SA. (d) A 3-stage router pipeline with both look-ahead routing and speculative SA.

allocation fails, the head flit needs to wait until the output port has a free VC to assign. The allocation is performed for the head flit only.

SA: When the output VC is assigned, per-packet operations are completed and switch allocation is performed flit-by-flit. All of the flits in a packet will consecutively bid for a single-flit time slot to traverse the switch.

ST: When the switch is allocated to a flit, the flit uses one cycle to traverse the switch to the desired output port.

LT: When the switch is traversed by a flit, the flit uses another cycle to traverse the channel and reach its downstream router.

Each pipeline stage requires one cycle to perform; therefore, a flit needs five cycles to traverse the router. Different techniques have been proposed to reduce the number of stages in order to increase throughput and decrease latency. Figure 2.5(b) shows the reduced

pipeline of the generic router using look-ahead routing. RC is done at the previous router, and the result is written to the flit and transmitted to the current router, so that no RC is needed for switching the flit to the next router, reducing the number of stages from 5 to 4 [46, 47]. Figure 2.5(c) shows the reduced pipeline of the generic router via speculation. Speculative SA is performed in order to finish both VA and SA in the same cycle, also reducing the number of stages from 5 to 4 [48]. Figure 2.5(d) shows the further reduced pipeline of the generic router with both look-ahead routing and speculation [3]. This reduces the number of stages from 4 to 3. Eliminating pipeline stages results in reduced delay for packets in each router, which in turn decreases the average latency. Because packets spend less time traveling in the network, the now-unoccupied cycles enable more packets to be transmitted, increasing the throughput of the network.

2.5 Flow Control

Flow control is the key feature for maintaining the efficiency of allocating network resources, such as buffers and channel bandwidth, to the packets traversing the network. Different flow control methods can yield vastly different throughput from the same physical network: good flow control methods make full use of resources and arrange packets in the most productive way, while poor flow control methods waste resources by allocating them to idle packets and blocking active packets from progressing, thus causing more conflicts [3]. There are two types of flow control methods, bufferless and buffered. Bufferless networks are typically simpler and more power-efficient than buffered networks, since all of the buffers and associated control units are eliminated. However, the lack of buffering means that packets are either misrouted or dropped in case of conflicts. As a result, bufferless networks cannot outperform networks with buffers in terms of saturation throughput [3].
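The latency benefit of the shorter pipelines can be illustrated with a small sketch. The hop count and packet length below are hypothetical, and the model simplifies real timing: the head flit pays one full pipeline traversal per hop, and the remaining flits trail one cycle apart.

```python
def zero_load_latency(stages, hops, packet_flits):
    """Zero-load packet latency: the head flit pays `stages` cycles per
    hop (link traversal included), and the body/tail flits of the packet
    arrive one cycle apart behind the head."""
    return stages * hops + (packet_flits - 1)

# Example: a 5-flit packet traveling 4 hops (hypothetical values).
latencies = {s: zero_load_latency(s, hops=4, packet_flits=5)
             for s in (5, 4, 3)}
```

Under these assumptions, going from a 5-stage to a 3-stage pipeline cuts the zero-load latency of the example packet from 24 to 16 cycles.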

Figure 2.6: An illustration of credit-based flow control. After sending flit 0, router 1 has to halt the sending of flit 1 and wait for a credit from router 2. At the time flit 0 departs from router 2, a credit is sent back from router 2 to router 1. The sending of flit 1 from router 1 to router 2 is resumed after router 1 receives the credit [3].

On the other hand, buffered networks utilize buffers to store packets when conflicts occur. Although buffers can consume a significant portion of the total network power, they can also largely improve performance. In order to keep track of the downstream buffer utilization in each router, credit-based flow control is commonly implemented as the method of buffer management. When credit-based flow control is used, the number of available buffer slots in each downstream VC is maintained in the upstream router as credits. Figure 2.6 illustrates a scenario where each router has one credit. When flit 0 departs from router 1, the flit will consume a buffer slot in router 2. Therefore, the number

of credits in router 1 is decremented by one. Flit 1 cannot be transmitted instantly, as the number of credits is zero. After flit 0 leaves router 2, a credit is returned to router 1 to increment the number of credits by one, and flit 1 can then be transmitted to router 2. Credit-based flow control ensures that all transmitted flits will be successfully buffered [3].

2.6 Routing Algorithm

Different routing algorithms can significantly affect the overall performance of the network by changing the way RC computes the next output port. In general, deterministic routing algorithms are simpler to implement, and they can provide better performance with random traffic patterns. In contrast, adaptive routing algorithms need more control units to gather information in order to make routing decisions that improve the utilization of less-used resources. Good adaptive routing algorithms can provide performance comparable to deterministic routing algorithms under random traffic patterns. For permutation traffic patterns, where each node transmits data to a fixed destination node, adaptive routing algorithms can outperform almost all deterministic routing algorithms [49]. This is explained by the fact that adaptive routing algorithms provide changeable paths for fixed source/destination pairs to balance the load; therefore, congestion is less likely to appear.

The most common deterministic routing algorithm is dimension-order routing (DOR) [3]. Figure 2.7(a) shows the path from (0, 0) to (2, 2) using DOR. Packets traverse the x-dimension first, and then traverse the y-dimension. As packets cannot turn from the y-dimension to the x-dimension, forming a cyclic dependency is impossible; therefore, DOR is deadlock-free. The algorithm can be described by the following pseudo-code, where (Xcur, Ycur) represents the current node and (Xdest, Ydest) represents the destination node [50]:

Figure 2.7: (a) The path from source node (0, 0) to destination node (2, 2) using the dimension-order routing algorithm. (b) One possible path from source node (0, 0) to destination node (2, 2) using the west-first adaptive routing algorithm, avoiding a congested node.

if (Xcur ≠ Xdest)
    if (Xcur > Xdest) go to x-
    else if (Xcur < Xdest) go to x+
else if (Ycur ≠ Ydest)
    if (Ycur > Ydest) go to y-
    else if (Ycur < Ydest) go to y+
else
    go to Processing Element
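The DOR pseudo-code above can also be written as a short executable sketch. Python is used purely for illustration; the thesis itself does not provide an implementation, and the coordinate and port names follow the convention used in the pseudo-code.

```python
def dor_route(cur, dest):
    """Dimension-order routing: correct the x-dimension first, then the
    y-dimension; eject to the processing element (PE) on arrival."""
    x_cur, y_cur = cur
    x_dest, y_dest = dest
    if x_cur != x_dest:
        return "x-" if x_cur > x_dest else "x+"
    if y_cur != y_dest:
        return "y-" if y_cur > y_dest else "y+"
    return "PE"
```

Routing from (0, 0) toward (2, 2), for example, resolves the x-dimension first, so the first hop is taken on x+.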

Adaptive routing algorithms allow packets to make any kind of turn, which could result in a cyclic dependency and create deadlocks [3]. West-first adaptive routing (WF) overcomes the deadlock problem by prohibiting turns from y+ or y- to x-, so that no cyclic dependency can be formed [50]. Figure 2.7(b) shows one possible path from (0, 0) to (2, 2). A packet faces two situations under this routing algorithm. When the destination node is to the left of the source node (boundary included), the packet traverses the x-dimension first, as if DOR were used. Only when the destination node is strictly to the right of the source node can the packet turn freely along the way from the source. The algorithm can be described by the following pseudo-code [50]:

if (Xcur > Xdest)
    go to x-
else if (Xcur = Xdest)
    if (Ycur > Ydest) go to y-
    else if (Ycur < Ydest) go to y+
    else if (Ycur = Ydest) go to Processing Element
else if (Xcur < Xdest)
    if (Ycur < Ydest) choose between y+ and x+
    else if (Ycur > Ydest) choose between y- and x+
    else if (Ycur = Ydest) go to x+
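Similarly, the west-first rules can be sketched as a function that returns the set of permitted output ports, leaving the adaptive choice between two ports to the caller. This is an illustrative sketch, not the thesis's implementation.

```python
def wf_route(cur, dest):
    """West-first adaptive routing: when the destination lies west
    (x- direction), route west first with no adaptivity; adaptivity is
    permitted only when the destination lies strictly east."""
    x_cur, y_cur = cur
    x_dest, y_dest = dest
    if x_cur > x_dest:
        return {"x-"}                 # must travel west first
    if x_cur == x_dest:
        if y_cur > y_dest:
            return {"y-"}
        if y_cur < y_dest:
            return {"y+"}
        return {"PE"}                 # arrived: eject
    # Destination strictly east: free choice between x+ and a y direction.
    if y_cur < y_dest:
        return {"x+", "y+"}
    if y_cur > y_dest:
        return {"x+", "y-"}
    return {"x+"}
```

For the source/destination pair of Figure 2.7(b), (0, 0) to (2, 2), the router may choose freely between x+ and y+ at each intermediate node, which is what allows the congested node to be avoided.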

3 DXbar: A Power-Efficient and High-Performance Crossbar Design

This chapter discusses the proposed DXbar design in detail. The first section describes the architecture of DXbar. The following three sections illustrate the implementation of DXbar routers, provide several scenarios demonstrating how DXbar routers work, and show how fairness between the two crossbars is maintained. The fifth section explains the fault tolerance feature of the modified DXbar design. The last section presents the dual-directional crossbar, which reduces the area overhead of DXbar.

3.1 Introduction of DXbar

The proposed DXbar is a high-performance and power-efficient dual-crossbar design targeting both performance enhancement and power reduction. It is well known that bufferless networks can provide power efficiency and low latency at low network load, and that buffered networks can provide cost-efficient conflict handling at high network load [3, 34, 35]. In order to take advantage of both bufferless and buffered networks, the proposed design is a combination of a bufferless primary crossbar and a buffered secondary crossbar. At a low network load, almost all packets traverse only the primary crossbar and take minimal routes, experiencing delays similar to what they would experience in a bufferless network. Whenever there is a conflict, the losing packet is buffered in the secondary crossbar and waits until its desired output port is free. As the majority of packets are not buffered, VCs are eliminated to reduce the complexity of buffer organization and control overhead. The buffered secondary crossbar enables blocked packets to be removed from the critical path of the bufferless primary crossbar, thus maintaining the continuous flow of packets and achieving low latency and low power. Moreover, packets which are not

Figure 3.1: The architecture of a DXbar router.

buffered simply traverse a switch; therefore, the power of buffer reads and writes is saved. As packets do not take extra hops when conflicts arise, there are no extra link traversals as the network load keeps increasing, ensuring that the design is power-efficient at any network load.

3.2 Architecture

Figure 3.1 illustrates the router architecture of DXbar, which differs from the generic router in Figure 2.4. First of all, there are two crossbars within each router, with the primary crossbar having four input ports, the secondary crossbar having five input ports, and both of them having five output ports. The four input ports of the router are connected to both crossbars via input de-multiplexers (DEMUX), and the injection port of the PE is

connected to the last input port of the secondary crossbar. All five output ports of the two crossbars are connected via output multiplexers (MUX), which are then connected to the output ports of the router and the ejection port of the processing element (PE). Secondly, buffer slots are deployed in front of the first four input ports of the secondary crossbar. No buffer slots are deployed in front of the input ports of the primary crossbar, or in front of the last input port of the secondary crossbar. The buffer slots are connected serially, thus eliminating VCs and the corresponding virtual-channel allocator. Finally, the look-ahead signal from the upstream router is connected directly to the router, and then to the switch allocator. The switch allocator is modified to control the de-multiplexers, the crossbars, and the multiplexers to maintain the correct packet flow in both crossbars.

3.3 Router Implementation

The elimination of VCs eliminates the VA stage and simplifies the SA stage, so that SA and ST can be performed in the same cycle. By using the look-ahead signal, the routing information can traverse a separate narrow link to the downstream router in the look-ahead link-traversal (LA LT) stage, and the look-ahead routing computation (LA RC) stage can be performed in parallel with the LT stage. As a result, the reduced pipeline consists of two stages, as shown in Figure 3.2. All the stages shown in dark fill color are look-ahead stages. This proposed pipeline is similar to the pipelines used in Flit-Bless [34] and SCARAB [35]. Credit-based flow control is adopted in this design. The arbitration of the switch allocation is performed at the beginning of the cycle. The flits which arrive at the input ports of the router are called incoming flits, and flits which are in the buffer slots and the PE are called buffered flits. Before the arbitration starts, all of the contending flits are ranked based on their ages.
An older flit is ranked higher than a younger flit, but the incoming flits have higher ranks than buffered flits. For example, there are two incoming flits A and B, and one buffered flit C. A is 24 cycles old, B is 20 cycles old, and

Figure 3.2: The pipeline of a DXbar router.

C is 22 cycles old. C is older than B, but is ranked lower than B, as an incoming flit has a higher rank than a buffered flit. Therefore, the final rank is A, B, and then C. Once the rank is calculated, the switch allocator tries to assign each flit its desired output port, provided the port has not been assigned to a higher-ranked flit. Using the same example, the desired output ports of A, B, and C are x+, x-, and x-, respectively. The switch allocator assigns x+ to A, and x- to B. As x- has already been assigned to another flit, C loses the arbitration. As implementing an NoC with age-based arbitration might be difficult, the arbitration can instead be based on the number of hops the flits have traversed. In this way, only a few bits in the look-ahead signal are required to represent the number of hops of a flit. If an incoming flit wins the arbitration, it traverses the primary crossbar and goes to its designated output port without being buffered. Otherwise, it is sent to the secondary crossbar and buffered. On the other hand, a buffered flit waits in the buffer slot or the PE until it wins the arbitration. The de-multiplexers and multiplexers are set to select the primary crossbar inputs/outputs by default. The switch allocator uses the arbitration result to set the two crossbars correctly, control all of the multiplexers to link each output port of the router to the correct crossbar, and control all of the de-multiplexers to direct incoming flits between the primary and the secondary crossbar.
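The ranking and port-assignment procedure described above can be sketched as follows. This is an illustrative model of the arbitration, not RTL; the tuple encoding of a request is a convenience invented here.

```python
def arbitrate(requests):
    """Rank flits (incoming before buffered, then older first) and grant
    each flit its desired output port if the port is still free.
    Each request is (name, is_incoming, age_in_cycles, desired_port);
    returns {name: port} for the winners, omitting the losers."""
    ranked = sorted(requests, key=lambda r: (not r[1], -r[2]))
    grants, taken = {}, set()
    for name, _, _, port in ranked:
        if port not in taken:
            grants[name] = port
            taken.add(port)
    return grants

# The example above: incoming A (24 cycles old, wants x+), incoming B
# (20 cycles old, wants x-), and buffered C (22 cycles old, wants x-).
grants = arbitrate([("A", True, 24, "x+"),
                    ("B", True, 20, "x-"),
                    ("C", False, 22, "x-")])
# A and B win their ports; C loses the arbitration despite being older
# than B, because incoming flits outrank buffered flits.
```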

Figure 3.3: An illustration of the dual crossbar: (a) No conflicts appear; the network operates as a bufferless network. (b) A conflict appears; one flit is sent to the secondary crossbar and buffered first. (c) The flit coming after the buffered flit sees no delay and can proceed without being buffered, ensuring that no instant back-pressure occurs. The injection port can send a flit whenever the desired output port is not occupied. (d) Even though another flit is coming in through the same input port, the buffered flit can proceed when the desired output port is not occupied. The incoming flit proceeds simultaneously, encountering no delay.

3.4 Walkthrough Examples

Figure 3.3 shows several possible scenarios to illustrate how flits are switched in the DXbar design. Figure 3.3(a) shows the scenario when there is no conflict among flits. Flits from the x+ and the x- input ports want to go to the x- output port and x+ output port, respectively. At the same time, flits from the y+ and the y- input ports want to go to the y- output port and y+ output

port, respectively. All four incoming flits are switched simultaneously, and the network operates similarly to a bufferless network, because no flits are buffered and latency is minimized. This is the best-case scenario. Figure 3.3(b) shows the scenario when a conflict appears. A flit from the y- input port and another flit from the x+ input port compete for the x- output port. Assume the flit from x+ is older; then that flit is switched from the x+ input port to the x- output port through the primary crossbar. The flit from the y- input port loses the arbitration, and the y- input de-multiplexer receives a control signal from the switch allocator, routing the flit from the y- input port to the secondary crossbar and buffering it. Figure 3.3(c) shows the scenario when another flit comes from the y- input port in the next cycle. Because the flit that came from the y- input port one cycle earlier is currently buffered in front of the secondary crossbar, the path from the y- input port to the primary crossbar is unoccupied. Therefore, the incoming flit in the current cycle can be switched to the desired output port. Otherwise, it would need to wait in the buffer until the previous flit is switched. The figure also illustrates how the injection ports work. Since the injection ports have the same level of priority as the buffers, a flit can be injected whenever the desired output port is not occupied. Figure 3.3(d) shows the scenario when no more flits from the input ports need the x- output port. The buffered flit can then traverse the secondary crossbar and be routed downstream. At the same time, a flit from the y- input port can traverse the primary crossbar to a different output port. This is not feasible in other designs, because only one flit can come from a particular input port at any single time.

3.5 Fairness Maintenance

Because the arbitration is age-based, flits coming from the nodes on the edges of the mesh network have higher priority when they pass through the nodes in the center. As a result, the flits injected by center nodes are more likely to lose the arbitration and traverse the secondary crossbar. More importantly, this scenario can limit the injection of flits at center nodes, as injection ports and buffers have lower priority than the input ports. A fairness issue thereby arises, since flits can be stalled in the injection ports or buffers for numerous cycles when the desired output ports are taken by other flits from the input ports. In order to maintain fairness between the primary crossbar and the secondary crossbar, a fairness counter is maintained in each router to count the number of times flits from the primary crossbar win consecutively in arbitration. The counter operates only when there are flits waiting in the buffers or in the injection port, and it is reset every time a waiting flit wins. Once the count exceeds a threshold, the priority flips so that flits from the secondary crossbar have higher priority, and they are assigned output ports ahead of the flits from the input ports. Although the continuous flow of flits on the primary crossbar is interrupted, the nodes in the center are able to inject new flits even if the network is heavily saturated, and the buffered flits are guaranteed to leave in a finite number of cycles. Setting the threshold too small could lead to excessive buffered flits and power wasted on additional buffer operations, while setting it too large does not help solve the fairness issue. After testing with different traffic patterns, the threshold is set to four, which brings the overall best performance.

3.6 Fault Tolerance

The duplication of the crossbar enables hardware-level fault tolerance. The router can avoid being completely disabled, even in the case of a permanent crossbar failure. Figure 3.4

Figure 3.4: The architecture of a DXbar router with fault tolerance capability.

depicts the revised router architecture of DXbar. When a crossbar failure is detected, the switch allocator signals all of the input de-multiplexers to direct incoming flits to the buffer slots. The router then functions similarly to the routers of a buffered network. Because either the primary or the secondary crossbar may fail, there must be a mechanism to direct the buffered flits to the working crossbar. Therefore, a set of 2×2 crossbars, controlled by signals from the switch allocator, is deployed between the buffer slots and the crossbars. A 1-bit signal can control these small crossbars, which means that the required modification of the switch allocator is minimal. As a result, flits can traverse the router through the buffer slots and the working crossbar, even if the other crossbar does not function.
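The degradation behavior described above can be summarized with a small sketch. This is a hypothetical model of the control decision only; the real steering is done by the switch allocator and the 2×2 crossbars.

```python
def router_mode(primary_ok, secondary_ok):
    """Decide how flits are steered after fault detection: with both
    crossbars healthy the router runs in normal DXbar mode; with one
    failure every flit is buffered and the 2x2 crossbars steer it to
    the surviving crossbar; with both failed the router is disabled."""
    if primary_ok and secondary_ok:
        return "dual"
    if primary_ok or secondary_ok:
        return "buffered-primary" if primary_ok else "buffered-secondary"
    return "failed"
```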

Figure 3.5: (a) The structure of a dual-directional crossbar. (b) An example showing how signals can enter from both directions. Buffers colored grey are turned off.

Please note that fault detection is not implemented in this work. In order to determine the performance degradation of DXbar when running with crossbar faults, the fault detection mechanism is assumed to be available and to work without affecting performance.

3.7 Dual-Directional Crossbar

The original DXbar design requires two crossbars, so the area for the crossbars is doubled compared to a generic router. Although the area overhead would become smaller with future technology scaling, modifications should still be made to minimize the negative impact on crossbar area. A better solution is to change the structure of the crossbar so that the dual crossbar can be replaced by a single modified crossbar. A segmented crossbar has been proposed in [29], which deploys tri-state buffers along the input lines of the crossbar. Turning the tri-state buffers off divides the input lines into several isolated segments. Using a similar technique, we design a dual-directional crossbar to enable data flow from both sides of the input lines, so that two flits from the same input port can go to different output ports in the same cycle. The proposed dual-directional crossbar is able to replace the dual crossbar and provide the same functionality.
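As a rough illustration of the segmentation idea, the following sketch models one dual-directional input line. The geometric assumption made here (output columns indexed left to right, a single split point between two segments) is a hypothetical model of the concept, not the circuit of Figure 3.5.

```python
def same_cycle_feasible(dest_from_left, dest_from_right):
    """One flit drives the input line from the left end and another from
    the right end. Turning off a tri-state buffer splits the line into
    two disjoint segments, so both flits can be switched in one cycle
    only if the left flit's output column lies strictly to the left of
    the right flit's output column."""
    return dest_from_left < dest_from_right
```

For example, a flit entering from the left and heading to output column 1 can share the cycle with a flit entering from the right and heading to column 3, but not the other way around, since the two segments would then have to overlap.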


More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

Express Virtual Channels: Towards the Ideal Interconnection Fabric

Express Virtual Channels: Towards the Ideal Interconnection Fabric Express Virtual Channels: Towards the Ideal Interconnection Fabric Amit Kumar, Li-Shiuan Peh, Partha Kundu and Niraj K. Jha Dept. of Electrical Engineering, Princeton University, Princeton, NJ 8544 Microprocessor

More information

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus)

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Routing Algorithm How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Many routing algorithms exist 1) Arithmetic 2) Source-based 3) Table lookup

More information

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees, butterflies,

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal Lecture 19 Interconnects: Flow Control Winter 2018 Subhankar Pal http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk,

More information

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics Lecture 14: Large Cache Design III Topics: Replacement policies, associativity, cache networks, networking basics 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that simultaneously

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Interconnection Network

Interconnection Network Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network

More information

PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS. A Thesis SONALI MAHAPATRA

PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS. A Thesis SONALI MAHAPATRA PRIORITY BASED SWITCH ALLOCATOR IN ADAPTIVE PHYSICAL CHANNEL REGULATOR FOR ON CHIP INTERCONNECTS A Thesis by SONALI MAHAPATRA Submitted to the Office of Graduate and Professional Studies of Texas A&M University

More information

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger

Interconnection Networks: Flow Control. Prof. Natalie Enright Jerger Interconnection Networks: Flow Control Prof. Natalie Enright Jerger Switching/Flow Control Overview Topology: determines connectivity of network Routing: determines paths through network Flow Control:

More information

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts

ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-26 1 Network/Roadway Analogy 3 1.1. Running

More information

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012.

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012. CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION by Stephen Chui Bachelor of Engineering Ryerson University, 2012 A thesis presented to Ryerson University in partial fulfillment of the

More information

Lecture 2: Topology - I

Lecture 2: Topology - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and

More information

Network on Chip Architecture: An Overview

Network on Chip Architecture: An Overview Network on Chip Architecture: An Overview Md Shahriar Shamim & Naseef Mansoor 12/5/2014 1 Overview Introduction Multi core chip Challenges Network on Chip Architecture Regular Topology Irregular Topology

More information

Chapter 1 Bufferless and Minimally-Buffered Deflection Routing

Chapter 1 Bufferless and Minimally-Buffered Deflection Routing Chapter 1 Bufferless and Minimally-Buffered Deflection Routing Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Abstract A conventional Network-on-Chip (NoC) router

More information

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson

Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies Mohsin Y Ahmed Conlan Wesson Overview NoC: Future generation of many core processor on a single chip

More information

TECHNOLOGY scaling is expected to continue into the deep

TECHNOLOGY scaling is expected to continue into the deep IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 9, SEPTEMBER 2008 1169 Adaptive Channel Buffers in On-Chip Interconnection Networks A Power and Performance Analysis Avinash Karanth Kodi, Member, IEEE, Ashwini

More information

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper {youngkan, tjkwon, draper}@isi.edu University of Southern California

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

Lecture 3: Topology - II

Lecture 3: Topology - II ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 3: Topology - II Tushar Krishna Assistant Professor School of Electrical and

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks

Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks 2080 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2012 Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks Rohit Sunkam

More information

SURVEY ON LOW-LATENCY AND LOW-POWER SCHEMES FOR ON-CHIP NETWORKS

SURVEY ON LOW-LATENCY AND LOW-POWER SCHEMES FOR ON-CHIP NETWORKS SURVEY ON LOW-LATENCY AND LOW-POWER SCHEMES FOR ON-CHIP NETWORKS Chandrika D.N 1, Nirmala. L 2 1 M.Tech Scholar, 2 Sr. Asst. Prof, Department of electronics and communication engineering, REVA Institute

More information

Lecture 23: Router Design

Lecture 23: Router Design Lecture 23: Router Design Papers: A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, ISCA 06, Penn-State ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip

More information

HANDSHAKE AND CIRCULATION FLOW CONTROL IN NANOPHOTONIC INTERCONNECTS

HANDSHAKE AND CIRCULATION FLOW CONTROL IN NANOPHOTONIC INTERCONNECTS HANDSHAKE AND CIRCULATION FLOW CONTROL IN NANOPHOTONIC INTERCONNECTS A Thesis by JAGADISH CHANDAR JAYABALAN Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of

More information

INTERCONNECTION NETWORKS LECTURE 4

INTERCONNECTION NETWORKS LECTURE 4 INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source

More information

WITH THE CONTINUED advance of Moore s law, ever

WITH THE CONTINUED advance of Moore s law, ever IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 11, NOVEMBER 2011 1663 Asynchronous Bypass Channels for Multi-Synchronous NoCs: A Router Microarchitecture, Topology,

More information

ES1 An Introduction to On-chip Networks

ES1 An Introduction to On-chip Networks December 17th, 2015 ES1 An Introduction to On-chip Networks Davide Zoni PhD mail: davide.zoni@polimi.it webpage: home.dei.polimi.it/zoni Sources Main Reference Book (for the examination) Designing Network-on-Chip

More information

Evaluating Bufferless Flow Control for On-Chip Networks

Evaluating Bufferless Flow Control for On-Chip Networks Evaluating Bufferless Flow Control for On-Chip Networks George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis Stanford University In a nutshell Many researchers report high buffer

More information

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals

Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals Joint consideration of performance, reliability and fault tolerance in regular Networks-on-Chip via multiple spatially-independent interface terminals Philipp Gorski, Tim Wegner, Dirk Timmermann University

More information

Interconnection Networks

Interconnection Networks Lecture 15: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Credit: some slides created by Michael Papamichael, others based on slides from Onur Mutlu

More information

Destination-Based Adaptive Routing on 2D Mesh Networks

Destination-Based Adaptive Routing on 2D Mesh Networks Destination-Based Adaptive Routing on 2D Mesh Networks Rohit Sunkam Ramanujam University of California, San Diego rsunkamr@ucsdedu Bill Lin University of California, San Diego billlin@eceucsdedu ABSTRACT

More information

Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs

Lecture 15: NoC Innovations. Today: power and performance innovations for NoCs Lecture 15: NoC Innovations Today: power and performance innovations for NoCs 1 Network Power Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton Energy for a flit

More information

TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks

TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks Gwangsun Kim Arm Research Hayoung Choi, John Kim KAIST High-radix Networks Dragonfly network in Cray XC30 system 1D Flattened butterfly

More information

548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011

548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011 548 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 4, APRIL 2011 Extending the Effective Throughput of NoCs with Distributed Shared-Buffer Routers Rohit Sunkam

More information

A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin

A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin 50 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 1, NO. 2, AUGUST 2009 A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin Abstract Programmable many-core processors are poised

More information

Power and Area Efficient NOC Router Through Utilization of Idle Buffers

Power and Area Efficient NOC Router Through Utilization of Idle Buffers Power and Area Efficient NOC Router Through Utilization of Idle Buffers Mr. Kamalkumar S. Kashyap 1, Prof. Bharati B. Sayankar 2, Dr. Pankaj Agrawal 3 1 Department of Electronics Engineering, GRRCE Nagpur

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

ECE/CS 757: Advanced Computer Architecture II Interconnects

ECE/CS 757: Advanced Computer Architecture II Interconnects ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction

More information

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms CS252 Graduate Computer Architecture Lecture 16 Multiprocessor Networks (con t) March 14 th, 212 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

The final publication is available at

The final publication is available at Document downloaded from: http://hdl.handle.net/10251/82062 This paper must be cited as: Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Routerless Networks-on-Chip

Routerless Networks-on-Chip Routerless Networks-on-Chip Fawaz Alazemi, Arash Azizimazreah, Bella Bose, Lizhong Chen Oregon State University, USA {alazemif, azizimaa, bose, chenliz}@oregonstate.edu ABSTRACT Traditional bus-based interconnects

More information

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Nauman Jalil, Adnan Qureshi, Furqan Khan, and Sohaib Ayyaz Qazi Abstract

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect SAFARI Technical Report No. 211-8 (September 13, 211) MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin Gregory Nazario Xiangyao Yu cfallin@cmu.edu gnazario@cmu.edu

More information

18-740/640 Computer Architecture Lecture 16: Interconnection Networks. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015

18-740/640 Computer Architecture Lecture 16: Interconnection Networks. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015 18-740/640 Computer Architecture Lecture 16: Interconnection Networks Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/4/2015 Required Readings Required Reading Assignment: Dubois, Annavaram,

More information

CHIPPER: A Low-complexity Bufferless Deflection Router

CHIPPER: A Low-complexity Bufferless Deflection Router CHIPPER: A Low-complexity Bufferless Deflection Router Chris Fallin Chris Craik Onur Mutlu cfallin@cmu.edu craik@cmu.edu onur@cmu.edu Computer Architecture Lab (CALCM) Carnegie Mellon University SAFARI

More information

Extended Junction Based Source Routing Technique for Large Mesh Topology Network on Chip Platforms

Extended Junction Based Source Routing Technique for Large Mesh Topology Network on Chip Platforms Extended Junction Based Source Routing Technique for Large Mesh Topology Network on Chip Platforms Usman Mazhar Mirza Master of Science Thesis 2011 ELECTRONICS Postadress: Besöksadress: Telefon: Box 1026

More information

Design of a high-throughput distributed shared-buffer NoC router

Design of a high-throughput distributed shared-buffer NoC router Design of a high-throughput distributed shared-buffer NoC router The MIT Faculty has made this article openly available Please share how this access benefits you Your story matters Citation As Published

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 2, Issue 9, September 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Heterogeneous

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Routing Algorithms. Review

Routing Algorithms. Review Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent

More information

Introduction to ATM Technology

Introduction to ATM Technology Introduction to ATM Technology ATM Switch Design Switching network (N x N) Switching network (N x N) SP CP SP CP Presentation Outline Generic Switch Architecture Specific examples Shared Buffer Switch

More information

ECE 551 System on Chip Design

ECE 551 System on Chip Design ECE 551 System on Chip Design Introducing Bus Communications Garrett S. Rose Fall 2018 Emerging Applications Requirements Data Flow vs. Processing µp µp Mem Bus DRAMC Core 2 Core N Main Bus µp Core 1 SoCs

More information

Prediction Router: Yet another low-latency on-chip router architecture

Prediction Router: Yet another low-latency on-chip router architecture Prediction Router: Yet another low-latency on-chip router architecture Hiroki Matsutani Michihiro Koibuchi Hideharu Amano Tsutomu Yoshinaga (Keio Univ., Japan) (NII, Japan) (Keio Univ., Japan) (UEC, Japan)

More information

JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS

JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS 1 JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS Shabnam Badri THESIS WORK 2011 ELECTRONICS JUNCTION BASED ROUTING: A NOVEL TECHNIQUE FOR LARGE NETWORK ON CHIP PLATFORMS

More information

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 18: Communication Models and Architectures: Interconnection Networks Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi

More information

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip Anh T. Tran and Bevan M. Baas Department of Electrical and Computer Engineering University of California - Davis, USA {anhtr,

More information

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management Marina Garcia 22 August 2013 OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management M. Garcia, E. Vallejo, R. Beivide, M. Valero and G. Rodríguez Document number OFAR-CM: Efficient Dragonfly

More information

The Impact of Optics on HPC System Interconnects

The Impact of Optics on HPC System Interconnects The Impact of Optics on HPC System Interconnects Mike Parker and Steve Scott Hot Interconnects 2009 Manhattan, NYC Will cost-effective optics fundamentally change the landscape of networking? Yes. Changes

More information

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP 1 M.DEIVAKANI, 2 D.SHANTHI 1 Associate Professor, Department of Electronics and Communication Engineering PSNA College

More information

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013 NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching

More information

Basic Switch Organization

Basic Switch Organization NOC Routing 1 Basic Switch Organization 2 Basic Switch Organization Link Controller Used for coordinating the flow of messages across the physical link of two adjacent switches 3 Basic Switch Organization

More information

MULTIPATH ROUTER ARCHITECTURES TO REDUCE LATENCY IN NETWORK-ON-CHIPS. A Thesis HRISHIKESH NANDKISHOR DESHPANDE

MULTIPATH ROUTER ARCHITECTURES TO REDUCE LATENCY IN NETWORK-ON-CHIPS. A Thesis HRISHIKESH NANDKISHOR DESHPANDE MULTIPATH ROUTER ARCHITECTURES TO REDUCE LATENCY IN NETWORK-ON-CHIPS A Thesis by HRISHIKESH NANDKISHOR DESHPANDE Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment

More information