
High-Performance Crossbar Designs for Network-on-Chips (NoCs)

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Yixuan Zhang

August 2010

© 2010 Yixuan Zhang. All Rights Reserved.

This thesis titled High-Performance Crossbar Designs for Network-on-Chips (NoCs) by YIXUAN ZHANG has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Avinash Kodi, Assistant Professor of Electrical Engineering and Computer Science

Dennis Irwin, Dean of the Russ College of Engineering and Technology

Abstract

ZHANG, YIXUAN, M.S., August 2010, Electrical Engineering
High-Performance Crossbar Designs for Network-on-Chips (NoCs) (60 pp.)
Director of Thesis: Avinash Kodi

The packet-switched Network-on-Chip (NoC) architecture is considered an attractive approach for overcoming bottlenecks such as wire delay and scalability seen in Chip-Multiprocessors (CMPs). However, current NoC designs rely heavily on in-router buffers, which are a major source of power consumption. Eliminating all router input buffers results in dropped or deflected packets when conflicts are frequent, thereby increasing power consumption. In this thesis, with both enhancing performance and decreasing power consumption as our goals, we propose an innovative dual-crossbar design called DXbar, which combines the advantages of bufferless networks and traditional buffered networks: it enables low-latency routing at low network load and provides limited buffering capability to handle excess packets at high network load. The DXbar design not only has superior performance compared to the state-of-the-art design, but also achieves significant power savings. Simulation results for the proposed methodology show that DXbar achieves at least a 15% performance improvement and power saving over the baseline designs. Additionally, DXbar outperforms currently proposed bufferless designs with a large decrease in power consumption when network utilization is high. To overcome the area overhead, an innovative crossbar design named the dual-directional crossbar is also proposed in this thesis. This crossbar provides the same connectivity as the dual crossbar while occupying only half the area.

Approved: Avinash Kodi, Assistant Professor of Electrical Engineering and Computer Science

Acknowledgements

I would like to sincerely thank my parents, who provided me the chance to obtain my bachelor's degree at the most prestigious university in my home country, which in turn helped my dream of studying in the United States come true. I would also like to thank my advisor, Dr. Avinash Kodi, an undoubtedly responsible and helpful professor, who has provided me invaluable guidance throughout my years in graduate school. Additionally, I would like to thank Randy W. Morris, Jr., my colleague and a senior student in the lab, who acquainted me with the working environment and helped me start my project. Special thanks to my beloved fiancée Yu Song, an MPA student at Ohio State University. Her care and love played an important role in helping me adapt to life in this country. Finally, I would like to thank Ohio University for giving me the opportunity to study for my master's degree, and the National Science Foundation for generously sponsoring my scholarship.

Table of Contents

Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
  1.1 Chip-Multiprocessors (CMPs)
  1.2 Network-on-Chips (NoCs)
  1.3 Related Work
  1.4 Proposed Approach
  1.5 Organization of Thesis
2 Background on Network-on-Chips
  2.1 Terminology
  2.2 Topology
    2.2.1 Ring
    2.2.2 Mesh
    2.2.3 Torus
    2.2.4 Cmesh
    2.2.5 Flattened Butterfly
  2.3 Router Architecture
  2.4 Pipeline Stages
  2.5 Flow Control
  2.6 Routing Algorithm
3 DXbar: A Power-Efficient and High-Performance Crossbar Design
  3.1 Introduction of DXbar
  3.2 Architecture
  3.3 Router Implementation
  3.4 Walkthrough Examples
  3.5 Fairness Maintenance
  3.6 Fault Tolerance
  3.7 Dual-Directional Crossbar
4 Performance Evaluation
  4.1 Methodology
  4.2 Area and Power Estimation
  4.3 Synthetic Traffic
  4.4 Real-Application Traces
  4.5 Fault Tolerance
5 Conclusion and Future Work
References

List of Tables

4.1 Processor parameters used for SPLASH-2 suite simulations
4.2 Cache and memory parameters used for SPLASH-2 suite simulations
4.3 Area and power estimation for different designs in an 8×8 2D mesh network. The area of one 2×2 crossbar is 9.8 µm. The crossbar power is 159.02 mW, the 2×2 crossbar power is 25.443 mW, and the link power is 88.807 mW. Power values are for one flit traversal

List of Figures

1.1 The trend of uniprocessor performance improvement since 1978 [1]
1.2 The trend of Intel uniprocessor power over several generations [2]
2.1 A network using ring topology
2.2 A network using (a) mesh (b) torus topology
2.3 A network using (a) Cmesh (b) Flattened Butterfly topology
2.4 The architecture of a generic router
2.5 (a) A 5-stage router pipeline and timing for different flits in a packet. (b) A 4-stage router pipeline with look-ahead routing. (c) A 4-stage router pipeline with speculative SA. (d) A 3-stage router pipeline with both look-ahead routing and speculative SA
2.6 An illustration of credit-based flow control. After sending flit 0, router 1 has to halt the sending of flit 1 and wait for a credit from router 2. When flit 0 departs from router 2, a credit is sent back from router 2 to router 1. The sending of flit 1 from router 1 to router 2 resumes after router 1 receives the credit [3]
2.7 (a) The path from source node (0, 0) to destination node (2, 2) using the dimension-order routing algorithm (b) One possible path from source node (0, 0) to destination node (2, 2) using the west-first adaptive routing algorithm
3.1 The architecture of a DXbar router
3.2 The pipeline of a DXbar router
3.3 The illustration of the dual crossbar: (a) No conflicts appear; the network operates as a bufferless network. (b) A conflict appears; one flit is sent to the secondary crossbar and buffered first. (c) The flit coming after the buffered flit sees no delay and can proceed without being buffered, ensuring that no instant back pressure occurs. The injection port can send a flit whenever the desired output port is not occupied. (d) Even though another flit is coming in through the same input port, the buffered flit can proceed when the desired output port is not occupied. The incoming flit proceeds simultaneously, encountering no delay
3.4 The architecture of a DXbar router with fault tolerance capability
3.5 (a) The structure of a dual-directional crossbar. (b) An example showing how a signal can enter from both directions. Buffers colored grey are turned off
3.6 (a) The architecture of a DXbar router with a dual-directional crossbar. (b) An example showing a small crossbar altering the two flits. Without the small crossbars, the routes of the displayed flits would overlap on the same input line of the crossbar, so the flits could not progress together
4.1 Throughput of the Uniform Random traffic pattern
4.2 Latency of (a) Uniform Random (b) Non-Uniform Random (c) Complement (d) Perfect Shuffle traffic patterns
4.3 Power of (a) Uniform Random (b) Non-Uniform Random (c) Complement (d) Perfect Shuffle traffic patterns
4.4 Throughput at an offered load of 0.5 for all synthetic traces
4.5 Power at an offered load of 0.5 for all synthetic traces
4.6 Normalized simulation time for all SPLASH-2 traces
4.7 Power for all SPLASH-2 traces
4.8 Throughput of fault-tolerant DXbar with (a) deterministic (b) adaptive routing algorithms
4.9 Latency of fault-tolerant DXbar with (a) deterministic (b) adaptive routing algorithms
4.10 Power of fault-tolerant DXbar with (a) deterministic (b) adaptive routing algorithms

1 Introduction

1.1 Chip-Multiprocessors (CMPs)

During the past several decades, the semiconductor industry has been developing and growing at an exponential pace [1]. The ever-shrinking clock cycle coupled with transistor scaling has resulted in high-performance and complex processing units [4]. High clock rates are achieved by technology scaling and by the use of pipelining [2]. Pipelining divides the execution of an instruction into smaller stages, each of which can run faster [5]. Another method for increasing processor performance is instruction-level parallelism (ILP). ILP overlaps the execution of several instructions in the same clock cycle by fully utilizing all the functional units [5]. The combination of pipelining, faster clock rates, and ILP provides significant performance improvements in uniprocessor designs.

In recent years, the growth in processor performance through these traditional approaches has slowed [1, 4]. Figure 1.1 shows the performance of an individual processor over the period from 1978 to 2006 [1]. It is clear that the annual increase in processor performance has declined from 52% between 1986 and 2002 to its present value of 25%. This is partially caused by the fact that shrinking transistor sizes have increased leakage current, crosstalk, and clock skew, preventing processors from working at higher operating frequencies [1, 4, 6, 7].

Power dissipation has become another critical factor contributing to the slower improvement rate in processor speeds [2]. This increase in power dissipation is attributed to the growing number of transistors due to technology scaling, combined with disproportional scaling of transistor power due to increased leakage current [2]. The maximum clock rate is also affected by power dissipation, as the overall power consumption of a digital circuit is directly proportional to frequency. The dynamic power

[Figure 1.1: The trend of uniprocessor performance improvement since 1978 [1].]

consumption of a digital circuit is given by [5]:

    P = C · V² · f    (1.1)

where P is the total power, C is the capacitance, V is the operating voltage, and f is the frequency. Another problem with increased power dissipation is the generation of heat, as silicon can only dissipate a maximum power of 200 Watts [8]. As a result, the performance of a processor depends on how well it manages and dissipates power. Two good examples illustrating the importance of power dissipation are the Intel Tejas processor [9] and the 4-GHz Pentium 4 processor [10], both of which were canceled due to their tremendous power consumption. It should be noted that previous generations of Intel uniprocessors achieved better performance at the cost of higher power dissipation, as shown in Figure 1.2 [2].

The limitations seen in uniprocessor designs led to a new era in computer architecture, in which engineers switched from designing more sophisticated uniprocessors to exploring

[Figure 1.2: The trend of Intel uniprocessor power of several generations [2].]

the benefits of placing multiple simpler cores on the same chip. Chip-Multiprocessors (CMPs) provide a power-efficient and cost-efficient method by placing multiple cores on the same chip [11]. In addition, replicating cores on a processor is a known method for increasing processor performance within the allowed power budget of modern processors [5, 12]. Moreover, CMPs have been thoroughly studied by researchers for high-performance computing, and their benefits are well known. There are several implementations of high-performance CMPs designed by leading institutions and companies. RAW, developed by MIT, is a 16-core CMP implemented in the IBM SA-27E ASIC process in 150 nm technology [13-15]. Intel TeraFlops is an 80-core CMP designed in 65 nm technology [16]. Sun Microsystems designed the 16-core Niagara and Rock CMPs, which are mainly used for data processing [17, 18].

1.2 Network-on-Chips (NoCs)

As the number of cores increases, the interconnection network between cores plays a leading role in a CMP's overall performance [19-22]. Traditionally, shared buses have been used as the communication fabric for CMPs with a small number of cores. As transistors continue to scale, resulting in higher clock frequencies, the number of clock cycles required to traverse from one end of the chip to the other increases, because the length of the chip remains the same [23]. This results in a bus speed slower than the potential clock frequency of the cores, as the delay of cores depends on local wires, which scale with transistors [24]. As simple shared buses are unsuitable for future CMPs with hundreds to thousands of cores [25], Network-on-Chips (NoCs) have emerged as a promising solution for the interconnection fabric of CMPs [19-21, 26-29].

NoCs provide a low-latency, low-power, high-bandwidth communication medium for CMPs by exploiting the benefits of a modular packet-switched network. The key components of NoCs are routers and links. A router is responsible for receiving and forwarding communication traffic, and a link provides a point-to-point connection between two routers. Each node in a NoC is directly connected to a limited number of neighboring nodes; therefore the wires in a link are short and able to achieve high speed.

In today's NoC field, researchers are continuously confronted by several major challenges: reducing power dissipation in the network, improving performance, and reducing area overhead. The power consumed by modern NoCs constitutes a large portion of total chip power. For example, 28% of the total chip power of the Intel TeraFLOPS processor is spent on communication, while the expectation was only 10%

[16]. Moreover, the interconnection network of MIT's RAW processor consumes 36% of the total chip power [29, 30]. Power dissipation in modern NoC architectures is mainly attributable to the power consumed in the links, crossbars, and input buffers. Amongst these components, input buffers alone can consume up to 35% of the total power of the whole interconnection network [29]. As a result, reducing the size of input buffers, or eliminating input buffers completely, is a natural approach to designing low-power NoCs. On the other hand, increasing the size of an input buffer allows more upstream packets to be transmitted and buffered, thus increasing throughput [3]. Therefore, simply reducing the size of the input buffers in each router may degrade performance, as the performance of an interconnection network is strongly affected by buffer size.

1.3 Related Work

Different types of techniques have been proposed to reduce the size of input buffers or to eliminate them outright. Recently, iDEAL (inter-router Dual-function Energy and Area-efficient Links) proposed to reduce the size of the input buffers and utilize repeaters along inter-router channels as storage units when needed, by appropriately controlling the dual-function repeaters [31]. As a result, the size of input buffers can be reduced without significantly reducing performance, because packets can be buffered along the links. The design maintains similar performance with only half the number of input buffers, yielding large area savings as well as power savings. The cost is increased latency due to Head-of-Line (HoL) blocking, and added complexity due to the extra control units. Another approach utilizing channel buffering is Elastic Channel Buffers (ECB), which replaces all the repeaters with flip-flops and eliminates the input buffers [32, 33]. In this design, the flip-flops along the links function as single-flit buffer queues, so that an ECB router operates similarly to a generic router. All packets are transmitted from one channel buffer

to the next by a handshaking protocol, which means that packets waste a clock cycle each time they traverse a flip-flop: in edge-triggered flip-flops, signals need a full cycle to go from input to output. This clocked nature of flip-flops results in excessive link delay even when all the buffers along the channels are unoccupied.

Bufferless routing is another novel approach, which eliminates all input buffers without utilizing channel buffering. Flit-Bless proposed a routing scheme that sends all incoming packets to output ports, irrespective of whether those output ports are productive [34]. Its age-based arbitration priority guarantees that the oldest incoming packet is routed to its most productive output port, while younger packets may be deflected to non-productive output ports and take a non-minimal number of hops before reaching their destinations. Due to the full adaptivity of Flit-Bless, each packet carries several desired output ports with which to compete in switch allocation. The excessive computation needed to find the final allocation limits the performance of the switch allocator, so routers in Flit-Bless can only operate at a lower frequency than other NoCs. Another design that improves upon bufferless routing is SCARAB (Single Cycle Adaptive Routing and Bufferless Network) [35]. In this design, packets are routed minimally and adaptively in order to ease the workload of switch allocation. If none of the productive output ports is available, the packet is dropped, and a NACK signal is transmitted through a dedicated circuit-switched NACK network to trigger a retransmission. Without the dedicated NACK network, a portion of the bandwidth of the data network would be consumed by NACK packets, which could lead to data packets being blocked or retransmissions being delayed. Both designs achieve substantial power savings and latency reduction at low network load, whereas at higher network load conflicts are more frequent and the rate of deflections or retransmissions increases drastically. Lacking buffers in which to wait, packets in both designs must keep progressing in undesired directions or be dropped, thus experiencing many more link and crossbar traversals. This leads to significantly higher power consumption for bufferless networks at high network load, and a lower saturation throughput than a generic buffered network.

Another approach to saving power is to modify the crossbar. The segmented crossbar is a power-driven design in which each input line of a crossbar is divided into several segments by tri-state buffers [29]. When not required, part of an input line can be turned off by shutting down the tri-state buffers. Because the input lines become shorter, less energy is needed to drive them, thereby saving crossbar power. The interconnection network of the Intel TeraFLOPS processor uses a double-pumped crossbar [16, 36], which halves the crossbar area and in turn saves crossbar power by requiring less energy to drive. Nonetheless, neither design improves performance.

Other designs target performance enhancement through different approaches. A dynamic buffering-resource allocation design named ViChaR (dynamic Virtual Channel Regulator) focuses on efficiently allocating buffers to all virtual channels by deploying a unified buffering unit instead of a series of separate buffers [37]. Because HoL blocking is eliminated and buffers can be shared by all packets, the required size of the input buffers is minimized. The design achieves good performance while saving area, at the cost of significantly increased complexity in buffer management. Express Virtual Channels (EVCs) deploy special VCs that enable packets to bypass buffer operations at intermediate nodes [38]. Namely, packets can traverse intermediate nodes with minimal delay in each router. Because buffer operations are less frequent, power is saved and latency is reduced. The drawback is that routing is complicated, since packets must choose between taking an express VC, possibly experiencing more hops, and taking a regular VC, encountering more per-hop delay.
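As a rough illustration of this trade-off, the sketch below models end-to-end delay with and without express bypassing. The cycle counts and the delay model itself are simplifying assumptions chosen for illustration, not figures or mechanisms taken from [38]:

```python
# Illustrative model of the express-bypass idea behind EVCs: a packet
# riding an express VC bypasses the router pipeline at intermediate
# nodes, paying far fewer cycles per hop than a packet that arbitrates
# at every router. Cycle counts below are assumed, not measured.

ROUTER_PIPELINE_CYCLES = 3   # full pipeline at every hop (regular VC)
BYPASS_CYCLES = 1            # bypass at an intermediate node (express VC)

def route_delay(total_hops, express_span):
    """Delay in cycles when express VCs span `express_span` hops.

    The packet rides express VCs for as many whole spans as fit,
    paying the full pipeline only at span endpoints and on the
    leftover non-express hops.
    """
    spans, leftover = divmod(total_hops, express_span)
    express_delay = spans * ((express_span - 1) * BYPASS_CYCLES
                             + ROUTER_PIPELINE_CYCLES)
    return express_delay + leftover * ROUTER_PIPELINE_CYCLES

baseline = route_delay(8, 1)   # span of 1 degenerates to regular VCs
with_evc = route_delay(8, 4)   # 4-hop express VCs
print(baseline, with_evc)      # 24 12
```

Even this toy model shows why the routing choice matters: the benefit only materializes when the remaining route is long enough to fill whole express spans.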
A low-cost router microarchitecture has been proposed that partitions the crossbar into two decoupled low-radix sub-crossbars, one per dimension, and connects them with an intermediate buffer queue [39]. Packets traveling within their own dimension are prioritized and encounter minimal delay, while turning packets must wait in the intermediate buffer queues until no more in-flight packets occupy the desired output ports. As with the bufferless designs, low network load yields satisfactory performance on this design, but the intermediate buffer queues become the prominent bottleneck as turning packets multiply with increasing network load. Another design utilizing a dual crossbar is the Distributed Shared-Buffer router [40]. In this design, buffer slots are sandwiched between two crossbars in order to rearrange packet sequences, so that conflicts can be avoided and packets switched more quickly. This dual-crossbar design targets improved throughput, yet increases power consumption considerably.

1.4 Proposed Approach

In this thesis, we propose DXbar, a high-performance and power-efficient dual-crossbar design targeting both performance enhancement and power reduction. The major contributions of this work are as follows:

We implement a primary crossbar with no input buffers. The bufferless network provides a simple and efficient switching function for packets, without any buffer operations. Packets traversing the primary crossbar need only one cycle to be switched. The elimination of buffers also results in decreased power consumption and latency. Compared to the baseline designs, the proposed approach reduces power by 15% and latency by 50% for the Uniform Random traffic pattern.

We also implement a secondary crossbar with buffers. Because of the secondary crossbar, packets do not need to take extra hops to resolve conflicts; therefore power consumption at both low and high network load is reduced. Compared to bufferless networks at high network load, the proposed approach reduces power by at least 30%.

We utilize a dual-crossbar design to improve performance. Cycle-accurate simulation on an 8×8 2D mesh topology shows that DXbar achieves at least a 15% performance improvement over the baseline designs with both synthetic traces and real-application traces from SPLASH-2. The proposed design also outperforms current bufferless networks.

We further evaluate hardware-level fault tolerance with fault-detecting units and control units deployed. The failure of either crossbar does not disable the router, and all packets can be routed through the working crossbar. The performance degradation is at most 10% even when every node has one permanent crossbar failure.

1.5 Organization of Thesis

The rest of the thesis is organized as follows. Chapter 2 introduces background on NoCs, including commonly used terms, router architecture, and router implementation, to provide the information necessary for understanding the proposed DXbar design. Chapter 3 illustrates the architecture of DXbar and how it works in detail, and covers several key features that expand the functionality of DXbar. Chapter 4 starts by introducing the methodology used in the performance analysis, and presents different types of evaluation results to illustrate the performance of the proposed DXbar design compared to other related designs. Chapter 5 concludes the thesis and states future work that could lead to further performance improvement.
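The dual-crossbar behavior summarized in these contributions can be pictured as a per-cycle switching decision: a flit first tries the single-cycle bufferless primary crossbar, and is diverted to the buffered secondary crossbar only on an output-port conflict. The sketch below is a high-level illustration of that idea with invented identifiers; it is not the thesis's actual router implementation, and details such as fairness between buffered and fresh flits are deliberately simplified:

```python
# High-level sketch of the dual-crossbar switching idea: flits cross the
# bufferless primary crossbar in one cycle unless their output port is
# taken, in which case they are parked in the buffered secondary crossbar
# instead of being dropped or deflected. Illustrative only, not the RTL.

class DXbarSketch:
    def __init__(self):
        self.secondary_queue = []          # flits parked after a conflict

    def switch(self, incoming):
        """incoming: list of (flit, desired_output) pairs for one cycle.

        Returns the (flit, output) pairs that leave the router this cycle.
        """
        granted, used_outputs = [], set()
        # Previously buffered flits drain first when their port frees up.
        for flit, out in list(self.secondary_queue):
            if out not in used_outputs:
                used_outputs.add(out)
                granted.append((flit, out))
                self.secondary_queue.remove((flit, out))
        # Fresh flits take the single-cycle primary crossbar if possible;
        # on an output conflict they go to the secondary crossbar.
        for flit, out in incoming:
            if out in used_outputs:
                self.secondary_queue.append((flit, out))   # conflict
            else:
                used_outputs.add(out)
                granted.append((flit, out))
        return granted

router = DXbarSketch()
print(router.switch([("A", 1), ("B", 1)]))   # B conflicts and is buffered
print(router.switch([("C", 2)]))             # B drains alongside C
```

The point of the sketch is the fallback order: buffering is the exception path, so at low load the router behaves as a pure bufferless design, which is where the power savings come from.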

2 Background on Network-on-Chips

This chapter introduces NoCs by covering some of the fundamental knowledge which is used to demonstrate the proposed design in the next chapter. First, several commonly used terms in designing NoCs are listed and explained. Second, topology basics are introduced, and common topologies used in various NoCs are presented and compared. Finally, the architecture of a NoC router is illustrated, as well as different flow control techniques and routing algorithms.

2.1 Terminology

There are several terms which are widely used in the research field of NoCs.

Node A node is the combination of a processing element (PE) and a router.

Router A router is the network interface block of a node. It is capable of receiving and forwarding information from the input ports to the correct output ports.

Bandwidth Bandwidth is the data rate between two adjacent nodes, and is one of the major parameters of an NoC affecting overall performance.

Flit A flit is the basic unit of information which NoCs use to allocate channels and storage. The data width of a flit is the width of the channels, and a flit requires one cycle to transmit.

Packet A packet is the unit of information transmitted among nodes. Packets are composed of one or multiple flits, depending on the packet type.

Latency Latency is the time for a packet to traverse the network from its source node to its destination node, and is calculated as [3]

    T = T_h + L / b    (2.1)

where T_h is the time for the head flit to traverse the network, L is the length of the packet, and b is the bandwidth. Latency is measured from the time at which the head flit is injected to the time at which the tail flit is ejected from the network.

Throughput Throughput is the data rate that the network accepts per input port [3]. The saturation throughput is reached when throughput no longer increases with offered load.

Capacity The capacity of the network is its ideal throughput on uniform traffic.

Power Power is the amount of energy dissipated per unit time in both the routers and channels as a packet traverses the network from its source core to its destination core. The power for a flit to traverse the network is calculated as [3]

    P_flit = P_router × H + P_channel × H    (2.2)

where P_router is the power consumed by a router, P_channel is the power consumed by a channel, and H is the number of hops from the source core to the destination core. The total power to transmit a packet equals P_flit multiplied by the number of flits in the packet.

Livelock Livelock implies that a packet never reaches its destination core, even though it is always progressing forward in the network.

Deadlock Deadlock implies a cyclic dependency in which all packets in the cycle are waiting for the packet in front to release its resource, and none of them can progress.

Deterministic Routing A deterministic routing algorithm routes all packets along the same path between a given pair of nodes.

Adaptive Routing An adaptive routing algorithm can provide multiple paths for a packet to reach its destination. The choice among possible paths is made based on the load of the network.

2.2 Topology

The topology of an NoC refers to the pattern by which nodes and channels are arranged. Different topologies are used in building NoCs, providing a variety of networks with different characteristics, such as diameter, radix, and bisection bandwidth [3]. Diameter is the length of the longest minimal route from any source node to any destination node. A larger diameter can result in more hops for a packet to travel before reaching its destination node, and usually leads to larger latency. Radix is the number of input/output ports of each router, which indicates the complexity of the router architecture. A larger radix can increase the connectivity of individual nodes and potentially reduce the diameter, but can also increase the area overhead and power dissipation of the entire network. Bisection bandwidth is the aggregate bandwidth of the minimum number of wires whose removal partitions the network in half. A larger bisection bandwidth can indicate better worst-case performance. The commonly used topologies are ring, mesh, torus, Cmesh, and flattened butterfly.

2.2.1 Ring

Figure 2.1 shows a ring topology with 8 nodes [3]. The routers in a ring network have a very low radix of only two. Moreover, the network is simple, and it is easy to add more nodes. The major drawback of the ring topology is that when the network size becomes large, the diameter increases and packets need to travel more hops. Furthermore, the bandwidth of the ring topology is limited, as each node has only two channels with which it connects to other nodes. As a result, the ring topology is suitable for small network sizes. An example of the ring topology is the NoC of the IBM Cell processor [41].
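The latency and power expressions of Equations (2.1) and (2.2) can be made concrete with a short script; the numeric values below are illustrative assumptions for the example, not measurements from this thesis.

```python
# Illustrative check of Equations (2.1) and (2.2); the function names and
# the numbers are my own assumptions, not results from this work.

def packet_latency(t_head, length_bits, bandwidth):
    """Equation (2.1): T = T_h + L / b."""
    return t_head + length_bits / bandwidth

def flit_power(p_router, p_channel, hops):
    """Equation (2.2): P_flit = P_router * H + P_channel * H."""
    return p_router * hops + p_channel * hops

# A 5-flit packet on 128-bit channels: the head flit spends 3 cycles in each
# of 4 routers (T_h = 12), and serialization adds L / b = 640 / 128 = 5 cycles.
print(packet_latency(t_head=12, length_bits=5 * 128, bandwidth=128))  # 17.0

# Per-flit power over 4 hops, multiplied by 5 flits for the whole packet.
print(flit_power(p_router=1.0, p_channel=0.5, hops=4) * 5)  # 30.0
```

Note that the serialization term L / b is paid only once, not once per hop, which is why pipelined networks hide packet length well on long routes.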

Figure 2.1: A network using ring topology.

2.2.2 Mesh

Figure 2.2(a) shows a mesh topology with 16 nodes [3]. Each router is identified as (x, y), and is connected to its neighboring routers in the cardinal directions x+, x-, y+, and y-. The topology provides relatively high bandwidth with a moderate radix of four. The major drawback of the mesh topology is its high diameter. For example, the diameter of the shown network is six, from node (0, 0) to node (3, 3). Another drawback is that nodes at the center of the network are busier than nodes on the fringes, as all of the side-to-side traffic has to traverse the center of the network [3, 20]. The mesh topology is widely used in different NoCs; good examples are the Intel TeraFlop processor [16] and MIT's RAW processor [13-15].

2.2.3 Torus

Figure 2.2(b) shows a torus topology with 16 nodes [3]. The topology has a grid shape similar to that of the mesh topology, with additional wrap-around channels for the nodes on the fringes. The topology yields a smaller diameter than a mesh of the same network size; for example, a torus with a network size of 16 has a diameter of four. The extra channels also provide better path diversity, and alleviate the burden on the

center nodes [3, 20]. The major drawback is the length of the long wrap-around channels, which leads to higher power dissipation and latency. Folding the network reduces this negative impact, but increases the difficulty of the manufacturing process [42]. The torus has been used extensively for off-chip networks, such as the SeaStar interconnect found in the Cray XT5m [43].

Figure 2.2: A network using (a) mesh (b) torus topology.

2.2.4 Cmesh

Figure 2.3(a) shows the Cmesh topology with 64 cores [20]. Every four nodes are concentrated into one high-radix router to form a tile. The diameter is reduced so that packets need fewer hops to traverse the network, thereby decreasing latency. The major drawback is that a node is shared by multiple cores. When packets from different cores need to go to other nodes, the possibility of conflict increases, which leads to a reduction in throughput.

2.2.5 Flattened Butterfly

Figure 2.3(b) shows a flattened butterfly topology with 64 nodes. Similar to the Cmesh topology, every four nodes are concentrated to form a tile. The difference is that a tile is connected not only to its neighbors, but also to all the other tiles in its row and column. Therefore, the diameter of a flattened butterfly network is reduced to two, and the capacity

of the network is largely increased [44]. The major drawback is the high radix of the nodes, which grows fast and becomes costly as the network size increases. In addition, the long channels which connect non-adjacent tiles may require several cycles for data to travel, and can consume a significant amount of power.

Figure 2.3: A network using (a) Cmesh (b) flattened butterfly topology.

2.3 Router Architecture

A router in an NoC consists of the buffers, switches, and control units required to store and forward flits from the input ports to the desired output ports. The architecture is similar to that of modern Ethernet routers, but with smaller area and buffer size. Figure 2.4 shows a generic 5×5 NoC router of a mesh network with 16 buffer slots per input port. The buffer slots are divided into four queues, and each queue is called a virtual-channel (VC) [3, 45]. There are four cardinal input ports and output ports connected from and to the x+, x-, y+, and y- directions. The last pair of input/output ports is connected from and to the processing element (PE).

2.4 Pipeline Stages

NoC routers are pipelined at the flit level to better utilize all the control units and improve throughput. Figure 2.5(a) shows the pipeline of a generic 5-stage router and the timing of different flits in a packet [3]. The stages are: Routing Computation (RC), VC

Allocation (VA), Switch Allocation (SA), Switch Traversal (ST), and Link Traversal (LT). RC, VA, and SA are performed by the routing computation logic, the VC allocator, and the switch allocator, respectively.

Figure 2.4: The architecture of a generic router.

RC When the head flit of a packet is stored in a virtual-channel, the routing information carried by the head flit is input to the router to determine the output port of the packet. Once the result is calculated, all of the flits in the same packet must use the same output port.

VA When the output port is determined, the result is input to the virtual-channel allocator to assign a single output virtual-channel on the corresponding output port. If the

allocation fails, the head flit needs to wait until the output port has a free VC to assign. The allocation is performed for the head flit only.

SA When the output VC is assigned, per-packet operations are completed and switch allocation is performed flit by flit. All of the flits in a packet consecutively bid for a single-flit time slot to traverse the switch.

ST When the switch is allocated to a flit, the flit uses one cycle to traverse the switch to the desired output port.

LT When the switch is traversed by a flit, the flit uses another cycle to traverse the channel and reach its downstream router.

Figure 2.5: (a) A 5-stage router pipeline and timing for different flits in a packet. (b) A 4-stage router pipeline with look-ahead routing. (c) A 4-stage router pipeline with speculative SA. (d) A 3-stage router pipeline with both look-ahead routing and speculative SA.

Each pipeline stage requires one cycle to perform. Therefore, a flit needs five cycles to traverse the router. Different techniques have been proposed to reduce the number of stages in order to increase throughput and decrease latency. Figure 2.5(b) shows the reduced

pipeline of the generic router using look-ahead routing. RC is done at the previous router, and the result is written into the flit and transmitted to the current router, so that no RC is needed to switch the flit to the next router, reducing the number of stages from 5 to 4 [46, 47]. Figure 2.5(c) shows the reduced pipeline of the generic router using speculation. Speculative SA is performed in order to finish both VA and SA in the same cycle, also reducing the number of stages from 5 to 4 [48]. Figure 2.5(d) shows the further reduced pipeline of the generic router with both look-ahead routing and speculation [3], which reduces the number of stages from 4 to 3. Eliminating pipeline stages results in reduced delay for packets at each router, which in turn decreases the average latency. Because packets spend less time traveling in the network, the now-unoccupied cycles enable more packets to be transmitted, increasing the throughput of the network.

2.5 Flow Control

Flow control is the key mechanism for efficiently allocating network resources, such as buffers and channel bandwidth, to the packets traversing the network. Different flow control methods can extract vastly different throughput from the same physical network: a good flow control method makes full use of resources and arranges packets in the most productive way, while a poor one wastes resources by allocating them to idle packets and blocking active packets from progressing, thus causing more conflicts [3]. There are two types of flow control methods: bufferless and buffered. Bufferless networks are typically simpler and more power-efficient than buffered networks, since all of the buffers and associated control units are eliminated. However, the lack of buffering means that packets are either misrouted or dropped in case of conflicts. As a result, bufferless networks cannot outperform buffered networks in terms of saturation throughput [3].

Figure 2.6: An illustration of credit-based flow control. After sending flit 0, router 1 has to halt the sending of flit 1 and wait for a credit from router 2. When flit 0 departs from router 2, a credit is sent back from router 2 to router 1. The sending of flit 1 from router 1 to router 2 resumes after router 1 receives the credit [3].

On the other hand, buffered networks utilize buffers to store packets when conflicts occur. Although it is possible for buffers to consume a significant portion of the total network power, buffers can largely improve performance. In order to keep track of the downstream buffer utilization in each router, credit-based flow control is commonly implemented as the method of buffer management. When credit-based flow control is used, the number of available buffer slots in each downstream VC is maintained in the upstream router as credits. Figure 2.6 illustrates a scenario where each router has one credit. When flit 0 departs from router 1, the flit will consume a buffer slot in router 2. Therefore, the number

of credits in router 1 is decremented by one. Flit 1 cannot be transmitted immediately, as the number of credits is zero. After flit 0 leaves router 2, a credit is returned to router 1 to increment the number of credits by one, and flit 1 can then be transmitted to router 2. Credit-based flow control ensures that all transmitted flits will be successfully buffered [3].

2.6 Routing Algorithm

Different routing algorithms can significantly affect the overall performance of the network by changing the way RC computes the next output port. In general, deterministic routing algorithms are simpler to implement, and they can provide good performance with random traffic patterns. In contrast, adaptive routing algorithms need more control units to gather information in order to make routing decisions that improve the utilization of less-used resources. Good adaptive routing algorithms can provide performance comparable to deterministic routing algorithms under random traffic patterns. For permutation traffic patterns, where each node transmits data to a fixed destination node, adaptive routing algorithms can outperform almost all deterministic routing algorithms [49]. This is explained by the fact that adaptive routing algorithms provide changeable paths for fixed source/destination pairs to balance the load, so congestion is less likely to appear.

The most common deterministic routing algorithm is dimension-order routing (DOR) [3]. Figure 2.7(a) shows the path from (0, 0) to (2, 2) using DOR. Packets traverse the x-dimension first, and then traverse the y-dimension. As packets cannot turn from the y-dimension to the x-dimension, forming a cyclic dependency is impossible; therefore DOR is deadlock-free. The algorithm can be expressed by the following pseudo-code, where (X_cur, Y_cur) represents the current node and (X_dest, Y_dest) represents the destination node [50]:

    if (X_cur != X_dest)
        if (X_cur > X_dest) go to x-
        else if (X_cur < X_dest) go to x+
    else if (Y_cur != Y_dest)
        if (Y_cur > Y_dest) go to y-
        else if (Y_cur < Y_dest) go to y+
    else
        go to Processing Element

Figure 2.7: (a) The path from source node (0, 0) to destination node (2, 2) using the dimension-order routing algorithm. (b) One possible path from source node (0, 0) to destination node (2, 2) using the west-first adaptive routing algorithm.
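The DOR rule can be transcribed directly into a runnable function; this is a hedged sketch (function and port names are my own; the thesis describes hardware, not software).

```python
# A Python transcription of the dimension-order routing pseudo-code.

def dor_route(x_cur, y_cur, x_dest, y_dest):
    """Return the output port chosen by dimension-order routing."""
    if x_cur != x_dest:
        return "x-" if x_cur > x_dest else "x+"
    if y_cur != y_dest:
        return "y-" if y_cur > y_dest else "y+"
    return "PE"  # arrived: eject to the processing element

# Tracing the path of Figure 2.7(a), from (0, 0) to (2, 2): the packet
# exhausts the x-dimension before turning into the y-dimension.
x, y = 0, 0
path = []
while True:
    port = dor_route(x, y, 2, 2)
    path.append(port)
    if port == "PE":
        break
    if port == "x+": x += 1
    elif port == "x-": x -= 1
    elif port == "y+": y += 1
    else: y -= 1
print(path)  # ['x+', 'x+', 'y+', 'y+', 'PE']
```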

Adaptive routing algorithms allow packets to make any kind of turn, which can result in a cyclic dependency and create deadlocks [3]. West-first adaptive routing (WF) overcomes the deadlock problem by prohibiting turns from the y-dimension into x-, so no cyclic dependency can be formed [50]. Figure 2.7(b) shows one possible path from (0, 0) to (2, 2). A packet faces two situations under this routing algorithm. When the destination node is on the left-hand (west) side of the source node, boundary included, the packet traverses the x-dimension first, as if DOR were used. Only when the destination node is on the right-hand (east) side of the source node, boundary excluded, can the packet turn freely along the way from the source. The algorithm can be expressed by the following pseudo-code [50]:

    if (X_cur > X_dest)
        go to x-
    else if (X_cur = X_dest)
        if (Y_cur > Y_dest) go to y-
        else if (Y_cur < Y_dest) go to y+
        else go to Processing Element
    else if (X_cur < X_dest)
        if (Y_cur < Y_dest) choose between y+ and x+
        else if (Y_cur > Y_dest) choose between y- and x+
        else go to x+
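The west-first rule can likewise be sketched in runnable form. Where the pseudo-code "chooses between" two productive ports, a selector function decides; the random default below is a stand-in for the load information a real router would use, and all names are my own.

```python
import random

# Runnable sketch of the west-first pseudo-code above.

def west_first_route(x_cur, y_cur, x_dest, y_dest, choose=random.choice):
    if x_cur > x_dest:
        return "x-"  # destination is to the west: route like DOR, no turns allowed
    if x_cur == x_dest:
        if y_cur > y_dest:
            return "y-"
        if y_cur < y_dest:
            return "y+"
        return "PE"  # arrived: eject to the processing element
    # Destination is strictly to the east: the packet may turn freely.
    if y_cur < y_dest:
        return choose(["y+", "x+"])
    if y_cur > y_dest:
        return choose(["y-", "x+"])
    return "x+"

# One step of Figure 2.7(b): east-bound from (0, 0) to (2, 2), with a
# deterministic selector so the example is reproducible.
print(west_first_route(0, 0, 2, 2, choose=lambda ports: ports[0]))  # y+
```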

3 DXbar: A Power-Efficient and High-Performance Crossbar Design

This chapter discusses the proposed DXbar design in detail. The first section describes the architecture of DXbar. The following three sections illustrate the implementation of DXbar routers, provide several scenarios demonstrating how DXbar routers work, and show how fairness between the two crossbars is maintained. The fifth section explains the fault tolerance feature of the modified DXbar design. The last section presents the dual-directional crossbar, which reduces the area overhead of DXbar.

3.1 Introduction of DXbar

The proposed DXbar is a high-performance and power-efficient dual-crossbar design targeting both performance enhancement and power reduction. It is well known that bufferless networks provide power efficiency and low latency at low network load, while buffered networks provide cost-efficient conflict handling at high network load [3, 34, 35]. In order to take advantage of both bufferless and buffered networks, the proposed design is a combination of a bufferless primary crossbar and a buffered secondary crossbar.

At low network load, almost all packets traverse only the primary crossbar and take minimal routes, experiencing delays similar to what they would experience in a bufferless network. Whenever there is a conflict, the losing packet is buffered in the secondary crossbar, and waits until its desired output port is free. As the majority of packets are not buffered, VCs are eliminated to reduce the complexity of buffer organization and control overhead. The buffered secondary crossbar enables blocked packets to be removed from the critical path of the bufferless primary crossbar, thus maintaining the continuous flow of packets and achieving low latency and low power. Moreover, packets which are not

buffered simply traverse a switch; therefore, the power of buffer reads/writes is saved. As packets do not take extra hops when conflicts arise, there are no extra link traversals when the network load keeps increasing, ensuring that the design is power-efficient at any network load.

Figure 3.1: The architecture of a DXbar router.

3.2 Architecture

Figure 3.1 illustrates the router architecture of DXbar, which differs from the generic router in Figure 2.4. First, there are two crossbars within each router: the primary crossbar has four input ports, the secondary crossbar has five input ports, and both have five output ports. The four input ports of the router are connected to both crossbars via input de-multiplexers (DEMUX), and the injection port of the PE is

connected to the last input port of the secondary crossbar. All five output ports of the two crossbars are connected via output multiplexers (MUX), which are then connected to the output ports of the router and the ejection port of the processing element (PE). Second, buffer slots are deployed in front of the first four input ports of the secondary crossbar. No buffer slots are deployed in front of the input ports of the primary crossbar or the last input port of the secondary crossbar. The buffer slots are connected serially, thus eliminating VCs and the corresponding virtual-channel allocator. Finally, the look-ahead signal from the upstream router is connected directly to the router, and then to the switch allocator. The switch allocator is modified to control the de-multiplexers, the crossbars, and the multiplexers to maintain the correct packet flow in both crossbars.

3.3 Router Implementation

The elimination of VCs removes the VA stage and simplifies the SA stage, so that SA and ST can be performed in the same cycle. By using the look-ahead signal, the routing information traverses a separate narrow link to the downstream router in the look-ahead link-traversal (LA LT) stage, and the look-ahead routing computation (LA RC) stage is performed in parallel with the LT stage. As a result, the reduced pipeline consists of two stages, as shown in Figure 3.2, where all the stages shown with a dark fill color are look-ahead stages. This pipeline is similar to the pipelines used in Flit-Bless [34] and SCARAB [35]. Credit-based flow control is adopted in this design.

The arbitration of the switch allocation is performed at the beginning of the cycle. The flits which arrive at the input ports of the router are called incoming flits, and flits which are in the buffer slots and the PE are called buffered flits. Before the arbitration starts, all of the contending flits are ranked based on their ages.
An older flit is ranked higher than a younger flit, but incoming flits always rank higher than buffered flits. For example, suppose there are two incoming flits A and B, and one buffered flit C. A is 24 cycles old, B is 20 cycles old, and

C is 22 cycles old. C is older than B, but is ranked lower than B, as an incoming flit ranks higher than a buffered flit. Therefore, the final rank is A, B, and then C.

Figure 3.2: The pipeline of a DXbar router.

Once the rank is calculated, the switch allocator tries to assign each flit its desired output port, provided the port has not been assigned to a higher-ranked flit. Using the same example, the desired output ports of A, B, and C are x+, x-, and x-, respectively. The switch allocator assigns x+ to A and x- to B. As x- has already been assigned to another flit, C loses the arbitration. As implementing an NoC with age-based arbitration might be difficult, the arbitration can instead be based on the number of hops the flits have traversed; in this way, only a few bits in the look-ahead signal are required to represent the hop count of a flit.

If an incoming flit wins the arbitration, that flit traverses the primary crossbar and goes to its designated output port without being buffered. Otherwise, it is sent to the secondary crossbar and buffered. A buffered flit, on the other hand, waits in the buffer slot or PE until it wins the arbitration. The de-multiplexers and multiplexers select the primary crossbar inputs/outputs by default. The switch allocator uses the arbitration result to set the two crossbars correctly, control all of the multiplexers to link each output port of the router to the correct crossbar, and control all of the de-multiplexers to direct incoming flits between the primary and the secondary crossbar.
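The ranking and port-assignment rules can be sketched in a few lines; this is an illustrative model with assumed names, not the thesis RTL.

```python
# Illustrative model of DXbar's age-based switch allocation. Incoming flits
# outrank buffered flits, and within each class older flits rank higher.

def allocate(flits):
    """flits: list of (name, age, desired_port, is_incoming) tuples.
    Returns {name: port} for the winners; losers go to the secondary crossbar."""
    # Sort key: incoming before buffered, then descending age. Python's sort
    # is stable, so flits with equal keys keep their original order.
    ranked = sorted(flits, key=lambda f: (not f[3], -f[1]))
    taken, grants = set(), {}
    for name, age, port, incoming in ranked:
        if port not in taken:
            taken.add(port)
            grants[name] = port
    return grants

# The example from the text: incoming A (24 cycles, wants x+), incoming B
# (20 cycles, wants x-), buffered C (22 cycles, wants x-). The rank is
# A, B, C, so C loses the x- port to B and stays buffered.
print(allocate([("A", 24, "x+", True),
                ("B", 20, "x-", True),
                ("C", 22, "x-", False)]))  # {'A': 'x+', 'B': 'x-'}
```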

Figure 3.3: An illustration of DXbar operation: (a) No conflicts appear; the network operates as a bufferless network. (b) A conflict appears; one flit is sent to the secondary crossbar and buffered. (c) The flit arriving after the buffered flit sees no delay and can proceed without being buffered, ensuring that no immediate back pressure occurs. The injection port can send a flit whenever the desired output port is not occupied. (d) Even though there is still a flit coming in through the same input port, the buffered flit can proceed when its desired output port is not occupied. The incoming flit can proceed simultaneously, encountering no delay.

3.4 Walkthrough Examples

Figure 3.3 shows several possible scenarios that illustrate how flits are switched in the DXbar design. Figure 3.3(a) shows the scenario in which there is no conflict among flits. Flits from the x+ and x- input ports want to go to the x- and x+ output ports, respectively. At the same time, flits from the y+ and y- input ports want to go to the y- and y+ output

ports, respectively. All four incoming flits are switched simultaneously, and the network operates like a bufferless network, because no flits are buffered and latency is minimized. This is the best-case scenario.

Figure 3.3(b) shows the scenario in which a conflict appears. A flit from the y- input port and another flit from the x+ input port compete for the x- output port. Assume the flit from x+ is older; then that flit is switched from the x+ input port to the x- output port through the primary crossbar. The flit from the y- input port loses the arbitration, and the y- input de-multiplexer receives a control signal from the switch allocator, routing the flit from the y- input port to the secondary crossbar and buffering it.

Figure 3.3(c) shows the scenario in which another flit arrives from the y- input port in the next cycle. Because the flit that came from the y- input port one cycle earlier is currently buffered in front of the secondary crossbar, the path from the y- input port to the primary crossbar is unoccupied. Therefore, the incoming flit in the current cycle can be switched to its desired output port; otherwise, it would need to wait in a buffer until the previous flit is switched. The figure also illustrates how the injection ports work: since the injection ports have the same level of priority as the buffers, a flit can be injected whenever the desired output port is not occupied.

Figure 3.3(d) shows the scenario in which no more flits from the input ports need the x- output port. The buffered flit can then traverse the secondary crossbar and be routed downstream. At the same time, a flit from the y- input port can traverse the primary crossbar to a different output port. This is not feasible in other designs, because there can be only one flit coming from one particular input port at any single time.
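The conflict and drain behavior of these scenarios can be reproduced with a toy two-crossbar cycle model; all names here are my own abstractions, not the thesis RTL.

```python
# A toy cycle model reproducing the walkthrough of Figures 3.3(b)-(d).

def router_cycle(incoming, buffers):
    """incoming: list of (flit, age, in_port, out_port) arriving this cycle.
    buffers: dict in_port -> queue of (flit, out_port) at the secondary crossbar.
    Returns the flits switched this cycle as (flit, crossbar, out_port)."""
    switched, taken = [], set()
    # Incoming flits outrank buffered ones; older incoming flits win first.
    for flit, age, in_p, out_p in sorted(incoming, key=lambda f: -f[1]):
        if out_p not in taken:
            taken.add(out_p)
            switched.append((flit, "primary", out_p))
        else:
            buffers.setdefault(in_p, []).append((flit, out_p))  # loser is buffered
    # A head-of-queue buffered flit departs via the secondary crossbar
    # once its desired output port is free.
    for in_p, queue in buffers.items():
        if queue and queue[0][1] not in taken:
            flit, out_p = queue.pop(0)
            taken.add(out_p)
            switched.append((flit, "secondary", out_p))
    return switched

bufs = {}
# Cycle 1, Figure 3.3(b): A (from x+, age 24) and B (from y-, age 20) both want x-.
print(router_cycle([("A", 24, "x+", "x-"), ("B", 20, "y-", "x-")], bufs))
# -> [('A', 'primary', 'x-')]; B is now buffered at the secondary crossbar
# Cycle 2, Figures 3.3(c)/(d): C arrives on y- wanting y+; the primary path
# from y- is free for C, and B departs through the secondary crossbar.
print(router_cycle([("C", 5, "y-", "y+")], bufs))
# -> [('C', 'primary', 'y+'), ('B', 'secondary', 'x-')]
```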

3.5 Fairness Maintenance

Because the arbitration is age-based, flits coming from the nodes on the edges of the mesh network have higher priority when they pass through the nodes in the center. As a result, the flits injected by center nodes are more likely to lose the arbitration and traverse the secondary crossbar. More importantly, this scenario could limit the injection of flits by center nodes, as injection ports and buffers have lower priority than input ports. A fairness issue thereby arises, since flits can be stalled in the injection ports or buffers for numerous cycles when the desired output ports are taken by other flits from the input ports.

In order to maintain fairness between the primary crossbar and the secondary crossbar, a fairness counter is maintained in each router to count the number of consecutive arbitration wins by flits from the primary crossbar. The counter operates only when there are flits waiting in the buffers or in the injection port, and it is reset every time a waiting flit wins. Once the count exceeds a threshold, the priority flips so that flits from the secondary crossbar have higher priority, and they are assigned output ports ahead of the flits from the input ports. Although the continuous flow of flits on the primary crossbar is interrupted, the nodes in the center are able to inject new flits even if the network is heavily saturated, and the buffered flits are guaranteed to leave within a finite number of cycles. Setting the threshold too small leads to excessive buffered flits and power wasted on additional buffer operations, while setting it too large does not help solve the fairness issue. After testing with different traffic patterns, the threshold is set to four, which brings the best overall performance.

3.6 Fault Tolerance

The duplication of the crossbar enables hardware-level fault tolerance. The router avoids being completely disabled, even in the case of a permanent crossbar failure.
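The fairness counter of Section 3.5 can be sketched as a small state machine; the class and method names are my own, and the threshold of four is the value reported in the text.

```python
# Sketch of the per-router fairness counter from Section 3.5.

class FairnessCounter:
    def __init__(self, threshold=4):
        self.threshold = threshold
        self.count = 0  # consecutive wins by flits on the primary crossbar

    def primary_has_priority(self, waiting_flits):
        """Called each arbitration cycle; False means buffered/injection flits
        must be granted output ports ahead of the primary crossbar's flits."""
        if not waiting_flits:
            self.count = 0  # the counter only runs while flits are waiting
            return True
        return self.count < self.threshold

    def record_winner(self, primary_won):
        if primary_won:
            self.count += 1
        else:
            self.count = 0  # reset whenever a waiting flit wins

fc = FairnessCounter()
for cycle in range(5):
    primary = fc.primary_has_priority(waiting_flits=True)
    print(cycle, primary)  # cycles 0-3 grant the primary crossbar...
    fc.record_winner(primary_won=primary)
# ...and on cycle 4 the priority flips to the waiting flits, resetting the counter.
```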
Figure 3.4: The architecture of a DXbar router with fault tolerance capability.

Figure 3.4 depicts the revised router architecture of DXbar. When a crossbar failure is detected, the switch allocator signals all of the input de-multiplexers to direct incoming flits to the buffer slots. The router then functions similarly to the routers of a buffered network. Because either the primary or the secondary crossbar may fail, there must be a mechanism to direct the buffered flits to the working crossbar. Therefore, a set of 2×2 crossbars, controlled by signals from the switch allocator, is deployed between the buffer slots and the crossbars. Each of these small crossbars is controlled by a 1-bit signal, so the required modification to the switch allocator is minimal. As a result, flits can traverse the router through the buffer slots and the working crossbar, even if the other crossbar does not function.

Figure 3.5: (a) The structure of a dual-directional crossbar. (b) An example showing how a signal can enter from either direction. Tri-state buffers shown in grey are turned off.

Note that fault detection is not implemented in this work. In order to determine the performance degradation of DXbar when running with crossbar faults, the fault detection mechanism is assumed to be available and to work without affecting performance.

3.7 Dual-Directional Crossbar

The original DXbar design requires two crossbars, so the crossbar area is doubled compared to a generic router. Although the area overhead would become smaller with future technology scaling, modifications should still be made to minimize the negative impact on crossbar area. A better solution is to change the structure of the crossbar so that the dual-crossbar can be replaced by a single modified crossbar. A segmented crossbar was proposed in [29], which deploys tri-state buffers along the input lines of the crossbar; turning the tri-state buffers off divides the input lines into several isolated segments. Using a similar technique, we design a dual-directional crossbar that enables data to flow in from both ends of the input lines, so that two flits from the same input port can go to different output ports in the same cycle. The proposed dual-directional crossbar is able to replace the dual-crossbar and provide the same functionality.
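As a rough illustration of why a dual-directional crossbar can stand in for the dual-crossbar, the following toy model (entirely my own abstraction, not the thesis hardware) lets each input line carry up to two flits per cycle, one entering from each end, provided their output ports differ.

```python
# A toy behavioral model of the dual-directional crossbar idea.

def dual_directional_switch(requests):
    """requests: list of (in_port, out_port, flit). Each input port may be
    used at most twice per cycle (once per direction); each output port once."""
    used_outputs, uses_per_input, delivered = set(), {}, {}
    for in_p, out_p, flit in requests:
        if out_p in used_outputs or uses_per_input.get(in_p, 0) >= 2:
            continue  # conflict: in the full design the flit waits in a buffer
        used_outputs.add(out_p)
        uses_per_input[in_p] = uses_per_input.get(in_p, 0) + 1
        delivered[out_p] = flit
    return delivered

# An incoming flit and a buffered flit from the same input port are switched
# to two different output ports in the same cycle, as in Figure 3.3(d).
print(dual_directional_switch([("y-", "x+", "incoming"),
                               ("y-", "x-", "buffered")]))
# -> {'x+': 'incoming', 'x-': 'buffered'}
```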