Design of High-Radix Clos Network-on-Chip



2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip

Yu-Hsiang Kao, Najla Alfaraj, Ming Yang, and H. Jonathan Chao
Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY, USA

Abstract

Many high-radix Network-on-Chip (NOC) topologies have been proposed to improve network performance with an ever-growing number of processing elements (PEs) on a chip. We believe Clos Network-on-Chip (CNOC) is the most promising with its low average hop counts and good load-balancing characteristics. In this paper, we propose (1) a high-radix router architecture with Virtual Output Queue (VOQ) buffer structure and Packet Mode Dual Round-Robin Matching (PDRRM) scheduling algorithm to achieve high speed and high throughput in CNOC, (2) a heuristic floor-planning algorithm to minimize the power consumption caused by the long wires. Experimental results show that the throughput of a 64-node 3-stage CNOC under uniform traffic increases from 62% to 78% by replacing the baseline routers with PDRRM VOQ routers. We also compared CNOC with other NOC topologies, and found that using the new design techniques, CNOC has the highest throughput, lowest zero-load latency, and best power efficiency.

Keywords: Network on Chip; Chip Multiprocessor; Clos network; High radix NOC

I. INTRODUCTION

Chip multiprocessors (CMP) have become favored over traditional superscalar processors for efficiently exploiting single-chip computational potential. One major factor motivating CMP development is that computation speed can be increased with only a modest increase in power. CMP systems consist of regular processing elements (PE) and memory modules. In a CMP system, the chip area is normally divided into a number of tiles, each containing a PE or a memory module. Tiles are interconnected through a Network-on-Chip (NOC), which has a great influence on CMP performance.
The performance of an NOC is determined by many factors, including topologies, routing algorithms, and flow control mechanisms. Topology determines the capacity of an NOC. The capacity is defined as the best possible throughput, assuming perfect routing and flow control, that could be achieved by the network under the given traffic pattern [20]. After the topology is decided, the routing algorithm and flow control are developed to arrange the management of physical resources in an efficient way to approach the performance bound. Many NOC topologies proposed to date consist of regular interconnection structures and low-radix routers, such as 2-D Mesh and 2-D Torus for ease of implementation. However, as more tiles or network nodes are put on the same chip, latency is increased due to the rapidly growing queueing delay in each router along with a large network diameter. Also, due to the nature of the network configuration, the capacity of a low-radix network is usually low, causing low throughput and low power efficiency. A high-radix Clos network [1] provides much better scalability than a low-radix network in terms of zero-load latency and throughput. A 3- or 5-stage Clos network can easily accommodate several hundred nodes with a reasonable router radix. The number of hops a packet traverses in the Clos network is limited to three or five. Thus, a high-radix Clos Network-on-Chip (CNOC) provides smaller zero-load latency as compared to a low-radix NOC. Another advantage of CNOC is its good load-balancing nature, from the multiple paths available between any pair of PEs. One major concern for the Clos network is its large number of long interconnects, which may lead to an increased routing area and power dissipation. With Routing over logic layout style [2], the area overhead caused by the long wires can be eliminated. 
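To put the hop-count argument above in numbers, the following sketch brute-forces the average distance in a 64-node (8×8) 2-D Mesh under uniform traffic. It is an illustration only: the function name is hypothetical, minimal routing (for example XY) is assumed, source and destination are drawn independently (so a node may address itself), and counting router traversals as link hops plus one is our assumption for comparing against the three router hops of a 3-stage CNOC.

```python
from itertools import product


def mesh_avg_link_hops(k: int) -> float:
    """Average Manhattan distance between two independently chosen
    nodes of a k x k mesh (uniform traffic, minimal routing assumed)."""
    nodes = list(product(range(k), repeat=2))
    total = sum(
        abs(sx - dx) + abs(sy - dy)
        for (sx, sy) in nodes
        for (dx, dy) in nodes
    )
    return total / (len(nodes) ** 2)


link_hops = mesh_avg_link_hops(8)   # 64-node 2-D Mesh
router_hops = link_hops + 1         # assumption: source router + one router per link hop
print(link_hops, router_hops)       # 5.25 6.25
```

Under these assumptions a packet in the 64-node Mesh visits about six routers on average, against exactly three in the 3-stage CNOC.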
One method to overcome the energy problem caused by the long interconnects is to minimize the average power consumption under uniform traffic by carefully designing the floor plan of an NOC. In other words, given an arbitrary NOC topology, placing the routers properly can significantly reduce the average power consumed on the long wires under uniform traffic.

In this paper, we propose: (1) the design of the high-radix CNOC with Virtual Output Queue (VOQ) buffer structure and Packet Mode Dual Round-Robin Matching (PDRRM) scheduling algorithm in the routers to achieve high speed and high throughput, (2) a heuristic floor-planning algorithm to minimize the power consumption caused by the long wires. Based on the design, we describe a study comparing a 64-node CNOC composed of radix-8 routers to other topologies with the same router radix upper bound, including 2-D Mesh, Fat Tree [3], and Concentrated MeshX2 [4]. We show that with the PDRRM VOQ router, CNOC has the smallest zero-load latency, highest throughput, and best power efficiency compared to other high-radix topologies and the 2-D Mesh network.

II. RELATED WORK

Low-radix NOCs, such as the 2-D Mesh or Torus network, have the advantage of modest design complexity with a regular interconnection structure and short wires. However, these networks suffer several disadvantages, including large network diameter and energy inefficiencies due to higher hop counts. To overcome this issue, several high-radix NOC topologies have been proposed during the past few years. Balfour and Dally [4] proposed Concentrated Mesh (CMesh) with express channels, which adapts the 2-D Mesh by allowing several tiles to connect to the same router and by adding express channels on the edge of the Mesh. Kim et al. [5] proposed Flattened Butterfly, which has a regular floor plan like CMesh, with an even higher router radix to achieve lower zero-load latency and better throughput. Some studies have been done to evaluate Fat Tree [6], which could be configured as a high-radix network as well.

Clos network topology has been considered for NOC design in the SPIN [7] and Reduced Unidirectional Fat Tree (RUFT) [8] networks. In the SPIN network, the local traffic between PEs connected to the same router does not need to travel more than one hop. In the 3-stage CNOC, by contrast, every packet needs to travel exactly three hops to reach its destination port. [8] showed that by replacing Fat Tree with RUFT, which does not allow local traffic, hardware complexity and power consumption are significantly reduced. However, [8] did not address the high-radix Clos network configuration and implementation issues. Neither [7] nor [8] provided latency and power comparisons between the Clos network and other NOC topologies.

III. CNOC ARCHITECTURE

Figure 1 shows the configuration of the 64-node, 3-stage CNOC that we adopted in this paper. The switch modules (SMs) in the first, second, and third stages are denoted as input modules (IMs), center modules (CMs), and output modules (OMs), respectively. In this configuration, there are 24 routers, 8 for each stage, with the network capacity at 100% under uniform traffic. Note that all the links shown in the figure are uni-directional, rather than bi-directional. Hence, in CNOC every flit has to travel exactly three hops from source to destination. In the Fat Tree network, the localized traffic does not need to travel all the way to the root router, because the links between routers and PEs are bi-directional. Figure 2 shows the configuration of a 64-node Fat Tree, in which there are 32 radix-8 routers and 16 radix-4 routers. The two localized traffic streams are illustrated by red lines.
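The router counts quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes a 3-stage Clos in which every switch module has the same number of input and output ports and each IM has one link to every CM (no over-subscription). The Fat Tree total is simply taken from the figures stated in the text.

```python
def clos3_router_count(num_nodes: int, ports: int) -> dict:
    """Routers per stage of a 3-stage Clos built from ports x ports
    switch modules (assumed: one link from each IM to each CM)."""
    if num_nodes % ports != 0:
        raise ValueError("num_nodes must be a multiple of ports")
    ims = num_nodes // ports          # each IM serves `ports` PEs
    if ims > ports:
        raise ValueError("each CM needs one input port per IM")
    cms = ports                       # one CM per IM output port
    oms = num_nodes // ports          # each OM serves `ports` PEs
    return {"IM": ims, "CM": cms, "OM": oms}


cnoc = clos3_router_count(num_nodes=64, ports=8)
cnoc_total = sum(cnoc.values())
fat_tree_total = 32 + 16              # radix-8 plus radix-4 routers, as stated in the text
print(cnoc, cnoc_total, fat_tree_total)
```

For 64 nodes this gives 8 routers per stage and 24 in total for the CNOC, half of the 48 routers in the 64-node Fat Tree.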
Compared to the 3-stage Clos network configuration, the Fat Tree network requires more routers and has a higher wiring complexity. As a result, Fat Tree has much lower power efficiency than CNOC, as will be shown in section VI.

Figure 1. Configuration of a 64-node 3-stage CNOC. All links are unidirectional, and packets can only travel upward. All routers are of radix-8. (Router labels in the figure: OMs c1 to c8, CMs b1 to b8, IMs a1 to a8.)

Figure 2. Configuration of a 64-node Fat Tree. All links are bi-directional. Localized traffic is shown as the red lines.

In CNOC, each router is implemented as an input-queued crossbar switch. CNOC applies wormhole switching, in which packets are divided into a number of fixed-length flits. A credit-based flow control mechanism is also applied: on each link, the upstream entity keeps track of the available buffer space in the downstream entity to prevent buffer overflow. The architecture of the SM can be the same as the baseline Virtual Channel (VC) router described in [14], while in this paper we propose a VOQ-based NOC router to increase throughput and reduce average latency.

To route a packet in CNOC once the destination address is decided, there are multiple choices of paths, each corresponding to a CM. In this paper, we forward the packets in a round-robin manner for ease of implementation and for its good load-balancing feature.

IV. PDRRM VOQ ROUTER DESIGN

The input buffer structure and switch allocation scheme are two main factors that affect NOC router performance. To resolve the well-known Head of Line (HOL) blocking problem caused by the single FIFO structure in input-queued routers, [13] first proposed the concept of VC, which improves performance by decoupling the physical channels from the buffer resource.
Today, the VC structure has become the dominant input buffer structure in NOC for its simplicity and efficiency in improving the throughput and avoiding deadlock. The choice of Switch Allocation (SA) scheme in a VC router is the other factor that greatly affects the router throughput. Different allocation schemes, such as the input-first/output-first separable allocator and the wavefront allocator, have been proposed to increase the matching quality in a VC router. However, without any speedup in the switch or multiple iterations in SA, none of these methods can give satisfying throughput with the conventional single crossbar, input-queued VC router design. According to our simulation results, an 8×8 router with 8 VCs achieves only 65% throughput with an input-first separable allocator and round-robin arbiters.

Sophisticated designs that aim at high-throughput routers have been proposed in the past few years [15][14][16]. These methods require many more hardware resources than the canonical VC-based, input-queued, single crossbar router design with separable allocators. In a high-radix router the critical path usually resides in the allocator, and the input-first/output-first separable allocators provide a shorter critical path than other allocation schemes. This is why [15] chose input-first separable allocation as their allocation scheme. In this paper, we try to find a solution for the high-radix, high-throughput router design that maintains the simplicity of the input-first separable allocation scheme for the configuration of a Clos network.

A. VOQ Buffer Structure

The concept of VOQ has been applied to lossless networks to alleviate the HOL blocking problem. In the NOC environment with wormhole switching, the VOQ structure can be regarded as a special case of the VC structure. In a VC router, an input VC can store packets that target any output port. In a VOQ router, each VC has an associated output port, and can only store packets destined to that output port. So N VOQs are required by each input port in an N×N switch.

A phenomenon, which we term the tail of line (TOL) blocking effect, hinders the performance of VOQ routers if we connect the VOQ routers to form a multi-stage network. This problem appears with packet lengths larger than one flit. NOC routers apply wormhole routing, which forces flits of the same packet to stay concatenated as a worm in any logical queue of the network. In a VOQ router, from an output port's perspective, there is one VOQ in each input port that targets this output port. If we have an N×N switch, for each output port there are N corresponding VOQs, one in each input port, that could potentially make a request to send a flit to this output port. In other words, there could be at most N packets targeting the same output port at the same time. If there are two packets destined for the same output port, and both of them aim at the same VOQ in the downstream input port, the flits of the two packets cannot interleave on the output physical channel of the current router. Therefore, one of the packets has to wait until the other packet finishes transmitting before it can send out its first flit. We term this phenomenon TOL blocking because the winning packet's tail flit blocks other packets with the same next-hop VOQ.

TOL blocking reduces the throughput of a multi-stage network for two reasons. First, the packets which are blocked by the tail flits of other packets will cause HOL blocking to the successive packets in their logical queues. Second, if a packet is broken up for some reason, such as back pressure caused by credit-based flow control, and is spread over several routers, those packets which are blocked by the broken packet's tail flit cannot utilize the unused physical channels.

Figure 3 shows an example of TOL blocking. There are two input/output ports in each router, and two VOQs in each input port.
If there are two packets in the upstream router, both in VOQ2 but in different input ports, targeting VOQ1 in downstream router 2, one of them will block the other until the tail flit of the winning packet enters the downstream router. In this example packet2 has to wait until the tail flit of packet1 enters the downstream router. If the routers employ round-robin arbiters to form separable allocators, packet3 and packet1 will receive grants alternately. When packet3 receives grants, flits of packet2 cannot take advantage of the idle output physical link since packet2 is blocked by the tail flit of packet1.

Figure 3. TOL blocking example.

B. PDRRM Algorithm

To alleviate the TOL blocking involved in a multi-stage network where VOQ routers are used, we propose a scheduling algorithm called PDRRM. Besides alleviating TOL blocking, we want PDRRM to sustain high-speed operation and to provide good matching quality with one iteration of arbitration in each clock cycle. To meet these requirements, we design the algorithm based on Dual Round-Robin Matching (DRRM) [17], which is similar to the input-first separable allocation scheme in VC routers and shares its shorter critical path. We also take the concept of SA+VA from [14] to reduce the number of pipeline stages of the router and to eliminate the need for a separate Virtual channel Allocation (VA) stage. In the SA+VA scheme, VA and SA are done in the same clock cycle, but VCs are assigned non-speculatively after SA. VA simply involves finding a free VC for each output port from a pool of free VCs every clock cycle and assigning it to the flits that win SA.
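As background for the scheduler, one iteration of plain (flit-mode) DRRM can be sketched as below. This is a minimal illustration of DRRM as it is commonly described, not the authors' implementation and not PDRRM itself: the function and variable names are hypothetical, credits and flit types are ignored, and the rule that an ungranted input pointer stays at the requested VOQ is our reading of the usual DRRM description.

```python
from typing import List, Optional, Tuple


def drrm_iteration(
    voq_nonempty: List[List[bool]],
    in_ptr: List[int],
    out_ptr: List[int],
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """One request/grant iteration of flit-mode DRRM on an N x N switch.

    voq_nonempty[i][j] is True if input i holds a flit for output j.
    in_ptr[i] / out_ptr[j] are the round-robin pointers of the input
    and output arbiters. Returns (matches, new_in_ptr, new_out_ptr).
    """
    n = len(voq_nonempty)

    # Step 1 (request): each input arbiter picks the first nonempty VOQ
    # in round-robin order, starting from its pointer.
    requests: List[Optional[int]] = [None] * n
    for i in range(n):
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if voq_nonempty[i][j]:
                requests[i] = j
                break

    new_in, new_out = list(in_ptr), list(out_ptr)
    matches: List[Tuple[int, int]] = []
    granted = set()

    # Step 2 (grant): each output arbiter grants the requesting input
    # that comes first in round-robin order from its pointer.
    for j in range(n):
        requesters = [i for i in range(n) if requests[i] == j]
        if not requesters:
            continue  # no request: output pointer unchanged
        winner = min(requesters, key=lambda i: (i - out_ptr[j]) % n)
        matches.append((winner, j))
        granted.add(winner)
        new_out[j] = (winner + 1) % n   # one beyond the granted input
        new_in[winner] = (j + 1) % n    # one beyond the granted output

    # Ungranted inputs keep pointing at the VOQ they requested.
    for i in range(n):
        if requests[i] is not None and i not in granted:
            new_in[i] = requests[i]
    return matches, new_in, new_out


# Two inputs that both start by requesting output 0.
occupancy = [[True, True], [True, False]]
m1, ip, op = drrm_iteration(occupancy, [0, 0], [0, 0])
m2, ip, op = drrm_iteration(occupancy, ip, op)
print(m1, m2)  # [(0, 0)] [(1, 0), (0, 1)]
```

In this small example the two inputs collide in the first iteration and are matched to different outputs in the second, which illustrates how the pointers of competing arbiters drift apart.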
To alleviate the TOL blocking problem, we change the original DRRM algorithm into packet mode, meaning that the pointers of the input arbiters and output arbiters in a separable allocator are updated only after a packet has been transferred to the next hop, unless there is no buffer space in the input port of the downstream router or the flits in the winning VOQ drain out. In the description of the algorithm, the flit types are abbreviated as follows: Single flit (S), Head flit (H), Body flit (B), and Tail flit (T). A VOQ is called eligible if it contains a packet that is destined for an available downstream VOQ with free buffer space.

Step 1: Request. Each input arbiter sends an output request corresponding to the first eligible nonempty VOQ in a fixed round-robin order, starting from the current position of its pointer. The pointer of the input arbiter is advanced by one location beyond its current position if (1) its request is not granted in Step 2, or (2) the request is granted but this VOQ becomes empty after one flit is served, or (3) the granted flit is a T or S. Otherwise, the pointer remains unchanged.

Step 2: Grant. If an output arbiter receives one or more requests, it chooses the one that appears next in a fixed round-robin schedule starting from the current position of its pointer. The output notifies each requesting input whether or not its request was granted. The pointer of the output arbiter is incremented to one location beyond the granted input if an S or T is granted. The pointer remains at the granted input if an H or B is granted. The pointer remains unchanged if there is no request.

VOQ PDRRM routers achieve high performance in a multi-stage network for three reasons. First, VOQ eradicates the HOL blocking problem in an input-queued switch, whereas VC can only reduce the frequency with which HOL blocking occurs.
Second, by applying the packet mode switching concept the TOL blocking problem is effectively alleviated, since the scheduler tends not to interleave flits of different packets on the output physical links. Third, packet mode scheduling allows the existing matching between input/output pairs to last longer than in flit mode scheduling, and the unmatched input ports can gradually augment the matching pattern by sending requests to the available output ports. This is somewhat similar to the concept of multiple iterations employed in DRRM and iSLIP [18].

C. PDRRM VOQ Router Pipeline

Figure 4 shows our proposed VOQ router pipeline, which consists of 3 stages: (1) Switch+VOQ Allocation (SA+VA), Buffer Write (BW), (2) Buffer Read (BR), (3) Switch Traversal (ST). In CNOC, Routing Computation (RC) is done in the PEs and the RC information for each packet is attached to the head flit. Hence there is no RC or next-hop routing computation (NRC) in the pipeline stages.

Figure 4. The proposed PDRRM VOQ router pipeline.

D. Post-layout Simulation Result

Methodology: We designed a VOQ router with the following configuration: 8 input ports, 8 output ports, and a shared memory size of 64, 128, or 256 flits. The VHDL code of the router design is synthesized and analyzed by the Cadence Encounter RTL Compiler in STMicroelectronics 65 nm technology. Then we use SOC Encounter to do automatic place and route. Power evaluation is performed based on an activity factor of 10% for each sub-block. For memory read/write operations, we use the popular memory model CACTI 5.3 from HP Labs to characterize the memory delay and the area/power consumption.

TABLE I. 8X8 VOQ ROUTER PERFORMANCE
Columns: Sub-Block; Critical Path Delay (ps); Area (um^2); Power at 1 GHz
Rows: Input Buffer (64-flit); Input Buffer (128-flit); Input Buffer (256-flit); Allocator; Crossbar

Table I summarizes the simulation results (delay, area, and power) for the proposed 8×8 VOQ router. It can be seen that the input buffer and the allocator have comparable critical path delays and that the power consumption is mainly determined by these two sub-blocks. The simulation results show that the PDRRM router can operate at 1 GHz.

V. CNOC FLOOR PLAN

One common doubt about CNOC is the considerable number of long global interconnects on chip.
As process dimensions shrink, the delay and power consumption of global interconnects become more critical. There are mainly two orthogonal directions for dealing with the delay and power issues of long interconnects on CMPs. One direction is to reduce the delay and power consumption on long wires. Examples are inserting repeaters and buffers on long wires, using low-swing techniques, encoding the transmitted data to reduce hazardous signal transitions, or introducing new technologies such as optical and radio frequency signaling. The other direction is to carefully design the floor plan of the NOC so that the latency on the longest wire does not exceed a certain value, and the average power consumption under uniform traffic is minimized.

In this paper, we use conventional global wires with repeaters, and assume that on each link the latency for one flit is always one clock cycle. This is to illustrate the capability of CNOC in current technology. What we propose in this section is a heuristic floor-planning algorithm for CNOC that minimizes the average power consumption under uniform traffic, and at the same time limits the maximum wire length so that the latency on long wires does not exceed the target value.

A. Routing over resources or routing on dedicated channels

In [2] two layout styles for on-chip global interconnects are identified. The first is called routing over resources and the second is called routing on dedicated channels, as illustrated in Figure 5. In routing over resources, several metal layers, usually two, are reserved for interconnects between the routers and tiles. In this way, the area overhead of the NOC comes entirely from the area of the routers. In routing on dedicated channels, no specific metal layers are reserved, but spaces on the tile edges are reserved for NOC wires.
The space between two tiles is usually equal to the dimension of a router, because an intersection of tile edges usually holds one router. In this paper, we adopt the layout methodology of routing over logic, for its flexibility in global wiring. Also, we assume that only one router can be put in an intersection of tile edges.

Figure 5. Two different NOC layout styles.

B. Wiring density

In evaluating an NOC floor plan, besides its power characteristics, we also need to consider its feasibility in terms of wiring density. According to [9], wiring density is the maximum number of tile-to-tile wires routable across a tile edge. In the scenario of routing over logic, some parts of the tiles are not routable for global interconnects, such as the regions occupied by core logic and first-level caches of the PEs. In [9], it is assumed that 40-60% of the length of a tile edge is routable for global interconnects. In this paper, we assume the dimensions of the tiles are 1.5 mm by 1.5 mm. Hence, the total wire width on a tile edge cannot exceed 0.6 to 0.9 mm. We assume the worst case, so only 0.6 mm of a tile edge is routable for CNOC. Also, we assume the wire width to be 400 nm for the global interconnect. Hence, each tile edge can accommodate 1500 wires. In other words, each tile edge can accommodate 1500/W links for the NOC, where W is the bus width.
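The wiring-density arithmetic above can be reproduced as follows. This is a small worked example, not part of the original design flow: the constant and function names are hypothetical, the 400 nm figure is treated as the space consumed per wire (which is what the 1500-wire result implies), and the 128-bit and 64-bit bus widths are example values.

```python
TILE_EDGE_NM = 1_500_000      # 1.5 mm tile edge
ROUTABLE_PERCENT = 40         # worst case of the 40-60% range assumed in [9]
WIRE_PITCH_NM = 400           # assumed space taken by one global wire


def wires_per_tile_edge() -> int:
    """Maximum number of wires that can cross one tile edge."""
    routable_nm = TILE_EDGE_NM * ROUTABLE_PERCENT // 100   # 600,000 nm = 0.6 mm
    return routable_nm // WIRE_PITCH_NM


def links_per_tile_edge(bus_width_bits: int) -> int:
    """Whole W-bit links that fit across one tile edge (1500/W, rounded down)."""
    return wires_per_tile_edge() // bus_width_bits


print(wires_per_tile_edge())      # 1500
print(links_per_tile_edge(128))   # 11
print(links_per_tile_edge(64))    # 23
```

With 128-bit links, for instance, at most 11 links may cross any single tile edge, which is the kind of limit a candidate floor plan has to be checked against.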

C. Problem Definition

We are given a set of identical tiles, a set of routers (indexed 1, 2, 3), and a set of links (indexed 1, 2, 3) that connect the tiles and routers to form a 3-stage Clos network. The positions of the tiles are fixed, while the positions of the routers are variable. A router can be located in any intersection of the tile edges. Once an intersection has been taken, it is not available to other routers.

Before describing the objective function, we first define, for each link, its Manhattan distance as the Manhattan distance between the two nodes that the link connects, and its energy as the energy that a flit consumes when it travels through a link of that Manhattan distance. The Manhattan distance between a router and a tile is defined as the Manhattan distance between the two closest points of the tile and router.

The floor-planning problem is to determine the location of each router so that the objective function Φ is minimized, the maximum wire length stays within the bound needed to achieve the operating frequency without pipelining, and at the same time the wiring density constraint is met. The physical meaning of Φ is the sum of the energy consumed by all links to transfer one flit each, given a placement of the routers. In CNOC, under uniform traffic load, every link has the same probability of transferring a flit if round-robin or random routing algorithms are applied. Hence Φ can be regarded as an indicator of the total power consumption of CNOC under uniform traffic.

Figure 6 shows an example of placing two routers on a chip with 3×2 tiles. The Manhattan distance between router A and B is 2. The Manhattan distance between router A and tile k is 3, calculated from A to the upper left corner of tile k.

Figure 6. Tiles with two routers. The Manhattan distance between router A and tile C is defined as 3.

D. Complexity of the problem

The solution space of this objective function is large. To the best of our knowledge only Ye and Micheli [10] have tried to tackle this problem in the tiled CMP NOC context.
They solve an objective function similar to Φ and obtain a real-valued solution. They then map the real-valued positions into integer coordinates, which give the final floor plan, by sorting the real-valued solution. However, in [10] the wiring density was not considered, which may cause the only solution to be infeasible. Also, the objective function considers only the total wire length, instead of the total power consumption on the wires.

For a 3-stage CNOC with 24 routers on a 64-tile CMP, there are C(49, 24) possible floor plans for comparison if the exhaustive search algorithm is applied. C(49, 24) is about 6 × 10^13, which requires a lot of computational power for comparison. What we propose in this paper is a heuristic method that reduces the solution space to C(N^2/4, 3N/4), which is C(16, 6) = 8008 in a 64-tile CMP, so that the solution can be easily obtained from a few iterations of exhaustive search.

E. Algorithm

As stated in [10], a good floor-planning algorithm should meet the hierarchy constraint. The hierarchy constraint requires clusters of the tiles and routers in the network topology to stay together in the layout. If we apply hierarchy constraints to CNOC, an intuitive idea for the floor planning is to gather together the PEs or memory modules that are connected to the same IM/OM. In other words, the tiles of the CMP should be grouped spatially, and each group of tiles should be connected by the same IM and OM. Following the hierarchy philosophy, we propose a three-step heuristic floor-planning algorithm. The three steps are tile mapping, router placement, and exhaustive search.

1) Tile mapping: We first divide the chip into four quadrants. Each quadrant contains N/4 groups of tiles, and each group contains N tiles, forming a rectangle, which is associated with a specific IM and OM. Figure 7 illustrates how we divide the 64-tile chip. In the first quadrant, there are 16 tiles.
The tiles marked with 1 are associated with IM1 and OM1, and tiles marked with 2 are associated with IM2 and OM2. For the fourth quadrant, we place the group 3 and group 4 tiles in a similar fashion, and we rotate the first quadrant clockwise by 90 degrees for symmetry. The remaining quadrants follow the same logic.

2) Router placement: Given the tile mapping from step 1, the IMs and OMs that are associated with tiles in the first quadrant, and N/4 other arbitrary CMs, must be located in the first quadrant. The same logic applies to the other quadrants. In other words, each quadrant contains N/4 arbitrary CMs, since every CM has to connect to all the IMs and OMs, and the IMs and OMs are evenly distributed among the four quadrants. Therefore, in each quadrant there are 3N/4 routers: N/4 IMs, N/4 OMs, and N/4 CMs. We further require that the placement of routers in each quadrant be symmetric. For example, in Figure 7 routers o1, o3, o5, and o7 are in symmetric positions in different quadrants.

3) Exhaustive search: The remaining part of the algorithm is straightforward. For each quadrant, there are N^2/4 available positions for the 3N/4 routers. Here, we exclude the periphery of the chip. We examine each possible floor plan for its Φ value and for its feasibility in terms of wiring density and maximum wire length. Finally, we obtain the preferred floor plan by sorting the Φ values.

F. Floor plan for the 64-node CNOC

Figure 7 shows the result of our floor-planning algorithm based on the wire characteristics listed in Table III, under the constraints that the longest wire length is 7 hops, 40% of the length of a tile edge is routable, any link can contain up to 128 bits, and all the wires can be operated at 1 GHz. In this floor plan, all the CMs are located around the central point of the chip, surrounded by the IMs and OMs. In section VI, we evaluate the power performance of CNOC based on this floor plan. Note that there could be multiple solutions to this problem. For example, in Figure 7 routers o1, o3, o5, and o7 may be moved to the central points of their own quadrants without affecting the power performance and while still meeting the wiring density constraint.

Figure 7. Floor plan of the 64-node 3-stage CNOC.

VI. EXPERIMENTAL RESULTS

In this section, we compare the performance of the VOQ PDRRM router with the VC router, and compare the CNOC with other high-radix NOC topologies.

A. VOQ PDRRM Single Router Performance

Methodology: We used our in-house cycle-accurate simulator to evaluate the delay performance of a single router with different schemes. In the baseline VC router, there are eight VCs in an input port. The memory in each input port is shared by all of its eight VCs to increase the buffer utilization, and is implemented as linked lists. The VC router follows the design in [14] by performing VA and SA in the same clock cycle. VCs are assigned non-speculatively after SA, which uses an input-first allocator, so that VA simply involves finding a free VC for each output port from a pool of free VCs every clock cycle and assigning it to the flits that win SA. For the VOQ router, we apply two different scheduling schemes, DRRM and PDRRM. In the VOQ routers each input port has eight VOQs and the input port memory is shared by the eight VOQs.

Figure 8 shows the delay performance results of the VOQ and VC routers with packet lengths equal to eight flits and input port memory size equal to 256 flits under uniform traffic. For the VC routers, we implemented two kinds of scheduling schemes, Parallel Iterative Matching (PIM) and DRRM.
Both PIM and DRRM are input-first, which means that an input port first chooses an input VC, either randomly (PIM) or according to round-robin order (DRRM), and then an output port chooses an input port to be granted, either randomly (PIM) or according to round-robin order (DRRM). An 8 8 single VC router with 1-iteration PIM can only achieve about 65% throughput. The DRRM has a similar throughput as the PIM scheme. In the VOQ schemes, we evaluated three kinds of scheduling algorithms: DRRM, PDRRM I, and PDRRM II. PDRRM I follows the algorithm described in section IV, and PDRRM II has some differences. In PDRRM II, the input pointers is stubborn, meaning that when a local winner in an input port loses arbitration in the s, the input pointer does not update until it finally gets granted. In this way the property of fairness in DRRM is well preserved. As illustrated in Figure 8., both DRRM and PDRRM II can achieve 100% throughput, while PDRRM I can achieve 81%. DRRM can achieve 100% throughput under uniform traffic because of the desynchronization effect [19]. PDRRM II preserves the stubbornness of DRRM in the s to maintain the fairness between different VOQs in the same input port, so it can also achieve 100% throughput under uniform traffic. However, from the simulation results we learned that PDRRM II has a much higher average latency in the high-load compared to PDRRM I, which is not tolerable in on-chip environment. Hence throughout this paper we adopt PDRRM I as the scheduling algorithm in CNOC. Though the throughput of PDRRM I cannot achieve 100%, we show that it can still maintain 78% throughput under multi-stage configuration, unlike DRRM dropping to 71%, which is prone to suffer from the TOL blocking problem. Figure 8. Latency vs. injection rate for different scheduling schemes in VC and VOQ routers, with packet length equal to eight. Figure 9. Latency vs. injection rate in CNOC with different router designs. B. VOQ CNOC VS. VC CNOC Figure 9. 
shows the delay performance for a 64-node 3-stage CNOC with different router designs, each with an input port memory size of 256 flits. With the VC router and the DRRM scheduling scheme, the throughput is 62%. As we expected, the throughput of VOQ with the DRRM scheduling scheme drops to 71% because of the TOL blocking issue. The PDRRM VOQ scheme achieves 78% throughput, the best of the three designs.

C. Latency Performance Comparison between CNOC and Other NOC Topologies

In this sub-section, we compare the performance of CNOC with other NOC topologies, including the conventional 2-D Mesh and two other high-radix networks: GFT(2,4,4) (Generalized Fat Tree) [3] and CMeshX2 [4]. We chose GFT(2,4,4) and CMeshX2 because both topologies consist of routers with at most 8 input/output ports.

Generalized Fat Tree: The GFT configuration is described in [3]. GFT(p,q,r) means that there are p stages of connections between different stages of the routers. Each router has q upward connections and r downward connections. Figure 2 shows the configuration of a GFT(2,4,4), where there are 32 radix-8 routers and 16 radix-4 routers. All the links in the Fat Tree are bi-directional, which means that a packet only needs to travel upward to the lowest common ancestor router of its source and destination PE before travelling downward. The routing algorithm that we use in the Fat Tree is oblivious routing, in which a packet travels to a randomly chosen lowest common ancestor router, and then travels down to the destination along the only route.

Concentrated Mesh: In CMesh, every router is connected to four PEs, and the routers form a mesh as in the conventional 2D Mesh network. CMeshX2, which consists of two independent CMesh networks (32 radix-8 routers in a 64-node system), has better power efficiency and twice the bisection bandwidth compared to CMesh. We therefore selected CMeshX2 with XY routing as one of the high-radix topologies to evaluate. We also tried CMeshX2 with express channels, but the throughput changed little under uniform traffic with express-channel-prioritized XY routing. Therefore in this paper we only evaluate CMeshX2.

Figure 10. Latency vs. injection rate for different topologies with uniform traffic, packet length = 8.

Figure 10 illustrates the delay performance of VOQ CNOC with PDRRM and of the other three topologies under uniform traffic. The packet size equals eight. For the Fat Tree, we applied two different router designs, VC DRRM and VOQ PDRRM. For CNOC, we set the input port memory size to be larger than that of the other four schemes (64 flits), because CNOC has the fewest routers. Even with larger input port memories, the total area consumed by the routers in CNOC is still comparable to that of the other topologies. TABLE II summarizes the simulation results, where CNOC has the smallest zero-load latency and the highest throughput.

TABLE II. SUMMARY OF THE LATENCY PERFORMANCE COMPARISON
Topology | CNOC | 2D Mesh | Fat Tree | CMeshX2
Router design | PDRRM VOQ | DRRM VC | PDRRM VOQ, DRRM VC | DRRM VC
Total router number | 24 radix-8 | 64 radix | 32 radix-8, 16 radix-4 | 32 radix
Input port memory size | | | |
Throughput (%) | | | |
Zero-load latency (clock cycles) | | | |

D. Power Efficiency Comparison between CNOC and Other Topologies

Methodology: We follow the methodology in [11] to characterize the dynamic power consumption of different NOCs. We assume that each flit traversing a hop consumes $E_{hop} = E_{wrt} + E_{rd} + E_{xb} + E_{link} + E_{arb}$, where $E_{wrt}$ and $E_{rd}$ are the energy consumed by a buffer write and a buffer read, $E_{xb}$ and $E_{link}$ are the energy consumed when a flit traverses a crossbar or a link, and $E_{arb}$ is the energy consumed by the allocators. For the energy consumption on NOC links, we use Cadence Spectre to run simulations on wire models of various lengths, with parameters given by the Predictive Technology Model (PTM). For the energy consumption within NOC routers of different radices, we follow the same methodology described in Section IV.D. The energy consumed by one flit operation on different components at 1 GHz is shown in TABLE III and TABLE IV. In this paper, we assume that a flit contains 32 bits, and that the global links carry 4 more bits for the control signals.

TABLE III. EXPERIMENTAL RESULTS FOR DIFFERENT LENGTHS OF LINKS
Wire Length (mm) | Delay (ns) | Flit Energy (pJ)

TABLE IV. EXPERIMENTAL RESULTS OF ENERGY FOR ROUTERS OF DIFFERENT RADICES
Energy for 1 flit operation (pJ) | Radix-8 Router | Radix-5 Router | Radix-4 Router

Power Efficiency: To gain more insight into the throughput and power performance, we define a parameter named power efficiency as 1/E, where E = (time for each PE to finish sending 1000 packets at the injection rate with an average latency of 100 clock cycles) x (total energy dissipated during the process). This method assumes that the PEs adopt the injection throttling technique [12] to maintain a certain injection rate to keep the average

end-to-end latency below a certain level, which is 100 clock cycles in our simulation scheme. The power efficiency is higher if a network spends less time to finish the traffic load and, at the same time, spends less energy during the process. In other words, the more packets a network can transmit per joule, the more power efficient it is.

Figure 11 shows the normalized power efficiency for the five evaluated schemes. Clearly CNOC has the best performance, followed by the other two high-radix network topologies. The 2D Mesh has the worst power efficiency because of its large average hop count and low throughput. CNOC outperforms the Fat Tree that uses the same high-throughput router because CNOC has fewer routers. Thus, the average cost of transferring a packet in CNOC is lower than in the Fat Tree, since fewer routers are involved in the operation.

Figure 11. Normalized network power efficiency for four topologies.

VII. CONCLUSION

Low-radix NOC architectures do not scale well with the increasing number of PEs in the CMP. The diameter of the network rises rapidly with the growing number of tiles, making the end-to-end communication latency intolerable. Many high-radix NOC topologies have been proposed to improve network performance. Among these we believe the Clos Network-on-Chip (CNOC) is the most promising with its low average hop counts and good load-balancing characteristics. In this paper, we propose (1) a PDRRM VOQ router design to achieve high speed and high throughput in CNOC, and (2) a heuristic floor-planning algorithm to minimize the power consumption caused by the long wires. Experimental results show that the throughput of a 64-node 3-stage CNOC under uniform traffic increases from 62% to 78% by replacing the baseline Virtual Channel (VC) routers with PDRRM VOQ routers.
We also compared CNOC with other high-radix NOC topologies under the same router radix upper bound, and found that with the new design techniques CNOC has the highest throughput, lowest zero-load latency, and best power efficiency.

ACKNOWLEDGMENT

This work was supported by a Polytechnic Institute of NYU Angel Fund grant and by U.S. Army CERDEC. The authors would also like to thank Dr. N. Sertac Artan and Dr. Yang Xu for their valuable comments, and Chao Zeng and Zhenyu Sun for their help in simulations.

REFERENCES

[1] C. Clos, A study of non-blocking switching networks, Bell System Technical Journal, pp , 424, Mar
[2] D. Pamunuwa, J. Oberg, L. R. Zheng, M. Millberg, A. Jantsch, and H. Tenhunen, Layout, performance and power trade-offs in mesh-based network-on-chip architectures, in Proc. IFIP Int. Conf. Very Large Scale Integr., Dec. 2003, p
[3] S. R. Ohring, M. Ibel, S. K. Das, and M. J. Kumar, On generalized fat trees, Proceedings of the 9th Intl. Symp. on Parallel Processing, Washington DC,
[4] J. Balfour and W. J. Dally, Design tradeoffs for tiled CMP on-chip networks, in ICS 06: Proc. of the 20th annual international conference on Supercomputing, 2006, pp
[5] J. Kim, J. Balfour, and W. J. Dally, Flattened Butterfly Topology for On-Chip Networks, IEEE Computer Architecture Letters, vol. 6, no. 2, pp , July-Dec
[6] D. Ludovici, F. Gilabert, S. Medardoni, and C. Gomez, Assessing Fat-Tree Topologies for Regular Network-on-Chip Design under Nanoscale Technology Constraints, Proc. of Conf. on Design, Automation and Test in Europe,
[7] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez, and C. A. Zeferino, SPIN: a scalable, packet switched, on-chip micro-network, in Design Automation and Test in Europe Conf. and Exhibition,
[8] C. Gomez, F. Gilabert, M. E. Gomez, P. Lopez, and J. Duato, RUFT: Simplifying the Fat-tree Topology, in Proceedings of the 14th IEEE International Conference on Parallel and Distributed Systems,
[9] D. N. Jayasimha, B. Zafar, and Y. Hoskote, On-Chip Interconnection Networks: Why They are Different and How to Compare them, Technical Report, Intel Corp,
[10] T. T. Ye and G. De Micheli, Physical planning for multiprocessor networks and switch fabrics, in Proc. ASAP,
[11] H. Wang, X. Zhu, L. S. Peh, and S. Malik, Orion: a power-performance simulator for interconnection networks, in Proc. of the 35th Intl. Symp. on Microarchitecture,
[12] E. Baydal, P. Lopez, and J. Duato, A Family of Mechanisms for Congestion Control in Wormhole Networks, IEEE Trans. on Parallel and Distributed Systems, 16(9): ,
[13] W. J. Dally, Wire-Efficient VLSI Multiprocessor Communication Networks, Proceedings of the Stanford Conference on Advanced Research in VLSI, Paul Losleben, ed., MIT Press, March 1987, pp
[14] A. Kumar, P. Kundu, A. Singh, L.-S. Peh, and N. K. Jha, A 4.6Tbits 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS, in 25th International Conference on Computer Design, October
[15] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta, Microarchitecture of High Radix router, Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA), June 2005, pp
[16] V. Soteriou, R. S. Ramanujam, B. Lin, and L.-S. Peh, A High-Throughput Distributed Shared-Buffer NoC Router, Computer Architecture Letters, vol. 8, no. 1, Jan. 2009, pp
[17] H. J. Chao and J. S. Park, Centralized contention resolution schemes for a large-capacity optical ATM switch, in Proc. IEEE ATM Workshop, Fairfax, Virginia, May 1998
[18] N. McKeown, The iSLIP scheduling algorithm for input-queued switches, IEEE/ACM Transactions on Networking, vol. 7, no. 2, pp , Apr
[19] H. J. Chao and B. Liu, High Performance Switches and Routers.
[20] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., San Francisco, CA,
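
As a worked illustration of the power efficiency metric of Section VI.D, the sketch below sums the five per-hop energy terms listed there (buffer write, buffer read, crossbar, link, allocator) and evaluates 1/E with E = (finish time) x (total energy). It is a minimal sketch under stated assumptions, not the evaluation code of this paper: the names `HopEnergy`, `finish_time`, `total_energy`, and `power_efficiency` are hypothetical, the injection rate is assumed to be in flits per cycle per PE, every hop is assumed to cost the same energy, and all numeric inputs in the usage note are placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass
class HopEnergy:
    """Energy in pJ that one flit spends on one hop (placeholder values)."""
    buffer_write: float
    buffer_read: float
    crossbar: float
    link: float
    allocator: float

    def total(self) -> float:
        # Per-hop energy is modeled as the plain sum of the five terms.
        return (self.buffer_write + self.buffer_read + self.crossbar
                + self.link + self.allocator)


def finish_time(packets_per_pe: int, packet_len_flits: int,
                injection_rate: float) -> float:
    """Cycles one PE needs to inject its packets.

    Assumes injection_rate is given in flits per cycle per PE.
    """
    if injection_rate <= 0:
        raise ValueError("injection_rate must be positive")
    return packets_per_pe * packet_len_flits / injection_rate


def total_energy(num_pes: int, packets_per_pe: int, packet_len_flits: int,
                 avg_hops: float, hop: HopEnergy) -> float:
    """Total dynamic energy in pJ.

    Simplifying assumption: every flit crosses avg_hops identical hops.
    """
    flits = num_pes * packets_per_pe * packet_len_flits
    return flits * avg_hops * hop.total()


def power_efficiency(time_cycles: float, energy_pj: float) -> float:
    """Power efficiency 1/E, where E = time x energy."""
    if time_cycles <= 0 or energy_pj <= 0:
        raise ValueError("time and energy must be positive")
    return 1.0 / (time_cycles * energy_pj)
```

For example, with placeholder inputs, `finish_time(1000, 8, 0.5)` gives the time factor for a PE that sustains half a flit per cycle while keeping the average latency at the 100-cycle target, and `total_energy(64, 1000, 8, 3.0, hop)` gives the energy factor for a 64-PE network with three hops per flit. Dividing each network's result by the largest one yields normalized values of the kind plotted in Figure 11. The metric rewards a network twice: a higher sustainable injection rate shortens the finish time, and fewer or cheaper hops reduce the energy.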
