Crossbar Analysis for Optimal Deadlock Recovery Router Architecture

Yungho Choi and Timothy Mark Pinkston
SMART Interconnects Group, EE-Systems Dept., University of Southern California, Los Angeles, CA 90089-2562
{yunghoc@charity, tpink@charity}.usc.edu; http://www.usc.edu/dept/ceng/pinkston/smart.html

Abstract

We explore the design of optimal deadlock recovery-based fully adaptive routers by evaluating promising internal router crossbar designs. Unified and decoupled crossbar designs aimed at exploiting the full capabilities of adaptive routing are evaluated by analyzing their effect on overall network performance. We show that an enhanced hierarchical crossbar design that supports routing locality in virtual network class achieves the highest performance at relatively low cost.

1 Introduction

The importance of the interconnection network in achieving high-performance parallel computing is conspicuous. Communication efficiency is maximized if network routers are designed to fully exploit the underlying capabilities of the network and routing algorithm. Routing adaptivity and router speed critically affect the overall network performance. Unfortunately, the two are competing factors, as increased adaptivity generally results in increased router delay. Deadlock-recovery routing schemes [1, 2] maximize adaptivity by completely eliminating the routing restrictions that deadlock avoidance schemes enforce to prevent deadlock. If not implemented carefully, this increase in adaptivity can compromise the cost and delay of the router, resulting in overall network performance degradation rather than improvement.

The internal router crossbar design significantly affects the cost and delay of other router components such as header selection, routing decision, and arbitration logic. The router crossbar design determines its size and routing freedom, which can influence 50% or more of the total router delay [3]. Therefore, it is necessary to obtain an optimal crossbar design to implement high-performance
deadlock recovery routers. This study explores the design of optimal deadlock recovery-based fully adaptive routers that minimize the cost of adaptivity while maximizing network performance. A variety of promising internal router crossbar designs are evaluated in terms of cost and speed. Further, the performance of each design is simulated at the network level.

The next section presents relevant background and related work. Section 3 describes the crossbar designs and their unique features. Section 4 presents extensive evaluation and performance analysis of the router crossbar designs using modeling and network level simulation. Finally, the conclusions drawn from this work are given in Section 5.

This research was supported by an NSF Research Initiation Award, grant ECS-9411587, and an NSF Career Award, grant ECS-9624251.

2 Background

Many crossbar designs have been developed for deadlock avoidance algorithms which reduce router cost and delay by exploiting routing restrictions enforced by the underlying algorithm. In the Mesh Routing Chip [4], for example, static dimension-ordered routing is enforced: packets are not allowed to route in the Y dimension before reaching the X dimension of the destination. This makes it possible to partition the crossbar into smaller, faster units based on dimension. In the partially adaptive Planar Adaptive Router [5], only two dimensions at a time are available for routing to avoid deadlock (routing in all other dimensions is prohibited). By exploiting these routing restrictions, the crossbar is partitioned into planar subcrossbar units (as opposed to dimensional units in the Mesh Routing Chip) to improve speed. In the Hierarchical Adaptive Router [6], the crossbar is partitioned into ordered virtual network subcrossbar classes. Deadlock is avoided by enforcing routing restrictions in the lowest virtual network, viz., Duato's algorithm [7]. The less optimal alternative to these designs, applicable to all avoidance algorithms, is a unified crossbar design.
Its performance is less influenced by routing restrictions but more coupled to network characteristics such as node degree and virtual channel support.

Router crossbar designs should fully exploit the restrictions as well as the capabilities of the underlying routing algorithm and network to achieve the highest possible performance. Because of the different restrictions inherent to deadlock avoidance algorithms, not all crossbar designs are optimal or even applicable. Deadlock recovery routers enforce few if any restrictions, allowing many more crossbar designs to be used, including modified variants of designs proposed for deadlock avoidance routers. These designs should aim at exploiting the full capabilities of unrestricted routing while, at the same time, reducing crossbar complexity. This is the motivation of our work. We explore the design of optimal deadlock recovery-based routers through careful analysis of promising crossbar designs. Although our analysis is applicable to deadlock recovery schemes in general, we focus on Disha-based [1] recovery routing.

Figure 1. Internal router crossbar designs: (a) U-CB, (b) C-CB, (c) H-CB, (d) E-CB.

3 Router Crossbar Designs

Four alternative crossbar designs for deadlock recovery routers are presented and evaluated by examining their unique features. They are classified into two categories: unified crossbar designs and decoupled crossbar designs. We consider one unified-crossbar structure (U-CB) and three decoupled crossbar structures, shown in Figure 1: the cascade-crossbar (C-CB), the hierarchical-crossbar (H-CB), and the enhanced-hierarchical-crossbar (E-CB). We describe these crossbar designs in greater detail below, but first state a few assumptions. A connect channel is an internal physical channel which connects two subcrossbars within the same router; increasing the number of connect channels decreases internal blocking but increases subcrossbar size. We assume a k-ary n-cube network with v virtual channels per physical link and c connect channels per router subcrossbar, where applicable. Messages are assumed to be received from processing nodes through a randomly selected injection virtual channel (see Figure 1). All mutual and external deadlocks are assumed to be recovered from by a deadlock recovery mechanism; e.g., in Disha [1], the centralized deadlock buffer is used to progressively recover from deadlock.

3.1 Unified Design

The most straightforward of the four crossbar designs is the unified-crossbar, shown in Figure 1(a). Its structure consists of a single crossbar capable of connecting all router inputs to any of the router outputs across all virtual channels. This results
in a crossbar size of P = (21), where P is the number of crossbar input ports. The cost and speed are functions of both n and v, resulting in increased delay as n and v increase. However, due to U-CB's strictly nonblocking internal structure, any input port can be connected to an available output port in a single cycle regardless of other connections. This capability can be exploited by fully adaptive routing, making the U-CB worthwhile to evaluate despite its potentially long delay.

3.2 Decoupled Designs

The decoupled crossbar designs consist of smaller subcrossbars connected by connect channels. This structure reduces the size of the crossbar as well as the complexity of the routing arbiter, making the router potentially faster. Further, this design structure is intended to exploit routing locality in dimension or in virtual channel network. If most packets tend not to change dimensions or virtual channel network frequently due to locality in routing, then it is not necessary for the crossbar to provide packets with direct access to all output channels, even in the case of fully adaptive routing. Instead, changes in direction or virtual network can be supported by providing indirect access to all output channels through subcrossbar connect channels. This makes decoupled crossbar designs, which differ in the type of routing locality exploited, interesting to evaluate. Each design is briefly discussed below.

The cascade-crossbar design is derived from avoidance-based dimension-ordered routing [4] but modified to exploit recovery-based fully adaptive routing. As shown in Figure 1(b), each subcrossbar in C-CB is associated with only one dimension, which consists of v virtual channels for each direction. This results in a subcrossbar input size of P = (2v + c + 1) ports. Whenever packets change dimensions, the connect channel must be used to access the next subcrossbar in sequence. Unlike the subcrossbars in avoidance-based routers, the subcrossbars and virtual channels in each
subcrossbar may be used by packets in no prescribed order for recovery-based fully adaptive routing. Additionally, wrap-around connections are allowed in C-CB from the lowest dimension to the highest dimension to support adaptive routing. This router design exploits routing locality in direction and is, therefore, promising if packets being adaptively routed tend to continue in a given dimension the majority of the time before turning (possibly using different virtual channels within that dimension). In fact, many adaptive routers currently implement a preferred channel selection policy of straight over turn to exploit such locality in routing decisions, and the cascade-crossbar design takes advantage of this behavior.
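
The cascade-crossbar structure described above can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, v and c stand for the number of virtual channels per direction and connect channels per subcrossbar, the "+ 1" input is taken to be the injection port, and the n subcrossbars are assumed to be numbered 0 to n-1 and linked in a one-way ring (each feeds the next in sequence, with the wrap-around connect channel closing the ring). Under those assumptions it computes the subcrossbar input size and the number of connect-channel traversals a packet needs to reach the subcrossbar of another dimension.

```python
def cascade_subcrossbar_inputs(v: int, c: int) -> int:
    """Input ports of one cascade subcrossbar (assumed reading: 2v + c + 1).

    2 directions x v virtual channels, plus c incoming connect channels,
    plus 1 port assumed to be the injection port.
    """
    return 2 * v + c + 1


def connect_hops(current_dim: int, target_dim: int, n: int) -> int:
    """Connect-channel traversals from current_dim's subcrossbar to target_dim's.

    Assumes subcrossbars 0..n-1 form a one-way ring: each leads only to the
    next in sequence, and the wrap-around channel links the last to the first.
    """
    return (target_dim - current_dim) % n


if __name__ == "__main__":
    n = v = 3  # three dimensions, three virtual channels per direction
    for c in (1, 2, 3):
        print(f"c={c}: {cascade_subcrossbar_inputs(v, c)} subcrossbar inputs")
    # A packet finished in dimension 2 that still needs dimension 0
    # must take the wrap-around connect channel once.
    print("hops 2 -> 0:", connect_hops(2, 0, n))
```

With n = v = 3 and c = 1 this gives an 8-input subcrossbar. A packet that stays in its current dimension needs no connect channel at all, which is the routing locality the cascade design relies on.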

The hierarchical-crossbar design is derived from the Hierarchical Adaptive Router [6]. Each subcrossbar in H-CB is associated with one virtual channel network (VCN), which includes all dimensions and directions, as shown in Figure 1(c). This results in a subcrossbar input size of P = max[(2n + ); (2n + )] ports. Its subcrossbars each support connections to all directions, which allows packets to change dimensions within each subcrossbar but requires the use of connect channels to change VCNs. This crossbar structure exploits routing locality in virtual channel network, which is expected to support adaptive routing better than locality in dimension by allowing packets to avoid faulty and congested areas without using connect channels. Unlike in the deadlock-avoidance routers, no routing restrictions are enforced on the lowest subcrossbar VCN for recovery-based routing. However, inherent to this hierarchical design structure is the restriction that no connect channels exist between the lowest and highest VCNs. Moreover, packets are injected into the highest VCN only and trickle down to the next lower VCN on blockage, making the lowest virtual network a potential bottleneck. While this ordering between VCNs is required for avoidance-based routing, it is a restriction that can be relaxed in recovery routing.

The enhanced-hierarchical-crossbar design is proposed as an enhancement to the hierarchical-crossbar design. In addition to the connectivity supported by H-CB, the E-CB design has wrap-around connect channels between the lowest and highest virtual channel networks, since VCN ordering is not necessary in recovery-based routing (see Figure 1(d)). This solves the bottleneck problem of H-CB. In addition, injection ports from the processor are distributed over all subcrossbars in E-CB to balance VCN load [8]. Again, this feature is disallowed in deadlock-avoidance routers but is allowed with recovery routing. Therefore, E-CB exploits routing locality in VCN while supporting greater, more distributed access to
VCNs. The resulting subcrossbar input size is P = (2n + c + 1) ports.

While the decoupled crossbar structures provide many potential performance benefits, each inherently has internal complications that can degrade performance if not handled correctly, as discussed below.

3.2.1 Internal Blocking

The function of connect channels is to provide access to decoupled subcrossbars for packets that are currently blocked at the present subcrossbar or have finished routing over the resources supported by the present subcrossbar. The number of connect channels between two neighboring subcrossbars is critical to the performance of the router, as the lack of a sufficient number limits the available routing alternatives of packets, causing internal blocking (e.g., packet P3 in Figure 2(a)). Internal blocking may be more frequent in some decoupled designs than in others. There is a cost and delay associated with increasing the number of connect channels to prevent internal blocking, which sets an upper bound on their number. Hence, it is important to examine this trade-off to balance performance and cost.

Figure 2. Internal blocking and deadlock: (a) internal blocking, (b) internal self-deadlock, (c) internal mutual-deadlock.

3.2.2 Internal Deadlock

Internal deadlock is defined as the infinitely persisting packet blockage resulting from connect channel cyclic dependency. Two types can occur: internal self-deadlock and internal mutual-deadlock, illustrated in Figures 2(b) and (c), respectively, for one connect channel per subcrossbar, minimal routing, and the C-CB design. In Figure 2(b), packet P1 tries to obtain available output channels in subcrossbar y or z, as it has finished routing in the x dimension. It subsequently moves through subcrossbars y and z and back to x through the wrap-around connect channel because the output channels in y and z were not immediately available. As a result, packet P1
will remain blocked forever because it has used up all connect channels needed to route in the remaining dimensions to its destination. Internal self-deadlock can significantly degrade performance, as it may cause other packets to internally block. In Figure 2(c), packets P1, P2, and P3 are trying to reach subcrossbars z, x, and y, respectively, because they have finished routing in all other dimensions. However, they mutually block one another in a cyclic fashion. These internal deadlock situations must be guarded against in the decoupled crossbar structures.

Theorem 1: Necessary conditions for internal deadlock to occur are (1) a decoupled crossbar structure, (2) wrap-around connect channels, and (3) the absolute necessity for a packet to traverse the next subcrossbar in sequence, i.e., the destination address in the present dimension is reached but not in at least one other dimension.

Proof: Cyclic dependency among resources is a necessary condition for deadlock [9]. To build a cyclic dependency among connect channels, conditions (1) and (2) are required. To satisfy the hold and wait-for condition of deadlock, condition (3) is required. If there is no need for a packet to move to other subcrossbars, the blocked packet will subsequently be assigned to an output channel in the current subcrossbar.

From Theorem 1, internal deadlock can occur only in the cascade-crossbar, the only decoupled design which satisfies all conditions. Internal deadlock can be resolved by the same recovery method used for external deadlock (i.e., the deadlock buffer resource in Disha). However, internal deadlocks cause local router congestion, which may lead to overall network performance losses. To improve router and network performance, the following restrictions may be applied to avoid internal self-deadlock, although they are not necessary: no packet is allowed to use more than (n - 1) connect channels, and no packet may use the connect channels leading to subcrossbar dimensions in which it has finished routing.

3.2.3 Multi-cycle Delay

In the decoupled crossbar structures, packets may experience different setup delays and data-through delays, depending on the number of subcrossbar traversals. If the clock cycle is bound by the setup (or data-thru) delay of one subcrossbar, packets requiring subcrossbar traversal take multiple cycles to pass through the router, which increases their average router delay. Therefore, while the router speed (clock cycle time) may be faster than that of the unified design, the average router delay can actually be higher, depending on the dynamic behavior of routes taken through the network. If routing locality behavior exists and is appropriately exploited, the decoupled designs should outperform the unified design. This presents an interesting trade-off that must be evaluated by simulation.

4 Performance of Crossbar Designs

An optimal router architecture for deadlock recovery-based routing should incorporate the best of the alternative internal router crossbar designs. We evaluate these designs at the router level by estimating router delay and cost using Chien's model [3] and at the network level by measuring network performance via simulation.

4.1 Router Level Performance Evaluation

We compare our router designs in terms of cost and speed using Chien's model [3] and assuming n = v = 3. For the decoupled crossbar designs, we vary the number of connect channels from one to three. Table 1 gives the overall router cost and delay for the alternative crossbar designs. The decoupled crossbar routers are faster and less costly than the unified crossbar router by up to 20% and 33%, respectively. These advantages increase as the number of dimensions
and virtual channels grows but diminish as the number of connect channels grows beyond a certain point. Nevertheless, faster routers do not always result in higher network performance. Performance evaluation of each router design at the network level is required to determine how well each design exploits routing locality and the underlying flexibility of recovery-based routing.

Table 1. Cost and delay of router designs.

Design        Gate count   Tsetup     Tdata-thru
U-CB          32722        12.82 ns   5.24 ns
C-CB (c=1)    22662        10.30 ns   4.41 ns
C-CB (c=2)    26406        10.60 ns   4.52 ns
C-CB (c=3)    30426        10.87 ns   4.61 ns
H-CB (c=1)    21828        10.15 ns   4.33 ns
H-CB (c=2)    24646        10.30 ns   4.41 ns
H-CB (c=3)    27648        10.60 ns   4.52 ns
E-CB (c=1)    22662        10.30 ns   4.41 ns
E-CB (c=2)    26406        10.60 ns   4.52 ns
E-CB (c=3)    30426        10.87 ns   4.61 ns

4.2 Network Level Performance Evaluation

We compare the performance of the crossbar designs through extensive simulation using FlexSim, a more flexible version of FLITSIM 2.0. All simulations are run on an 8 × 8 × 8 three-dimensional torus (n = 3) with 3 virtual channels per physical channel and full-duplex links. Messages are 32 flits long. A buffer depth of two is assumed. All router designs use one injection and reception channel per node. A true fully adaptive minimal deadlock recovery routing scheme (Disha) is assumed, with a default time-out of 25 cycles before deadlock is suspected. Maximum normalized throughput (in flits/cycle/sec) and average latency (in nsec) are measured, taking into account the multi-cycle and router delay penalties of the different designs. Clock cycle time is assumed to be the minimum data-thru delay of a single pass through the router (sub)crossbar.

Uniform Traffic Results: Router designs using C-CB with 1 to 3 connect channels are compared against the U-CB router design in Figure 3(a). As shown, the performance of routers with C-CB improves drastically as the number of connect channels increases. This result indicates that connect channels in C-CB are critical resources and that locality in dimension is not high enough to exploit the
flexibility provided by adaptive routing. To achieve throughput comparable to U-CB, C-CB requires at least 3 connect channels.

Router designs using H-CB with 1 to 3 connect channels are compared to the U-CB router design in Figure 3(b). Unlike that of the C-CB router, the performance of the H-CB router is not affected significantly by the number of connect channels. This indicates that packets use connect channels much less frequently in H-CB than in C-CB, which means that locality in virtual channel network, unlike locality in dimension, is well suited to adaptive routing. However, the H-CB router does not have performance comparable to U-CB (H-CB with 3 connect channels has 12% less maximum throughput). This shows that the lowest virtual channel network is indeed a bottleneck. When packets internally block in the H-CB router, they move down to the next lower virtual network and have to finish routing in this congested network, even though higher networks free up after packets reach the lower virtual networks. Therefore, this crossbar design is less suitable for deadlock recovery-based adaptive routers.

Figure 3. Latency and throughput comparisons under uniform traffic: (a) cascade CB, (b) hierarchical CB, (c) enhanced CB.

As shown in Figure 3(c), the performance of the E-CB router exceeds that of U-CB and all other crossbar designs. The E-CB router with only 2 connect channels has up to 25% lower latency and slightly higher maximum throughput than the U-CB router. Moreover, there is not a wide performance gap in going from one connect channel to three, which indicates that connect channels are not critical resources in this design. One reason why the E-CB router with only one connect channel shows such good performance is that message injection at the node is distributed uniformly over all virtual channel networks, which minimizes the need for packets to change virtual channel networks and allows them to experience fewer subcrossbar traversals. Another reason is that locality in virtual channel network is high and the design can still exploit the full capabilities of adaptive routing. Moreover, the possible performance losses of the decoupled crossbar structure are negligible compared to the overall advantages.

Nonuniform Traffic Results: We also characterize the performance of these router designs using bit-reversal and perfect shuffle nonuniform traffic patterns. As shown in Figure 4(a), each additional connect channel increases the maximum throughput of the C-CB router by 7 units. However, unlike the case with uniform traffic, even three connect channels are not enough to obtain maximum throughput equal to that of the U-CB router. Not only are connect channels critically limiting resources, but locality in dimension under the bit-reversal traffic pattern is also worse than under uniform traffic. In Figure 4(b), the H-CB router shows results similar to those of the C-CB router, except that the number of connect channels does not significantly impact performance. Instead, the lowest virtual channel network limits the performance of the H-CB
router. In contrast, the E-CB router shows performance comparable to the U-CB router, which means exploitation of locality in virtual channel network is profitable under uniform as well as nonuniform traffic. Further, uniformly distributing message injection across subcrossbars helps to mitigate the potential problems associated with its decoupled crossbar structure (i.e., internal blocking and multiple cycle delay).

Simulation results under perfect shuffle traffic (Figures 5(a) and (b)) further confirm that connect channels and the lowest virtual channel network are limiting resources in the C-CB and H-CB designs, respectively. In contrast, E-CB routers outperform all other routers, including the U-CB router (Figure 5(c)). Moreover, the average latency for the E-CB router is measured to be up to 25% lower than that of the U-CB router.

Table 2. Summary of results (with sequential RA). All values are normalized to the U-CB router. Thr = maximum throughput; Lat = average message latency; RAN = uniform (random), BR = bit-reversal, PS = perfect shuffle traffic.

Design        Cost (gate count)   Thr (RAN)   Lat (RAN)   Thr (BR)   Lat (BR)   Thr (PS)   Lat (PS)
U-CB          1                   1           1           1          1          1          1
C-CB (c=1)    0.69                0.45        1.05        0.39       1.17       0.62       1.14
C-CB (c=2)    0.81                0.88        0.91        0.59       1.13       0.80       0.93
C-CB (c=3)    0.93                1           0.85        0.77       1.05       0.97       0.86
H-CB (c=1)    0.67                0.76        1.05        0.82       1.07       0.81       0.94
H-CB (c=2)    0.75                0.90        0.89        0.81       0.89       0.80       0.82
H-CB (c=3)    0.85                0.88        0.81        0.79       0.90       0.79       0.81
E-CB (c=1)    0.69                0.93        0.95        0.99       1          0.95       0.80
E-CB (c=2)    0.81                1.03        0.76        1.06       1.01       1.01       0.75
E-CB (c=3)    0.93                1.02        0.77        1.04       1.03       0.99       0.76

Table 2 summarizes our results for the four alternative router designs presented. The cost, average message latency, and maximum throughput of all router designs are normalized to the U-CB router. The H-CB (c = 1) design is the least costly; however, its performance is comparatively low. The E-CB (c = 2) design gives the best performance (highest throughput and lowest latency under uniform and nonuniform traffic), and it is 20% cheaper than the U-CB router design. We therefore conclude that the enhanced-hierarchical-crossbar design with a moderate number of connect channels is the
most suitable design for fully adaptive deadlock recovery routers.

Figure 4. Latency and throughput comparisons under bit-reversal traffic: (a) cascade CB, (b) hierarchical CB, (c) enhanced CB.

Figure 5. Latency and throughput comparisons under perfect shuffle traffic: (a) cascade CB, (b) hierarchical CB, (c) enhanced CB.

5 Conclusions and Future Work

This paper explores the design of optimal deadlock recovery-based routers through careful analysis of unified and decoupled internal router crossbar designs. Crossbar designs are evaluated by examining their unique features, cost, speed, and overall effect on network performance. We find that the higher cost and delay of the unified-crossbar design outweigh the benefits of nonblocking connectability for high dimensional, large virtual channel networks. Subcrossbar connect channels in the cascade-crossbar design are limiting resources because routing locality in dimension poorly exploits fully adaptive routing, particularly for nonuniform traffic. Increasing the number of connect channels improves performance up to the point where the subcrossbar size becomes prohibitively large. Connect channels in the hierarchical-crossbar design are not critical resources but, instead, the lowest virtual channel network can be a performance bottleneck. Although less costly to implement than the other crossbar designs by up to 20%, the hierarchical-crossbar design yields 20% less network throughput. The enhanced-hierarchical-crossbar design outperforms the others in both cost and performance; compared to the unified design, it is 20% cheaper, has up to 25% lower average message latency, and achieves slightly higher throughput. Of the decoupled crossbar designs, it requires the fewest connect channels, as it is able to exploit fully adaptive routing flexibility with routing locality in virtual channel network.

Our results suggest that the increased adaptivity offered by deadlock recovery-based routing algorithms can be profitably exploited and implemented in routers with reasonable cost and speed. We will continue to explore the design of other internal router architecture components optimized for efficient deadlock recovery-based routers in
future work.

References

[1] Anjan K. V. and Timothy Mark Pinkston. DISHA: A Deadlock Recovery Scheme for Fully Adaptive Routing. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 201-210, IEEE Computer Society, June 1995.
[2] J. Kim, Z. Liu, and A. Chien. Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing. In Proceedings of the 21st International Symposium on Computer Architecture, IEEE Computer Society, pages 289-300, April 1994.
[3] Andrew A. Chien. A Cost and Speed Model for k-ary n-cube Wormhole Routers. In Proceedings of the Symposium on Hot Interconnects, IEEE Computer Society, August 1993.
[4] Charles M. Flaig. VLSI mesh routing systems. Master's thesis, California Institute of Technology, Department of Computer Science, May 1987.
[5] Andrew A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. In Proceedings of the 19th Symposium on Computer Architecture, pages 268-277, IEEE Computer Society, May 1992.
[6] Ziqiang Liu and Andrew A. Chien. Hierarchical Adaptive Routing. In Symposium on Parallel and Distributed Processing, October 1994.
[7] J. Duato. A New Theory of Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 4(12):1320-1331, 1993.
[8] Steve Scott and Greg Thorson. Optimized Routing in the Cray T3D. In Proceedings of the Workshop on Parallel Computer Routing and Communication, pages 281-294, May 1994.
[9] J. Duato. A Necessary and Sufficient Condition for Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 6(10):1055-1067, October 1995.