ServerNet Deadlock Avoidance and Fractahedral Topologies

Robert Horst
Tandem Computers Incorporated
10555 Ridgeview Court, Cupertino, CA 95014

Abstract

This paper examines the problems of deadlock avoidance in multistage networks, and proposes a new class of scalable topologies for constructing large networks without introducing loops that could cause deadlocks. The new topologies, called fractahedrons, are deadlock-free and reduce the maximum link contention compared to other networks. The use of fractahedral topologies is illustrated by various configurations of 6-port ServerNet routers. The properties of fractahedral networks are compared with networks configured as a mesh, hypercube, or fat tree.

1.0 Introduction

Multistage networks are finding increasing use in both massively parallel computer systems and in networks of workstations and PCs. The networks that provide this connectivity must provide high bandwidth, low latency, scalability, low cost, and reliability. To provide high reliability, it is important for the network to be designed in a way that guarantees it will not deadlock. While some general techniques are known to avoid or recover from deadlocks, many of these techniques cannot be directly applied without adversely impacting cost or performance.

Most traditional topologies for MPP networks have been developed and analyzed without particular regard to solving the deadlock problem. Some topologies that appear to be symmetric and ideal candidates for MPP systems may in fact be quite asymmetric and suboptimal when routing algorithms are designed to avoid deadlocks. The development of ServerNet has caused us to take a fresh look at MPP topologies to look for better ways of constructing networks to optimize performance while avoiding the possibility of deadlock.

ServerNet is a system area network for providing high-speed communications from processor to processor, processor to I/O device, or I/O device to other I/O devices [1]. The first implementation of ServerNet (formerly called TNet) has byte-serial point-to-point 50 MB/sec links. Full duplex operation is provided by pairing two unidirectional links in a cable that can reach up to 30 meters. Complex networks can be constructed using 6-port router ASICs (application-specific integrated circuits) that contain input FIFO buffers and a non-blocking crossbar switch. Full network fault tolerance can be provided by configuring pairs of router fabrics with dual-ported nodes.

ServerNet is the key enabling technology for implementing systems with different requirements for performance and reliability [2]. ServerNet systems support software-based fault tolerance through the process-pair technology of the Tandem Nonstop Kernel, and support duplexed hardware-based fault tolerance for running standard operating systems such as Unix and Windows NT. In addition, ServerNet can provide reliable communications in clusters of non-fault-tolerant workstations or PCs.

2.0 Background

Proposed topologies for MPP routing networks include the mesh, ring, torus, star, binary tree, fat tree, hypercube, cube-connected cycles, and shuffle-exchange network. Characteristics of these networks can be found in many MPP references [3,4,5]. A key design problem for many networks is that they contain loops that could give rise to deadlocks. Figure 1 illustrates the way a deadlock can occur in a wormhole-routed network. With wormhole routing, the head of a packet is routed onward before the tail of the packet arrives at that router.
Deadlocks can occur when a set of packets cannot make further progress because of a circular dependency in which each packet must wait for another to proceed before acquiring access to an output link. This deadlock situation can occur in any network with loops in the connection graph.
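The circular-dependency condition can be made concrete with a minimal sketch, given below. It is illustrative only: the router and link names are hypothetical and no router runs such a check, but a deadlock corresponds exactly to a cycle in the "waits-for" graph among output links.

def has_cycle(waits_for):
    """Return True if the directed graph (dict: link -> set of links the
    holder of that link is waiting for) contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(v):
        color[v] = GREY
        for w in waits_for.get(v, ()):
            c = color.get(w, WHITE)
            if c == GREY:
                return True                    # back edge: circular wait
            if c == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color.get(v, WHITE) == WHITE and visit(v) for v in waits_for)

# Figure 1 in miniature: four packets blocked head-to-tail around a loop of
# four routers; packet i holds link i and waits for link (i+1) mod 4.
links = ["r0->r1", "r1->r2", "r2->r3", "r3->r0"]
waits = {links[i]: {links[(i + 1) % 4]} for i in range(4)}
print(has_cycle(waits))    # True: no packet can ever make progress

Breaking any one edge of such a cycle, which is what the routing restrictions described below accomplish, removes the deadlock.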

Figure 1. Deadlock in a wormhole-routed network. The head of each packet is blocked by the tail of another packet. Circles are routers (packet switches).

Previous solutions to the deadlock problem were costly in terms of router complexity or communications efficiency. A popular solution to deadlock avoidance was described by Dally and Seitz in [6]. They propose adding virtual channels to routers, then breaking loops by allowing some messages to pass other packets. This solution requires multiple packet buffers at each router stage, and severely complicates the router design. The cost of the buffers can be quite significant because buffering space may dominate the area of a typical router.

Other solutions to the deadlock problem are software-based and can impact performance. For instance, some networks detect deadlocks with timeout counters, discard the packets in progress, and re-send the lost packets. This technique cannot be used in system area networks because the lightweight protocol implemented over these networks cannot tolerate out-of-order delivery of packets. If the entire transfer is retried to avoid out-of-order delivery, the deadlock recovery time may be unacceptable. Solutions based on retry also make it difficult to distinguish between network congestion and hardware-related intermittent failures requiring maintenance actions.

Another technique for avoiding deadlocks is to design the routing algorithm to preclude routing loops. For instance, dimension-order routing may be used in a mesh network to avoid routing loops. With dimension-order routing, packets are routed first in one direction, say the X direction, then the Y direction. With this rule applied in Figure 1, routes A and C would be allowed, but routes B and D would be disallowed, thus preventing the deadlock situation.

Figure 2 shows a 3-dimensional hypercube with certain paths disallowed in order to break cycles. By designating specific paths to be disabled, the routing algorithm is less restrictive than dimension-order routing. Disables are configured to break the loops on each face of the cube, as well as to break loops with six and eight links. The main problem with this technique is that most arrangements of path disables give uneven link utilization under uniform load.

Figure 2. Breaking deadlocks in a hypercube by disabling paths. Arrows show paths that are not allowed.

With the disables as shown in Figure 2, the upper links are lightly utilized because they are used only to communicate with the top node, while the bottom links are more heavily used because they are used both to communicate with the bottom node as well as for pass-through traffic between other nodes. The link utilization can be made uniform if the path disables can be unidirectional. Such an arrangement could be described with twelve single-ended arrows instead of six double-ended arrows. The disadvantage of this technique is that most traffic in the network is not reflexive; the path from A to B may be different than the path from B to A. Non-reflexive routing is allowed in ServerNet, but it increases the impact of a link failure. There may be nothing wrong with any of the hardware along the path from A to B, but that path may be unusable due to the inability to send acknowledgments back from B to A.

The problem with many network topologies is that we must either choose non-reflexive routing or uneven link utilization. The deadlock problem can be avoided through other network topologies, but these networks may not provide the required bandwidth.
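For reference, the dimension-order rule just described can be written in a few lines. The sketch below is an illustration only (the coordinate representation is an assumption; the X-before-Y order follows the text, and this is not code from ServerNet):

def dimension_order_route(src, dst):
    """Route in a 2-D mesh by exhausting the X offset first, then Y.
    Because every route turns at most once, from X travel to Y travel,
    no circular link dependency can form, so the mesh is deadlock-free."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                 # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Opposite corners of a 6x6 mesh: 11 routers are traversed.
print(len(dimension_order_route((0, 0), (5, 5))))   # 11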
Bandwidth in MPP systems is often measured in terms of bisection bandwidth, the total traffic that can flow between halves of the system when cut at its weakest point. Tree networks are free of routing loops, but their bisection bandwidth is determined by the bandwidth through the router at the root node.

The fat tree can improve this situation by replicating routers at higher levels of the tree, but creates the new problem of finding a way to evenly distribute traffic over the parallel links in the fat part of the tree. This problem can cause increased link contention, a subject that is addressed in more detail in section 3.3.

2.1 Fully-connected networks of routers

The basic building blocks for the new topologies are fully-connected assemblies of routers. Figure 3 shows all fully-connected configurations of 6-port routers. (The first generation of ServerNet is implemented with 6-port routers because it offers the best price-performance point given the available pins and gates on the chosen ASIC technology.)

Figure 3. Fully-connected topologies of 6-port routers.

  Routers   Node ports   Max link contention
  2         10           5:1
  3         12           4:1
  4         12           3:1
  5         10           2:1
  6         6            1:1

The configurations giving the most ports are the three- and four-router options given in Figures 3b and 3c. Of these two options, the four-router option has less potential link contention; at most three nodes may simultaneously attempt to use any one of the inter-router links. The reduced contention means that this configuration will be less prone to queuing delays than the three-router configuration. It is also attractive because routing within this assembly routes packets based on exactly two bits of the destination node identifier. This prevents sparse usage of the node address space and simplifies the routing algorithm. The topology of Figure 3c can be redrawn in three dimensions as a tetrahedron, as shown in Figure 4.

Figure 4. Tetrahedral topology with 6-port routers.
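The port counts and contention ratios of Figure 3 follow from simple arithmetic on the router fanout. The sketch below is a small illustration of that arithmetic (the function is mine, not part of ServerNet); with 6-port routers it reproduces the figures above, and it generalizes to other fully connected groups of N-port routers.

def fully_connected_group(num_routers, ports_per_router=6):
    """For a fully connected group of routers, return (total node ports,
    worst-case contention on one inter-router link).  Each router spends
    num_routers - 1 ports on inter-router links; the remaining ports attach
    end nodes, and in the worst case every node port on one router targets
    nodes behind the same neighboring router."""
    node_ports_per_router = ports_per_router - (num_routers - 1)
    total_node_ports = num_routers * node_ports_per_router
    return total_node_ports, node_ports_per_router

for r in range(2, 7):
    ports, contention = fully_connected_group(r)
    print(f"{r} routers: {ports} node ports, {contention}:1 max link contention")
# 2: 10 ports 5:1, 3: 12 ports 4:1, 4: 12 ports 3:1, 5: 10 ports 2:1, 6: 6 ports 1:1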

2.2 Fractahedral networks

Multiple tetrahedrons can be connected together with higher-level tetrahedrons to increase the number of connected nodes. This type of structure repeats at higher levels to form networks with similar topologies when viewed from any scale. This self-similar structure of tetrahedrons is called a fractahedron, for fractal-tetrahedron. Figure 5 shows the self-similar structure of a three-level fractahedron.

Figure 5. Three-level 2-3-1 thin fractahedron.

Each tetrahedron has routers with ports divided into three sets. One set connects to two lower-level tetrahedrons, another set connects to the other three routers in that tetrahedron, and the last set connects to the next higher-level tetrahedron. If there is only one connection between each tetrahedron and the next higher level, this is called a thin fractahedron; if all routers of one level connect to the next level, it is called a fat fractahedron. The basic topology is not restricted to 6-port routers. Any fully connected set of routers can form a similar snowflake structure.

Depending on system implementation, there may be additional router levels between the end nodes (CPUs or peripheral adapters) and the lowest-level tetrahedral router. One or two added router levels are typically needed to fan out to the devices associated with one CPU. With one additional router level connecting each pair of CPUs to the level 1 tetrahedron, a 16-CPU system may be constructed with a maximum delay between CPUs of four router hops -- two within the tetrahedron, and one each to get to and from the tetrahedron. When extended to 1024 CPUs through a thin fractahedron, the maximum delay is twelve.

Note that all thin fractahedrons have a bisection bandwidth fixed at four links. While this is adequate for many applications, there are other applications requiring more bandwidth. Hence it is desirable to scale the bandwidth to meet those demands.
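The 16-CPU and 1024-CPU figures above follow from a short recurrence. A quick check of that arithmetic (my own working, not the paper's, assuming the single added fan-out level and counting hops as routers traversed) is:

def thin_fractahedron(levels):
    """Maximum CPUs and worst-case router hops for an N-level 2-3-1 thin
    fractahedron.  Counts follow the text: 8 fan-out ports per level-1
    tetrahedron and 2 CPUs per fan-out router (hence 2 * 8**N CPUs);
    4N - 2 router traversals inside the fractahedron (Table 1) plus the
    2 fan-out routers at the two ends."""
    max_cpus = 2 * 8 ** levels
    max_hops = (4 * levels - 2) + 2
    return max_cpus, max_hops

print(thin_fractahedron(1))   # (16, 4):   the 16-CPU example above
print(thin_fractahedron(3))   # (1024, 12): the 1024-CPU example above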

2.3 Fat fractahedrons

In Figure 5, there are unused ports at three of the four corners of each tetrahedron. If the higher-level tetrahedrons are replicated, each copy of the higher-level tetrahedron can connect to a different corner of the tetrahedron at the next level down. With all four upward ports of each tetrahedron connected to replicated routers, the structure is called a 2-3-1 fat fractahedron.

In a 128-CPU network, at level 2 there are four independent layers of tetrahedrons, each connecting to a different corner of the level 1 tetrahedrons. In three dimensions, level 2 is conceptually four tetrahedral layers nested inside each other, but not connected to each other. In two dimensions, it can be envisioned as papers stacked up with a router on each sheet. Each corner of the 4-layer tetrahedron has a pair of four-conductor cables connected to the four routers in the stack. Each of these cables connects to the four corners of a different level 1 tetrahedron. There is also a 16-conductor cable that connects to all four routers at each of the four corners. This cable then connects to a corner of the level 3, 16-layer tetrahedron, and so on.

Routing in multilayer networks is done depth-first by examining address bits from high order to low order. At any level, if there is no match in the address bits above those controlling that level's tetrahedron, then the packet is sent to the next higher level. In networks with all layers implemented, this ascent up the tree takes only one router delay per level. In effect, packets always go straight up the tree without taking any inter-tetrahedral links. Those links are used only on the way down to get to the correct destination. Each tetrahedron encountered matches three more bits of the address, and can take one or two router delays (one if the layer was already correct, two if a tetrahedron delay must be taken to get to the correct layer). In the case of ServerNet, these matches are actually done by looking up entries in the routing table inside each router. In a 1024-CPU system with 3 levels (and 1-4-16 layers), the worst-case delay is 10 router delays (4 on the way up, 6 on the way down), a reduction of two compared to the thin fractahedron.

Table 1 gives a summary of the characteristics of 2-3-1 fractahedrons. The delay equations do not include any additional delays added between an end node and the first-level tetrahedron. In this table and throughout this document, we reserve the upward connections from the top level for future expansion, to avoid the need to remove existing connections as a system is expanded. In other words, more nodes could be supported in N levels if we knew there would never be a need for the N+1 level.

TABLE 1. N-level 2-3-1 fractahedral parameters

  Parameter         Thin         Fat
  Maximum nodes     2*8^N        2*8^N
  Maximum delays    4N-2 hops    3N-1 hops
  Bisection BW      4 links      4^N links

2.4 Deadlock prevention

In the fat fractahedron, the addition of multiple layers has also introduced potential routing loops. However, the preceding routing algorithm eliminates these loops and avoids possible deadlocks. Conceptually, there are multiple upward and downward paths from one node to another, and use of all possible paths would result in deadlock. But the routing algorithm always takes a local inter-level link rather than going through a neighboring inter-level link. This algorithm eliminates possible loops in a way similar to dimension-order routing in a hypercube. The ServerNet routers also have path-disable logic that can be set to enforce the elimination of the loops, even if the routing table is corrupted by a fault.
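The routing rule of this section can be summarized as "straight up on local links until the high-order address bits match, then down, matching three bits per level." The sketch below is a simplified illustration of that idea only; ServerNet performs the equivalent matches through per-router routing tables, and the intra-tetrahedron layer hop is ignored here.

def route_up_down(src, dst):
    """Illustrative up/down routing for a fractahedron-like hierarchy with
    3 destination-address bits per level (8 children per level).  The packet
    climbs its own inter-level links while the remaining high-order bits
    differ, then descends following the destination bits.  Never taking a
    neighboring inter-level link on the way up is what removes the routing
    loops discussed in section 2.4."""
    turn = 0
    while (src >> (3 * turn)) != (dst >> (3 * turn)):
        turn += 1                                   # ascend one level
    descent = [(dst >> (3 * lvl)) & 0x7 for lvl in range(turn - 1, -1, -1)]
    return turn, descent        # levels climbed, child selected at each level down

# Nodes in different top-level subtrees of a 3-level hierarchy:
print(route_up_down(0o012, 0o712))   # (3, [7, 1, 2]): up 3 levels, down via 7, 1, 2

The example climbs to the top level because the two addresses differ in their highest-order bits; when the high-order bits already match, the packet turns downward immediately at a lower level.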
3.0 Comparison to other topologies

Given a specific router whose design has been driven by technology constraints, it is useful to examine different ways to connect systems with those routers. In the case of ServerNet, this means finding the best way to build systems with 6-port routers. In this section, we contrast ways of forming a 64-node network with different configurations of 6-port routers.

In commercial applications, it is not possible to know the data access patterns a priori, making static load balancing impossible. For instance, for a given database query, we may have an arbitrary set of four CPU nodes trying to communicate with an arbitrary set of four disk controller nodes over an extended period of time. The ability of a network to handle load imbalances is a key factor in application performance, and is discussed for each different topology. Initially, we just use the maximum link contention as a measure of the ability to handle load imbalance. Further studies will use simulation to better determine the effects of contention.

3.1 2-D Mesh

To implement a 2-D mesh with a 6-port router, four ports are devoted to the four directions, leaving the last two ports available to connect to the nodes. Connecting 64 nodes requires a 6x6 mesh. Maximum latency for this network is 11 router hops for transfers between opposite corners. The router delays scale quickly as the number of nodes grows. A 128-node network would need an 8x8 mesh with a maximum of 15 router hops, while a 1024-node network requires a 23x23 mesh and 45 hops.
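The mesh sizes and hop counts above can be checked with a couple of lines (an illustration of the arithmetic only, assuming two nodes per router and counting routers traversed):

import math

def mesh_for(nodes, nodes_per_router=2):
    """Smallest square 2-D mesh of 6-port routers that attaches the given
    number of nodes, and its corner-to-corner hop count (2k - 1 routers
    traversed for a k x k mesh)."""
    k = math.ceil(math.sqrt(nodes / nodes_per_router))
    return k, 2 * k - 1

for n in (64, 128, 1024):
    k, hops = mesh_for(n)
    print(f"{n} nodes: {k}x{k} mesh, max {hops} router hops")
# 64 -> 6x6, 11 hops; 128 -> 8x8, 15 hops; 1024 -> 23x23, 45 hops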

Another drawback of the mesh is the worst-case link contention. If we assume dimension-order routing to break deadlocks, the worst-case contention is along the same path as the longest latency. If we label the columns A-F and the rows 1-6, the worst-case contention comes from simultaneous transfers from A1-F6, A2-E6, A3-D6, A4-C6, and A5-B6. All five of these transfers need to turn the same corner at A6. With two nodes at each router, a total of ten transfers may simultaneously try to share the A6 links, giving a 10:1 contention ratio.

3.2 Hypercubes

A 64-node (6-D) hypercube requires a 7-port router: six ports for the hypercube and one for the node connection. With 6-port routers, it would be necessary to use a lower-dimension hypercube with some other structure to increase the number of connected nodes. Even if a satisfactory structure could be found, it would be necessary to restrict the allowable paths to avoid deadlocks. This path restriction would give uneven link utilization and high contention, as described previously. Another drawback of the hypercube is that the bandwidth between nodes is fixed. There is no easy way to trade performance for cost to give a range of price-performance points.

3.3 Trees/Fat Trees

Trees and fat trees come the closest to meeting the requirements for large commercial systems. Trees are deadlock-free, can be expanded independent of the number of router ports, and can be scaled in performance by moving from a simple tree to a fat tree. Figure 6 is a diagram of a 64-node fat tree.

Figure 6. 64-node 4-2 fat tree implemented with 6-port routers.

With a 6-port router, the six ports can be partitioned into groups of 3-3 or 4-2. The 3-3 partitioning has no bandwidth reduction toward the root, but is more expensive than the 4-2 partitioning. In most networks, we anticipate some degree of locality in the data access patterns. For instance, each processor in a cluster would typically have a high degree of local access to reach its system disk, and to reach one of a collection of equivalent resources (such as communications lines). For this reason, the 4-2 fat tree may be preferred for most systems even though there is some bandwidth reduction at each level. The bisection bandwidth scales as the network grows, but not at the same rate. For 64 nodes, the bisection bandwidth is 4 links.

In the 64-node fat tree, there are many equivalent paths through the second-level routers, and there must be a policy for deciding which path to use. For instance, in routing a packet from node 0 to node 63, any one of the four links to the top level could be traversed. The first temptation might be to dynamically select a non-busy link. However, if sequential packets can take different paths to the same destination, earlier packets might encounter more contention upstream, causing them to be delivered out of order. The guarantee of in-order delivery of packets is key to eliminating software protocol overhead in ServerNet. A typical need for in-order delivery is in the delivery of an I/O interrupt packet that must follow the data transfer from a controller. The interrupt packet cannot be allowed to pass the data on the way to the CPU.

To maintain in-order delivery, there must be a fixed path between each pair of nodes. Figure 6 shows one arbitrary partitioning of the outbound traffic from nodes 0-15 through routers A-D. Links to the highest level are labeled EIM, FJN, GKO, and HLP to show which link is used for each destination. This partitioning gives even link utilization in the case of uniform traffic, but can have very bad contention in some situations. For instance, assume that nodes 0-11 want to send data to nodes 52-63. All twelve transfers will contend for the single link HLP, for a 12:1 contention ratio. Other static partitionings of traffic through the high-level links can do no better than the 12:1 contention ratio.
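The fixed-path requirement amounts to a static, destination-indexed choice of uplink. The sketch below is a hypothetical rendering of such a policy; the letter labels come from Figure 6, but the mapping by 16-node destination groups is an assumption used only for illustration. It shows both why packets between a pair of nodes can never pass one another and how an unlucky traffic pattern piles twelve transfers onto one link.

def uplink_for(dest, uplinks=("EIM", "FJN", "GKO", "HLP")):
    """Static partitioning: the top-level link is chosen purely from the
    destination address (here, by 16-node group), so the path between any
    pair of nodes is fixed and in-order delivery is preserved."""
    return uplinks[dest // 16]       # destinations 0-15 -> EIM, ..., 48-63 -> HLP

# The hotspot of section 3.3: twelve senders, destinations 52-63.
print({uplink_for(d) for d in range(52, 64)})   # {'HLP'}: all twelve share one link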
3.4 Fat tree / fat fractahedron comparison

Figure 7 shows 64 nodes connected through a fat fractahedron. The network has been drawn in the style of a fat tree to show the comparison more clearly. In this network, the worst-case link contention is for the links within the second-level tetrahedrons. For instance, if nodes 6, 7, 14, and 15 are all trying to send to nodes 54, 55, 62, and 63, all four transfers will attempt to use the same diagonal link in the same layer of level 2. While this network has the same bisection bandwidth as the 4-2 fat tree, it spreads traffic more evenly through the inter-level links. The worst-case contention is just 4:1, a major improvement over the 12:1 contention in the fat tree. The cost of the contention reduction is an increase in the number of routers from 28 to 48.

A 3-3 fat tree could improve bisection bandwidth, but at great cost in routers and router hops. For 64 nodes, a 3-3 fat tree would require 100 routers, and transfers would take an average of 5.9 router hops.

Figure 7. 64-node 2-3-1 fat fractahedron drawn in the style of a fat tree.

In the fractahedron, the router delay grows in smaller increments than in the fat tree (which always has an odd number of router hops). Table 2 contrasts the number of levels required to reach a number of other nodes for the two topologies. The average number of hops for the fractahedron is slightly less: 4.3 versus 4.4 for the fat tree.

TABLE 2. 64-node comparison

  Attribute                 4-2 Fat Tree   2-3-1 Fat Fractahedron
  Average hops              4.4            4.3
  Maximum link contention   12:1           4:1
  Routers                   28             48

4.0 Conclusions

This paper has introduced a new family of topologies for massively parallel systems. The fractahedral topologies have been designed to eliminate loops and to reduce link contention compared to existing MPP topologies. The topology scales to any number of nodes, and allows for tradeoffs between cost and performance. The current focus is on tetrahedral ensembles of 6-port ServerNet routers, but the concepts easily generalize to other fully connected groups of N-port routers. Future work will center on simulations of large topologies in order to better understand network performance under heavy loading. As large ServerNet-based systems are deployed, we will begin to characterize the workloads and will measure network performance in real customer environments.

5.0 References

[1] R. W. Horst, "TNet: A Reliable System Area Network," IEEE Micro, Vol. 15, No. 1, pp. 37-45, February 1995.
[2] W. E. Baker, R. W. Horst, D. P. Sonnier, W. J. Watson, "A Flexible ServerNet-based Fault-Tolerant Architecture," in Proc. 25th Int. Symp. on Fault-Tolerant Computing, Pasadena, CA, June 27-30, 1995.
[3] G. Almasi, A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings Publishing Co., 1994.
[4] D. Reed, R. Fujimoto, Multicomputer Networks: Message-Based Parallel Processing, MIT Press, 1987.
[5] C. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. Computers, Vol. C-34, No. 10, pp. 892-901, Oct. 1985.
[6] W. J. Dally, C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Trans. Computers, Vol. C-36, No. 5, pp. 547-553, May 1987.

ServerNet, Tandem, and Nonstop are trademarks of Tandem Computers Incorporated.