SOFTWARE BASED FAULT-TOLERANT OBLIVIOUS ROUTING IN PIPELINED NETWORKS*

Young-Joo Suh, Binh Vien Dao, Jose Duato, and Sudhakar Yalamanchili

Computer Systems Research Laboratory
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332-0250
e-mail: {suh, dao, sudha}@ee.gatech.edu
phone: (404) 894-2940
fax: (404) 853-9959

Facultad de Informatica
Universidad Politecnica de Valencia
P.O.B. 22012
46071 Valencia, Spain
e-mail: jduato@aii.upv.es

Abstract -- This paper presents a software based approach to fault-tolerant routing in oblivious, wormhole routed networks. When a message encounters a faulty output link it is removed from the network by the local router and delivered to the messaging layer of the local node's operating system. The message passing software can re-route this message along a non-minimal oblivious path or via an intermediate node, which will forward the message to the destination. A message may encounter multiple faults and pass through multiple intermediate nodes. This paper discusses deadlock, livelock, and performance issues. Router designs are minimally impacted, remaining compact, oblivious, and fast. Therefore this approach is a good candidate for incorporation into the next generation of wormhole routed multiprocessor networks.

1.0 Introduction

Interconnection networks in modern machines make use of oblivious, fixed path wormhole routing to achieve high throughput and low message latency. However, fault tolerant communication requires the network to be able to dynamically route messages along alternative, possibly non-minimal paths. This paper proposes a software based approach for re-routing messages blocked by faults. The techniques reported here are motivated by several considerations. First, the approach is targeted towards environments where the fault rates are relatively low, i.e., on the order of a maximum of 3 failed components between repair cycles.
During this time we wish the machine to continue functioning, possibly with degraded communication performance. Solutions for higher fault rates have been addressed elsewhere [9]. Second, we wish to retain the features of existing oblivious router designs, i.e., compactness and speed. This implies that additional hardware complexity in the form of additional virtual channels should be avoided. Finally, we wish to make the common case fast. Therefore messages that do not encounter faults should be minimally impacted.

* This research was supported in part by a grant from the National Science Foundation under grant CCR-9214244.

The basic idea is quite simple. When a message encounters a faulty link, it is removed from the network by the local router and delivered to the messaging layer of the local node's operating system. The message passing software either i) modifies the header so the message may follow an alternative dimension order path, or ii) computes an intermediate node address. In either case, the message is re-injected into the network. In the case that the message is transmitted to the intermediate node, it will be forwarded upon receipt to the final destination. A message may encounter multiple faults and pass through multiple intermediate nodes.

The problem is distinct from adaptive packet routing in networks using packet switching or virtual cut-through [14]. Routing is oblivious and is based on wormhole switching. Re-routing must consider dependencies across multiple routers caused by small buffers (smaller than the message size) and pipelined dataflow. The proposed techniques accommodate a wider range of fault patterns than previously proposed wormhole routing techniques. Only messages that encounter faults are affected, and degradation is largely proportional to the number of faults. If the mean time between failures is large, this approach is viable.
Thus, we feel it is a good candidate for inclusion in the next generation of wormhole routed networks for commercial multiprocessors. This is particularly true when the application environment does not justify the use of expensive, custom, fault-tolerant backplanes.

2.0 Fault Model

This paper considers k-ary n-cube networks. Adjacent faulty links and faulty nodes are coalesced into fault regions. Fault regions may overlap, forming a larger fault region. We also assume that the fault regions do not disconnect the network. Fault regions may be convex, or L-shaped or U-shaped concave regions. Some example fault regions are illustrated in Figure 1. Messages are routed obliviously, and therefore cannot progress when the required output link at a node leads to a fault region. Such a message is removed from the network and delivered to the local node's message handling software. The message is said to be absorbed at the local node and is subsequently referred to as a faulted message. This status is recorded in the message header.

3.0 Fault Tolerant Oblivious Routing

This section describes the software-based fault-tolerant oblivious routing algorithm, e-sft. In the absence of faults, the e-sft routing algorithm is equivalent to the traditional dimension order routing algorithm, e-cube [8]. When the outgoing link at a node leads to a fault region, the message is absorbed and delivered to the messaging layer of the local node. The header may now be modified to reflect a different, non-minimal path around the faulty region. For example in Figure 1, a message from node P to node Q is blocked by a fault region. The message can be transmitted along the negative X-dimension across the wraparound channel to node Q. Alternatively, an intermediate node may be selected, and the message is sent to this node. The intermediate node will receive the message and forward it to the final destination. For example in Figure 1, a message from node P to node R may first be sent to intermediate node T. Node T can then transmit the message in dimension order to node R. If other faults are encountered by the re-routed message, the process is repeated.

3.1 Routing Header

In order to keep track of the manner in which a message is re-routed, the header contains a 2 bit flag called the Direction Flag (DF). The DF records the following information regarding the path that a message has taken.

00 - Traversing the shortest distance along the X-dimension
01 - Traversing the longer distance along the X-dimension
10 - Traversing the shortest distance along the Y-dimension
11 - Traversing the longer distance along the Y-dimension

For example, the message from P to Q in Figure 1 will have a DF value of <01>. Since fault-free routing is dimension order, a DF value of <10> would imply the message had attempted to traverse the X-dimension in both directions and is now trying to traverse the Y-dimension. The exception to this interpretation is when the source and destination nodes are in the same column. In this case the message will attempt the Y-dimension first and the interpretation of DF is reversed, i.e., <00> and <01> (<10> and <11>) refer to the Y-dimension (X-dimension). The DF value in the header is only modified when the message is absorbed at a node. It enables e-sft routing to keep track of the directions along each dimension that have been attempted and therefore aids in re-routing decisions.

There are three additional fields in the header. A faulted status bit (F) indicates that the message has encountered at least one fault and is being re-routed. This bit enables a node to distinguish between messages destined for the local node and messages which must be forwarded. A prevent flag (PF) status bit is used to prevent the occurrence of certain livelock situations.
The role of the PF will become clear from the example described later. A two bit re-route table field (RT) specifies one of three tables to be used for re-routing decisions. Finally, since messages may be routed through intermediate nodes, the header must contain two sets of address fields. The first records the final destination address (X-Final, Y-Final). This is an absolute address. The second is used for routing the messages and is an offset within each dimension (X-Dest, Y-Dest). The message header now appears as shown in Figure 2. Note that the routers only process the offset fields and set the F bit. All of the remaining header processing is done in software, and only when messages encounter faults. Thus, router operations are minimally impacted.

Figure 1. Examples of Fault Regions (panels: Convex Fault Region; Concave Fault Region)

Figure 2. Format of the message header. Fields: RT (Re-route Table), DF (Direction Flag), PF (Prevent Flag), F (Faulted Status Bit), X-Dest (X-coordinate offset), Y-Dest (Y-coordinate offset), X-Final (X-coordinate of final destination), Y-Final (Y-coordinate of final destination)

3.2 The Routing Algorithm

The network hardware routes messages using traditional dimension order routing based on the X-Dest and Y-Dest fields in the message header. If the outgoing channel at a router is faulty, the router sets the F bit and routes the message to the local processor interface. This causes the message to be marked as a faulted message and ejected to the local messaging layer. If the F bit is set, the messaging software checks the X-Final and Y-Final fields to determine if the message is to be delivered locally. If not, a re-routing function is invoked. The X-Dest and Y-Dest fields are updated, and if necessary the DF and PF flags (as described below) are modified.

The re-routing function depends only on the relative address of the first node where a fault is encountered and the final destination. Let the coordinates of the node where the message first encounters a faulty channel be $(x_f, y_f)$ and the coordinates of the destination be $(x_d, y_d)$. Let the offsets at $(x_f, y_f)$ in each dimension be $\Delta x$ (given by $x_d - x_f$) and $\Delta y$ (given by $y_d - y_f$). There are three possible cases: i) $\Delta y = 0$, ii) $\Delta x = 0$, and iii) $\Delta x \neq 0$ & $\Delta y \neq 0$. The re-routing decisions for each case are captured in Tables 1-3. The RT field identifies which of the three cases applies to the message and is set at $(x_f, y_f)$. The notation $(x_d, y_d)^s_x$ signifies that the message header is modified to be transmitted to the node with coordinates $(x_d, y_d)$ along the shortest path in the X-dimension. The notation $(x_d, y_d)^l_x$ signifies transmission along the longer path in the X-dimension. Note that re-routed messages still follow dimension order, though not the shortest path within a dimension (due to faults).

When re-routed messages are absorbed at nodes due to faults, the RT field identifies the table to be used in making the re-routing decision for the message. If the RT field is 0, then this is the first node at which the message encountered a fault and RT must be set to signify one of the three cases above. The tables can be interpreted as follows. If a header to be re-routed has a DF value shown in the first column, it is re-routed to the node shown in the second column. The notation specifies the direction in the dimension the message is to be transmitted. The remarks column specifies the action that takes place if the message subsequently encounters a fault before reaching the node specified in column 2, or arrives at the intermediate node without meeting a fault.

The tables attempt to capture the following ideas. When a message encounters a fault, it is first re-routed in the same dimension in the opposite direction. If another fault is encountered, the message is routed in an orthogonal dimension in an attempt to route around the faulty regions. The DF keeps track of the directions attempted. The PF will be shown to prevent certain livelock situations within concave fault regions.

Table 1: Re-routing decisions for the case $\Delta y = 0$

DF   Re-routed to             Remarks
00   $(x_d, y_d)^s_x$         DF=<01> after meeting a fault
01   $(x_d, y_d)^l_x$         DF=<10> after meeting a fault if PF = 0
                              DF=<11> after meeting a fault if PF = 1
10   $(x_c, y_c + r)^s_y$     DF=<00> & PF is unchanged if no fault met
                              DF=<11> and PF = 1 if fault met
11   $(x_c, y_c - r)^s_y$     DF=<00> after received by a node

Table 2: Re-routing decisions for the case $\Delta x = 0$

DF   Re-routed to             Remarks
00   $(x_c, y_d)^s_y$         [not recoverable]
01   $(x_c, y_d)^l_y$         DF=<10> after meeting a fault if PF = 0
                              DF=<11> after meeting a fault if PF = 1
10   $(x_c + r, y_c)^s_x$     DF=<00> & PF is unchanged if no fault met
                              DF=<11> and PF = 1 if fault met
11   $(x_c - r, y_c)^s_x$     DF=<00> after received by a node

Table 3: Re-routing decisions for the case $\Delta x \neq 0$ & $\Delta y \neq 0$

DF   Re-routed to             Remarks
00   [not recoverable]        [not recoverable]
01   [coordinates not recoverable], longer path ($l$)
                              DF=<10> after meeting a fault
10   $(x_c, y_d)^s_y$         DF=<11> after meeting a fault
                              DF=<00> if no fault met
11   $(x_c, y_d)^l_y$         DF=<00> after received by a node

Table 1 shows the routing decisions for the case $\Delta y = 0$, i.e., the first node at which the message was absorbed and the destination node are on the same row. The message is injected into the network with DF=<00> and PF=0 along the shortest path in the X-dimension. This is captured in the first row of the table. The remarks column indicates the action taken when the message encounters a faulty link. In this case, DF is changed to <01> to signify re-routing along the longer path in the X-dimension and the message is re-injected. If the message encounters another faulty channel before reaching $(x_d, y_d)$, DF is changed to <10> (PF=0) or <11> (PF=1) as denoted in the remarks column of row 2. Let us assume DF is set to <10>.
In this case, from row 3 we see that the message is re-routed r hops along the positive direction in the Y-dimension in an attempt to find a path around the fault region. The message is explicitly sent to an intermediate node $(x_c, y_c + r)$, where $(x_c, y_c)$ is the current node. When the message reaches this intermediate node, it is absorbed and DF is reset. However, if instead the message with DF=<10> encounters a faulty channel, DF is set to <11> and PF is set to 1, and the message is re-routed r hops in the negative direction in the Y-dimension. The choice of r is arbitrary. The right choice depends upon the expected height of the fault region. The safest (and most expensive) choice would be to use r = 1 hop.

Table 2 illustrates the routing decisions for the case of $\Delta x = 0$. These are nearly identical to those for the case $\Delta y = 0$. However, when the offset in the Y-dimension is eliminated at an intermediate node, the RT field is changed to select Table 1. Table 3 captures decisions for the case of $\Delta x \neq 0$ & $\Delta y \neq 0$. In this case the message must traverse both the X- and Y-dimensions, and, after the message encounters a faulty channel along the longer path in the X-dimension, e-sft attempts to eliminate the offset in one of the dimensions, reducing subsequent re-routing decisions to those captured in Table 1 and Table 2. When one of the dimensions has been eliminated, the RT field is changed to reflect the choice of Table 1 or Table 2.

Figure 1 illustrates an example where $\Delta y = 0$. In the figure S denotes the source node and D the destination node. The path taken by the message is numbered. A message sent with DF=<00> from S meets a faulty channel and is absorbed by node A (1). Since $\Delta y = 0$, DF is set to <01> and the message is transmitted to $(x_d, y_d)^l_x$, where it meets a faulty channel at node B. Now DF is changed to <10> and the message is to be transmitted along the Y-dimension. However, since $\Delta y = 0$, the purpose now is to traverse the Y-dimension just far enough to be routed around the faulty region.
In this example r = 1, so the message is sent to node C, which is located one hop away from node B (3). Since the message is received rather than being ejected due to a fault, DF is reset to <00>. The message tries the shortest path in the X-dimension and fails (4), and the process is repeated until step 8. After step 8, DF is set to <10> and node F tries to send the message to $(x_c, y_c + 1)^s_y$. However, the Y-dimension channel is faulty. Therefore DF is changed to <11>, which indicates that the message should be sent in the opposite direction in the Y-dimension. After some more failures in the X-dimension, the message arrives at node H. In the next step, the message is delivered to the final destination.

Note that in step 11 and step 5 (or step 2 and step 14) the DF value is <01>, and after both steps the message meets a faulty channel at node C in the X-dimension. If node C changes DF to <10> after both steps 5 and 11, the message passes through nodes E-C-F-G-F-C-E-C-F..., infinitely, resulting in livelock. Therefore we require DF to be set to <11> after step 11 to force the message to traverse the negative Y-dimension. This distinction is realized by the PF bit in the header. The PF is initially 0. When DF makes a transition from <10> to <11>, PF is set to 1 and remains at 1. The value of PF is used to prevent cycles in the course of the message transmission (which correspond to a cycle of DF values) and thus avoid certain livelock situations.
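
The DF and PF transitions described for the case of a zero Y offset can be sketched as a small state machine. The following Python is only an illustrative sketch, not the authors' implementation: the names `Header`, `select_table`, `table1_step`, `table1_target`, and the event labels `FAULT` and `ARRIVED` are hypothetical, field widths are not modeled, the wraparound arithmetic modulo k is an assumption for a torus, and the case of a fault met while DF=<11> is left unhandled because the text above does not describe it.

```python
from dataclasses import dataclass, replace

# Hypothetical event labels: the message was absorbed because of a fault,
# or it was received at the intermediate node without meeting a fault.
FAULT, ARRIVED = "fault", "arrived"


@dataclass(frozen=True)
class Header:
    """Software-visible header fields of Section 3.1 (widths not modeled)."""
    x_final: int      # absolute X coordinate of the final destination
    y_final: int      # absolute Y coordinate of the final destination
    df: int = 0b00    # Direction Flag
    pf: int = 0       # Prevent Flag
    f: int = 0        # Faulted status bit (set by the router in the paper)
    rt: int = 0       # Re-route Table: 0 = not yet chosen, 1..3 = Table 1..3


def select_table(dx: int, dy: int) -> int:
    """Choose the RT value from the offsets at the first faulted node."""
    if dx == 0 and dy == 0:
        raise ValueError("message is already at its destination")
    if dy == 0:
        return 1
    if dx == 0:
        return 2
    return 3


def table1_step(h: Header, event: str) -> Header:
    """Apply the remarks column of Table 1 to DF and PF."""
    if h.df == 0b00 and event == FAULT:
        return replace(h, f=1, df=0b01)
    if h.df == 0b01 and event == FAULT:
        # PF decides whether the positive or negative Y direction is tried.
        return replace(h, f=1, df=0b11 if h.pf else 0b10)
    if h.df == 0b10 and event == FAULT:
        return replace(h, f=1, df=0b11, pf=1)
    if h.df in (0b10, 0b11) and event == ARRIVED:
        return replace(h, df=0b00)   # PF is left unchanged
    raise ValueError(f"no Table 1 rule for DF={h.df:02b} on {event}")


def table1_target(h: Header, cur, r: int = 1, k: int = 16):
    """Return (node, dimension, path) selected by Table 1 for the current DF."""
    x_c, y_c = cur
    dest = (h.x_final, h.y_final)
    return {
        0b00: (dest, "X", "shortest"),
        0b01: (dest, "X", "longer"),
        0b10: ((x_c, (y_c + r) % k), "Y", "shortest"),
        0b11: ((x_c, (y_c - r) % k), "Y", "shortest"),
    }[h.df]
```

Replaying the example, a message that meets faults with DF=<00>, <01>, and <10> in turn ends with DF=<11> and PF=1. After it is next received at an intermediate node, a later pair of X-dimension faults takes it directly from <01> to <11>, which is the behavior PF is intended to enforce.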

3.3 Deadlock and Livelock Freedom

A few observations can be made about the behavior of e-sft routing. Re-routed messages still follow dimension order, which is deadlock-free [8], though not necessarily along the shortest path in each dimension. When intermediate nodes are utilized, these nodes are selected to be in the same column or row as the current node. Finally, absorbed messages use dynamically allocated buffers in node memory (rather than router buffers) to avoid introducing dependencies between consumption channels and injection channels. It has been shown that these latter dependencies could lead to deadlock [2]. Thus we have the following theorem.

Theorem 1. e-sft is deadlock-free.

Due to space limitations, the proof of deadlock freedom is provided in a detailed technical report [15]. While e-sft is deadlock-free, it is not necessarily livelock-free for arbitrarily large fault patterns. However, when the number of faults is limited to a small number, e.g., 3, livelock freedom can also be guaranteed [15]. Under the current fault model, the PF flag enables a message to exit certain types of concave regions, and routing does proceed around convex regions. These observations and experience with simulations are encouraging. However, the issue of livelock-freedom for the current fault model is still under study.

4.0 Performance Evaluation

The performance of e-sft was evaluated with flit-level simulation studies of message passing in a 16-ary 2-cube with 16 flit long messages and a single flit routing header. We use a congestion control mechanism (similar to [1]) that places a limit on the size of the buffer on the injection channels. If the input buffers are filled, messages cannot be injected into the network until a message in the buffer has been routed. Injection rates are specified as the number of 16 flit messages injected in each 5000 cycle period. Thus, an injection rate of 20 corresponds to 0.064 flits/node/cycle. Note that these are 32 bit flits.
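
The conversion from an injection rate to an offered load follows directly from the definition above. The short Python sketch below, in which the function name `offered_load` is hypothetical, reproduces the quoted figure of 0.064 flits/node/cycle for an injection rate of 20 and applies the same arithmetic to the other injection rates used in the experiments.

```python
def offered_load(messages_per_period: int,
                 flits_per_message: int = 16,
                 period_cycles: int = 5000) -> float:
    """Offered load in flits/node/cycle for a given injection rate."""
    return messages_per_period * flits_per_message / period_cycles


# Injection rates that appear in the simulation results.
for rate in (10, 20, 40, 60):
    print(f"Inject = {rate}: {offered_load(rate):.3f} flits/node/cycle")
```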
A 32 bit header enables routing within a $64 \times 64$ torus. A flit crosses a channel in a single cycle, and traverses a router from input to output in a single cycle. Routing decisions are assumed to take a single cycle, with the network operating with a 50 MHz clock and a 20 ns cycle time. The software cost for absorbing and re-injecting a message is derived from measured times on an Intel Paragon and reported work with active message implementations [11]. Based on these studies we assess this cost at 25 µs per absorption/injection, or 50 µs each time a message must be processed by the messaging software at an intermediate node. If the message encounters busy injection buffers when it is being re-injected, it is re-queued for re-injection at a later time. Absorbed messages have priority over new messages to prevent starvation. Relative to existing router designs [4], the only additional functionality required is in the Routing Arbitration Block [4]. One side effect of the increased header size is a possible increase in virtual channel buffer size and the width of the internal datapaths, although 32 bit datapaths appear to be reasonable for the next generation of routers. The remaining required functionality of e-sft is implemented in the messaging layer software.

4.1 Simulation Results

In a fault-free network the behavior of e-cube and e-sft is identical. Simulation experiments placed a single fault region of varying size within the network. Performance of e-sft is shown in Figure 3 for three different sized concave fault regions (5, 8, and 11 nodes) and for a 9 node convex fault region. Due to the greater difficulty in entering and exiting a concave fault region, the average message latency for e-sft is greater in the presence of concave fault regions than for equivalent sized convex fault regions. The curves also show that for each of the particular fault configurations, the latency remains relatively constant as the throughput increases.
As the throughput increases, the number of messages each node injects and receives increases, but the percentage of messages that encounter the fault region remains relatively constant. Therefore, the latency remains relatively flat. Another factor is that the high latencies of re-routed messages mask some of the growth in the latency of messages that do not encounter faults, though a close inspection of the graphs reveals a small but steady growth in average latency.

Figure 3. Latency-throughput curves (Latency Vs. Throughput: latency in clock cycles against throughput in flits/cycle/node, for 5 Node Concave, 8 Node Concave, 11 Node Concave, and 9 Node Convex fault regions)

Figure 4. Latency-throughput vs. node faults (panels: Latency Vs Faults, latency in clock cycles; Throughput Vs Faults, throughput in flits/cycle/node; both against faults (node failures), for Inject = 10, 20, 40, and 60)

Figure 4 shows the performance of e-sft in the presence of a single convex fault region ranging in size from 1 failed router node to 21 failed router nodes. The latency plot indicates that when the network is below saturation traffic, the increase in the size of the fault block causes significant increases in the average message latency. This is due to the increase in the number of messages encountering larger fault regions (a 1 node fault region represents 0.4% of the total number of nodes in the network, while a 21 node fault region represents 8.2% of the total number of nodes).

The latency and throughput curves for high injection rates (60) represent an interesting case. Throughput and latency appear to remain relatively constant. At high rates and larger fault regions, more messages become absorbed and re-routed. However, the limited buffer size provides a natural throttling mechanism for both new messages and absorbed messages waiting to be re-injected. As a result, active faulted messages in the network form a smaller percentage of the traffic, and both the latency and throughput characteristics are dominated by the steady state values of traffic unaffected by faults. The initial drop in throughput for a small number of faults is due to the fact that a higher percentage of faulted messages are delivered, reducing throughput. These results suggest that sufficient buffering of faulted messages and priorities in re-injecting them have a significant impact on the performance of faulted messages. At lower injection rates the throughput of the network remains relatively constant, independent of the size of the fault blocks, since fault blocks only increase the latency of the messages.
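
The percentages quoted above, and the weight of the software overhead relative to the network cycle time, can be checked with a few lines of arithmetic. This is a minimal sketch using the parameters stated in Section 4.0 (a 16-ary 2-cube, a 20 ns cycle, and 50 µs per pass through the messaging layer at an intermediate node); the constant and function names are hypothetical.

```python
K, N = 16, 2                   # 16-ary 2-cube
NODES = K ** N                 # 256 nodes in total
CYCLE_NS = 20                  # cycle time at 50 MHz
SOFTWARE_COST_NS = 50_000      # 50 us per pass through the messaging layer


def fault_fraction(failed_nodes: int) -> float:
    """Percentage of network nodes inside a fault region of the given size."""
    return 100.0 * failed_nodes / NODES


def overhead_cycles(passes: int) -> int:
    """Software overhead, in network cycles, for a number of re-routing passes."""
    return passes * SOFTWARE_COST_NS // CYCLE_NS


print(f"{fault_fraction(1):.1f}% and {fault_fraction(21):.1f}% of nodes")
print(f"{overhead_cycles(1)} cycles per pass")
```

Under these assumptions a single pass through the messaging layer costs 2500 cycles, far more than the 16 cycles needed for a 16 flit message to cross one channel, so a few re-routing passes dominate the latency of a faulted message.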
Since messages are guaranteed delivery, when operating well below saturation the network quickly reaches the steady state throughput. The effect of the overhead on the message latencies can be significant. Message latency histograms show peaks at intervals of 2500 cycles (corresponding to the 50 µs software overhead each time a message passes through the messaging layer software at a node). Among messages that do encounter faults, it appears that on average the majority do not require more than three re-routing steps to be routed around the fault region. In general, the results demonstrate good performance with messages being re-routed a few times. In practice, we find the probability of multiple router failures before repair to be very low. Therefore we expect that the large majority of faulted messages will not have to pass through more than one node. This would make these techniques attractive for next generation wormhole routed networks.

5.0 Concluding Remarks

We find that the performance of re-routed messages is significantly affected by the techniques for buffering and re-injection. While the large majority of traffic is unaffected by faults, reliable message delivery and improving the latency of faulted messages will require a better understanding of how re-routed messages should be handled. Finally, it appears that this approach can be extended naturally to networks employing adaptive routing. For example, many fully adaptive routing protocols rely on dimension order routing over a subset of channels [10,7] to avoid deadlock. Messages which are blocked waiting on these channels, and experience faults on these channels, can be absorbed and re-routed. Furthermore, partially adaptive routing protocols such as those based on the Turn Model [13] can also be adapted in a similar fashion. These issues and extensions are the subject of ongoing research.

REFERENCES

[1] R. Boppana and S. Chalasani, A comparison of adaptive wormhole routing algorithms, Proc. of Int. Symp. on Computer Architecture, pp. 351-360, 1993.
[2] R. Boppana and S. Chalasani, Fault-tolerant routing with non-adaptive wormhole algorithms in mesh networks, Proc. of Supercomputing, pp. 693-702, 1994.
[3] S. Chalasani and R. Boppana, Fault-tolerant wormhole routing in tori, Proc. of Int. Conf. on Supercomputing, 1994.
[4] A. Chien, A cost and speed model for k-ary n-cube wormhole routers, Proc. of Hot Interconnects Workshop, Aug. 1993.
[5] A. Chien and J. H. Kim, Planar-adaptive routing: Low-cost adaptive networks for multiprocessors, Proc. of Int. Symp. on Computer Architecture, pp. 268-277, 1992.
[6] W. J. Dally, Virtual-channel flow control, IEEE Trans. on Parallel and Distributed Systems, vol. 3, pp. 194-205, March 1992.
[7] W. J. Dally and H. Aoki, Deadlock-free adaptive routing in multicomputer networks using virtual channels, IEEE Trans. on Parallel and Distributed Systems, vol. 4, pp. 466-475, April 1993.
[8] W. J. Dally and C. L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. on Computers, C-36, pp. 547-553, May 1987.
[9] B. V. Dao, J. Duato, and S. Yalamanchili, Configurable flow control mechanisms for fault tolerant routing, Proc. of Int. Symp. on Computer Architecture, June 1995.
[10] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. on Parallel and Distributed Systems, vol. 4, pp. 1320-1331, Dec. 1993.
[11] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active messages: a mechanism for integrated communication and computation, Proc. of Int. Symp. on Computer Architecture, pp. 256-266, 1992.
[12] C. J. Glass and L. Ni, Fault-tolerant wormhole routing in meshes, Proc. of the Fault Tolerant Computing Symposium, 1993.
[13] C. J. Glass and L. Ni, The turn model for adaptive routing, Proc. of Int. Symp. on Computer Architecture, pp. 278-287, 1992.
[14] S. Konstantinidou and L. Snyder, Chaos router: architecture and performance, Proc. of Int. Symp. on Computer Architecture, pp. 212-221, 1991.
[15] Y. J. Suh, B. V. Dao, J. Duato, and S. Yalamanchili, Software based fault tolerant routing, Technical Report TR-GIT/CSRL-95/04, Georgia Institute of Technology, Atlanta, Georgia 30332-0250, April 1995.