Characterization of Deadlocks in Interconnection Networks

Sugath Warnakulasuriya, Timothy Mark Pinkston
SMART Interconnects Group, EE-System Dept., University of Southern California, Los Angeles, CA

Abstract

Deadlock-free routing algorithms have been developed recently without a full understanding of the frequency and characteristics of deadlocks. Using a simulator capable of true deadlock detection, we measure a network's susceptibility to deadlock due to various design parameters. The effects of bidirectionality, routing adaptivity, virtual channels, buffer size and node degree on deadlock formation are studied. In the process, we provide insight into the frequency and characteristics of deadlocks and the relationship between routing flexibility, blocked messages, resource dependencies and the degree of correlation needed to form deadlock.

1 Introduction

Interconnection network routing algorithms aim to minimize message blocking by efficiently utilizing network virtual channel and physical channel resources while ensuring deadlock freedom. Routing approaches to accomplish this can be based on avoiding deadlock or on recovering from deadlock. The main distinction between these two approaches is the trade-off each makes between routing freedom and deadlock formation. Avoidance-based routing algorithms enforce certain routing restrictions in order to avoid deadlocks altogether [1, 2, 3]. Recovery-based routing algorithms relax routing restrictions and recover from potential deadlock situations [4, 5]. The circumstances under which either routing approach is preferable depend critically on the frequency with which deadlocks occur and the resulting effects. For instance, deadlock may be so infrequent for a particular network configuration that avoidance-based routing uses network resources inefficiently, resulting in frequent message blocking.
On the other hand, deadlock may be so frequent and costly in some network configurations that avoidance-based routing outperforms recovery-based routing.

This paper precisely quantifies the frequency and characteristics of deadlock formation in wormhole and cut-through k-ary n-cube networks and identifies network design parameters which influence deadlock formation. This enables us to better understand the nature of deadlocks and their likelihood, and to determine the circumstances under which routing algorithms should be based on recovery as opposed to avoidance. In accomplishing this, we analyze the effects of different traffic patterns, bidirectionality, routing adaptivity, node degree, number of virtual channels and buffer depth on the frequency and characteristics of deadlocks. To our knowledge, no other study of router-related deadlock in interconnection networks has been performed at the level of detail presented here.

In the next section, we classify deadlocks through examples. Section 3 presents the experiments we performed and the results. Section 4 presents related work, and important findings are summarized in Section 5.

This research was supported by an NSF Research Initiation Award, grant ECS , and an NSF Career Award, grant ECS .

2 Deadlock Formation

Deadlocks in interconnection networks can occur as a result of cyclic resource dependencies formed when messages hold onto some resources (i.e., virtual channels) while waiting to acquire others. As a message progresses through a network, it acquires exclusive ownership of a virtual channel (VC) prior to each hop. When the header flit of a message blocks, it can be thought of as requesting the exclusive use of one of possibly many alternative VCs in order to progress to the next hop. A blocked message resumes once a new VC is acquired. As the tail of a message moves through the network, it releases previously acquired VCs that are no longer needed, so that they become available to other messages.
The exclusive ownership and resource wait-for conditions, along with the condition that messages are not preempted, make cyclic dependencies and deadlock possible.

2.1 Depicting Deadlocks

We use channel wait-for graphs (CWGs) [6] to model resource dependencies within interconnection networks. Although similar to dependency graphs used in previous work [, 7, 8], these graphs depict network state reflecting resource allocations and requests existing at a particular point in time, not the resource allocations allowed by the routing algorithm. Hence, in this context, CWGs depicting the entire network state are not necessarily connected.

Figures 1 through 4 show examples of messages being routed in k-ary n-cube wormhole networks, along with the corresponding CWGs. In the network illustrations (Figures 1a, 2a, 3a and 4a), the source and destination nodes of message m_i are labeled s_i and d_i, respectively. VC labeling in these figures is done only to facilitate explanation and is not intended to convey information regarding the relative positions of VCs within the network.

Figure 1. (a) A "single-cycle deadlock" for DOR with 1 VC. (b) The CWG contains a knot.

In the CWGs (Figures 1b, 2b, 3b and 4b), vertices represent VCs. Outgoing arc(s) at each vertex are labeled with the message which currently owns that VC. A path formed by a series of solid arcs with the same label gives the temporal order in which VCs were acquired, and continue to be owned, by a particular message. Blocked messages are represented by connecting the ends of such paths to one or more desired VCs using dashed arcs. At any vertex, the labels of incoming dashed arcs represent the group of messages that desire to use that VC at this instant in time. Only those portions of the network's CWGs useful for illustrative purposes are shown in these figures.

Figure 1a shows five messages being routed statically in dimension order within a torus-connected network with one VC. Three of these messages are blocked, while the other two have acquired all of the channels needed to reach their destinations. The first blocked message has acquired channels 1 and 2, and requires 3 to continue. Similarly, the second has acquired channels 3, 4, and 5, and requires 6 to continue; the third has acquired channels 6, 7, and 0, and requires 1 to continue. Thus, each of these blocked messages will wait indefinitely for one of the other messages in the group to release an owned VC.

Figure 1b shows the CWG for the scenario in Figure 1a. There is a single cycle in this graph consisting of vertices 0, 1, 2, 3, 4, 5, 6 and 7. Given the set of all resources involved in this cycle, R = {0, 1, 2, 3, 4, 5, 6, 7}, observe that the set of vertices that can be reached by each and every member of R is R itself.
This type of relationship, formed by vertices in one or more cycles, is referred to as a knot [9]. Assuming that the routing function is connected, a knot is a necessary and sufficient condition for deadlock [6].

2.2 Classifying Deadlocks

2.2.1 Single-Cycle Deadlocks

Deadlock can be characterized by its deadlock set, resource set, and knot cycle density. The deadlock depicted in Figure 1 is what we refer to as a single-cycle deadlock. In this example, the deadlock involves the three blocked messages in its deadlock set, occupies 8 channels in its resource set {0, 1, 2, 3, 4, 5, 6, 7}, and has a knot cycle density of one cycle (true of all single-cycle deadlocks).

Single-cycle deadlocks are more likely to occur in networks having minimal resources and/or highly restrictive routing options on available resources. In the above example (Figure 1), a torus network with one VC that allows only non-adaptive (static) dimension-ordered routing, the routing function returns at most a single channel option. This is reflected in the CWG by a single dashed outgoing arc at any vertex in Figure 1b (maximum fan-out of one). In such a network, a single cycle is sufficient to form a knot. However, for this to occur, a correlated resource dependency among multiple messages must form.

Figure 2. (a) A "single-cycle deadlock" for minimal adaptive routing with 1 VC. (b) The CWG contains a knot.

Single-cycle deadlocks are also possible in networks which use less restrictive routing (e.g., minimal adaptive routing with only one VC) when only one routing option is available to all messages comprising the deadlock set (e.g., due to faulty links or routing in the destination's dimension). An example is illustrated in Figures 2a and 2b. Here, each of the messages in the group has acquired VCs, has exhausted its routing adaptivity, and is therefore waiting to acquire the one channel needed to reach its destination. However, the required channels are already owned by members of this group of messages.
The CWG (Figure 2b) contains a single cycle, and the vertices in this cycle form a knot, R = {1, 3, 5, 7}. Hence, with a knot cycle density of one, this too is a single-cycle deadlock; its deadlock set contains 4 messages and its resource set includes 8 channels {0, 1, 2, 3, 4, 5, 6, 7}. This single-cycle deadlock requires not only that all of the messages in the deadlock set have exhausted their adaptivity, but also that they own all of the resources needed by other messages in the deadlock set. Therefore, an even higher degree of correlation of message resource dependency is required for this type of deadlock to occur.

In this example, one additional message has acquired VCs 8 and 9 and is waiting for a VC owned by a message which is involved in the deadlock. Although this message is not able to proceed until the deadlock is resolved, it is not considered to be in the deadlock set, as its resources do not meet the condition for participation in a knot as described previously. This type of message is referred to as a dependent message and is distinguished from those messages actually in the deadlock set. The usefulness of this distinction is evident when developing deadlock detection mechanisms for recovery-based routers. The detection mechanism must be careful not to incorrectly identify dependent messages as being among those properly in the deadlock set, as removing them from the network will not resolve the deadlock. Moreover, dependent messages may be transient in that they may be able to proceed using an alternate resource not owned by one of the messages in the deadlock set.

Figure 3. (a) A "multi-cycle deadlock" for minimal adaptive routing with 2 VCs. (b) The CWG contains a knot.

2.2.2 Multi-Cycle Deadlocks

Figures 3a and 3b depict the network and the CWG for a more complex example of a deadlock, one comprised of multiple resource dependency cycles. This network uses minimal adaptive routing and two VCs per physical channel. Once again, all messages (m1 ... m8) have exhausted their adaptivity and are blocked. Each message is waiting to acquire one of two VCs needed to continue routing, both of which are owned by other members of the group. There are multiple unique cycles in the CWG. The set of all vertices involved in this group of cycles, R = {1, 3, 5, 7, 9, 11, 13, 15}, meets the requirement for a knot.
This is an example of what is referred to as a multi-cycle deadlock; its deadlock set has 8 messages {m1, ..., m8}, its resource set has 16 VCs {0, 1, ..., 15}, and its knot cycle density is 4 cycles.

CWGs similar to Figure 3b, where there are multiple outgoing dashed arcs per blocked message (fan-out > 1), are indicative of networks which allow a greater degree of routing flexibility (e.g., provide multiple VCs per physical channel, allow adaptive routing, etc.). Given that the messages in this example have exhausted their adaptivity, the vertices with a fan-out of two in Figure 3b correspond to a routing relation that supplies two alternative resources for each of the blocked messages. Had messages blocked prior to exhausting their adaptivity, vertices with larger fan-out (i.e., 4) would exist in the graph. As can be seen from this example, the fan-out of vertices in the CWG, which is determined by routing adaptivity and the number of VCs per physical channel, greatly influences the number of unique cycles that can form. More importantly, increasing the routing flexibility exponentially increases the degree of correlation of resource dependency required for multiple cycles to form knots.

Figure 4. (a) A "cyclic non-deadlock" for minimal adaptive routing with 2 VCs. (b) The CWG does not contain a knot.

2.2.3 Cyclic Non-Deadlocks

A scenario in which multiple cycles exist but which does not result in deadlock (referred to as a cyclic non-deadlock) is depicted in Figures 4a and 4b. This is similar to the previous example except that one message's destination is changed, allowing it to acquire the required VCs on its way to its destination. There are 8 unique cycles in the CWG. Given the set of all vertices in this group of cycles, {1, 3, 5, 9, 11, 13, 15}, note that vertices 7 and 16 are reachable from members of this set, but the opposite does not hold.
This set (or any subset thereof) does not meet the conditions for a knot; therefore, there is no deadlock in this network. This is because the message that owns VC 7 may eventually reach its destination and subsequently release it, which will allow one of the two messages waiting for this channel to continue. Other messages will then be able to proceed in a similar fashion.

This example confirms the notion that cycles are necessary but not sufficient for deadlock, as was concluded by Duato [7]. Resource dependency graphs of deadlock avoidance algorithms based on Duato's framework may have cycles but will always have an escape resource to avoid deadlock (such as VC 7 in Figure 4b). The elimination of these cycles, as required by some avoidance-based routing schemes, is therefore overly restrictive. Similarly, eliminating cycles in a packet wait-for graph [10] to avoid deadlock is also overly restrictive: the packet wait-for graph for this example clearly contains cycles, yet no deadlock exists.

In summary, single-cycle deadlocks are possible in networks which have a single channel resource and limited adaptivity defined on that resource (due to static routing or exhausted adaptivity). Multi-cycle deadlocks involving highly correlated message resource dependencies are possible in networks using multiple resources and which allow greater routing adaptivity over those resources. It has been shown that the number of blocked messages (number of vertices which have outgoing dashed arcs) and the flexibility in routing (fan-out of these vertices) greatly influence the formation of cycles [11]. However, deadlock occurs only when a group of cycles forms a knot.

3 Deadlock Characterization

Our approach for precisely detecting deadlocks is based on a theoretical framework which defines a deadlock as a knot within a CWG [6]. We implement a deadlock detection algorithm that is able to identify knots within the CWG of an ongoing network simulation. The deadlock detection algorithm involves maintaining a CWG, detecting cycles within this graph, and identifying groups of cycles which form knots. It is implemented in a flit-level simulator called FlexSim (an extension of FlitSim 2.0).

All simulations are run for normalized loads up to full network capacity or until the network saturates with respect to the number of resource dependency cycles, generally well beyond the loads at which network performance saturates (shown in the figures by a vertical dashed line). Each simulation is run for 0,000 simulation cycles beyond steady state. Unless otherwise stated, all simulations are performed using uniform traffic, a 16-ary 2-cube with bidirectional channels, a fixed message size of 32 flits, an edge buffer depth of two flits, one injection and reception channel, and a channel selection policy which favors continuing routing in the current dimension over turning. Minimal true fully adaptive routing (TFAR) is used for adaptive routing and dimension-ordered routing (DOR) is used for static routing. Since no other restrictions are enforced, deadlocks are possible for both routing schemes. The deadlock detection algorithm is invoked every 50 simulation cycles.
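The knot test at the heart of such a detector can be sketched in a few lines. The following is a minimal illustration, not FlexSim's implementation: it checks the knot definition directly through reachability rather than by enumerating cycles, and the names `build_cwg`, `reachable` and `find_knots`, as well as the message representation, are assumptions made for this example.

```python
from collections import deque

def build_cwg(messages):
    """Build a CWG from (owned_vcs_in_acquisition_order, requested_vcs) pairs.

    Solid arcs link consecutive VCs owned by one message; dashed arcs link
    the last owned VC to every VC the blocked header is waiting for.
    """
    cwg = {}
    for owned, requested in messages:
        for a, b in zip(owned, owned[1:]):
            cwg.setdefault(a, []).append(b)                  # solid arc
        if owned:
            cwg.setdefault(owned[-1], []).extend(requested)  # dashed arcs
    return cwg

def reachable(cwg, start):
    """Vertices reachable from `start` by following one or more arcs."""
    seen = set()
    queue = deque(cwg.get(start, ()))
    while queue:
        v = queue.popleft()
        if v not in seen:
            seen.add(v)
            queue.extend(cwg.get(v, ()))
    return seen

def find_knots(cwg):
    """Return each knot: a set R whose every member reaches exactly R."""
    vertices = set(cwg) | {w for succ in cwg.values() for w in succ}
    reach = {v: reachable(cwg, v) for v in vertices}
    knots = []
    for v in vertices:
        r = reach[v]
        if v in r and all(reach[u] == r for u in r) and r not in knots:
            knots.append(r)
    return knots

# Three blocked messages forming one cycle over VCs 0..7 (as in Figure 1).
single_cycle = [([1, 2], [3]), ([3, 4, 5], [6]), ([6, 7, 0], [1])]
# A cycle between VCs 0 and 1, but VC 2 is held by a message that is not blocked.
with_escape = [([0], [1, 2]), ([1], [0]), ([2], [])]

print(find_knots(build_cwg(single_cycle)))  # one knot containing VCs 0..7
print(find_knots(build_cwg(with_escape)))   # no knot: a cyclic non-deadlock
```

This brute-force check costs one graph traversal per vertex; a detector invoked periodically on a large simulation would more plausibly compute strongly connected components once and report those with no outgoing arcs.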
Deadlocks are broken by removing a message in the deadlock set (flit-by-flit) from the network so as to synthesize a recovery procedure (as in the Disha scheme [5]). Deadlock frequency is presented as normalized deadlocks, the ratio of the number of deadlocks to the number of messages delivered. When no deadlocks exist, we instead use the total number of resource dependency cycles formed and the amount of congestion (number and percentage of blocked messages) to represent the conditions that could lead to deadlock formation. The size of the deadlock and resource sets and the knot cycle density are used to describe the size and complexity of deadlocks.

3.1 Effect of Physical Links on Deadlocks

In studying the effect of network links on deadlock formation, we measure the frequency of deadlocks in tori with uni- and bidirectional channels. We assume DOR with one VC per physical channel for both networks (all other parameters set to default values). Figures 5a and 5b show normalized deadlocks vs. load rate and deadlock set size vs. load rate for the two networks under uniform traffic. Normalized load rate is calculated based on total link bandwidth and average internode distance, which differ between the two networks. The figures show that the uni-torus leads to relatively more deadlock despite having generated less overall traffic.

Figure 5. (a) Normalized deadlocks vs. load rate. (b) Deadlock set size vs. load rate. (Legend: o unidirectional, + bidirectional.)

Below network saturation, there are 1 and 7 deadlocks for every 100 messages delivered (on average) in the bi- and unidirectional networks, respectively. For the two networks, no more than 4 (bi) and (uni) messages are involved in each deadlock below saturation loads. This indicates that unless messages experience deadlock more than once, up to % (bi) and 15% (uni) of all messages participate in deadlock.
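The participation bound in the last sentence is the product of the normalized deadlock rate and the largest deadlock set size. The sketch below makes that arithmetic explicit; the function names and the input values are illustrative assumptions, not measurements from this study.

```python
def normalized_deadlocks(deadlocks, messages_delivered):
    """Deadlocks per delivered message."""
    return deadlocks / messages_delivered

def max_participation(deadlocks, messages_delivered, max_deadlock_set_size):
    """Upper bound on the fraction of messages that take part in a deadlock.

    Assumes no message is involved in more than one deadlock, so each
    deadlock contributes at most `max_deadlock_set_size` distinct messages.
    """
    rate = normalized_deadlocks(deadlocks, messages_delivered)
    return rate * max_deadlock_set_size

# Hypothetical run: 5 deadlocks per 1000 delivered messages, at most 3 messages each.
print(max_participation(5, 1000, 3))
```

Once messages can be caught in several deadlocks before delivery, as happens deep in saturation, the product no longer bounds the fraction of distinct messages involved.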
Deep into saturation, deadlock frequency grows to 11% (bi) and 60% (uni) while the number of messages involved in deadlock converges to around 6 for both networks. From this, we can infer that at highly saturated load rates messages may be involved in multiple deadlocks prior to being delivered, particularly in the unidirectional network.

The deadlocks formed in both networks are of the single-cycle variety described in Section 2.2.1. The requirements and factors leading to deadlock for the two networks, however, are different, which helps to explain the disparity in deadlock frequency. For one, a bi-torus requires a minimum of messages per deadlock whereas only messages comprise the minimal deadlock set for a uni-torus. As confirmed by Figure 5b, the uni-torus has deadlocks involving fewer messages for all load rates up through deep saturation. Second, and more importantly, for uniform traffic in a torus with 16 nodes per dimension, each bi-link is used by 1% of the messages traveling in a particular direction within a given dimension whereas each uni-link is used by 50% of the messages in the network. This suggests that the highly correlated resource dependencies resulting from all network traffic having to travel in the same direction (and turn) to reach their respective destinations are a major contributor to deadlock frequency.

Our results show that, as expected, adding routing resources (e.g., bidirectional physical links) reduces resource contention such that the correlated resource dependencies required for deadlock are less likely to form. Although bidirectionality significantly reduces deadlock frequency, it does not by itself reduce the likelihood of deadlock formation to sufficiently low levels. However, bidirectionality may be combined with other techniques (following sections) to reduce deadlock frequency to well within acceptable levels.
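The difference in average internode distance that enters the load normalization can be computed by direct enumeration. The sketch below assumes uniform traffic over all destinations (the source node included, for simplicity) and a commonly used capacity estimate of channels per node divided by average distance; the text does not state the exact normalization used, so the function names and the capacity formula are assumptions.

```python
def ring_distance(src, dst, k, bidirectional):
    """Minimal hop count between two positions on a k-node ring."""
    forward = (dst - src) % k
    return min(forward, k - forward) if bidirectional else forward

def average_distance(k, n, bidirectional):
    """Mean minimal hop count in a k-ary n-cube under uniform traffic.

    The torus is vertex-symmetric and its dimensions are independent, so it
    is enough to average one ring from node 0 and multiply by n.
    """
    per_dim = sum(ring_distance(0, d, k, bidirectional) for d in range(k)) / k
    return n * per_dim

def capacity_per_node(k, n, bidirectional):
    """Estimated sustainable injection rate in flits per node per cycle."""
    channels_per_node = 2 * n if bidirectional else n
    return channels_per_node / average_distance(k, n, bidirectional)

for bidir in (True, False):
    print(bidir, average_distance(16, 2, bidir), capacity_per_node(16, 2, bidir))
```

For the 16-ary 2-cube this gives an average distance of 8 hops with bidirectional links and 15 hops with unidirectional links, so each unidirectional message holds channels over almost twice as many hops while the network offers half as many channels per node.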
3.2 Effect of Adaptivity on Deadlocks

In studying the effect of adaptivity on deadlock formation, we measure the frequency of deadlocks and cycles in tori using DOR and TFAR. To focus on the effects of adaptivity alone, we again use a single VC per physical channel for both algorithms. Figures 6a and 6b show the normalized deadlocks and cycles vs. load rate and the deadlock and resource set size vs. load rate for the two algorithms under uniform traffic.

Figure 6. (a) Normalized deadlocks and cycles vs. load rate. (b) Deadlock and resource set size vs. load rate. (Legend, panel a: * TFAR cycles, o TFAR deadlocks, + DOR cycles and deadlocks. Panel b: * TFAR resource set, o DOR resource set, x TFAR deadlock set, + DOR deadlock set.)

Figure 7. (a) Normalized deadlocks vs. load rate. (b) Number of cycles vs. percent of messages blocked. (Legend, panel a: * TFAR1, + DOR1, o DOR2.)

DOR allows only single-cycle deadlocks to form (as in Figure 1), so one curve can represent both cycle and deadlock information. In contrast, TFAR allows cyclic non-deadlocks (similar to Figure 4). Since many more cycles can exist than there are deadlocks, two different curves are used to convey cycle and deadlock formation.

Our results show that TFAR suffers no deadlocks below network saturation, 1 deadlock per 100 messages delivered at saturation, and about the same number of deadlocks as messages delivered in deep saturation. The ratio of deadlocks to messages delivered for DOR is even smaller prior to saturation (less than 1 per 1000 messages delivered). This rate gradually increases to 1 deadlock for every 10 messages delivered in deep saturation. In terms of the actual number of deadlocks (not normalized to throughput), DOR suffers more than TFAR by as much as a factor of 6. Interestingly, DOR has higher sustained throughput than TFAR despite having a larger number of deadlocks. This explains the discrepancy between actual deadlocks and normalized deadlocks. It is also observed that the performance of TFAR is highly sensitive to just a few deadlocks while the performance of DOR remains relatively unaffected even as the number of deadlocks grows.
The sizes of the deadlock and resource sets in DOR are inherently limited by the single-cycle deadlocks which form. Given that deadlocks are broken immediately upon detection, the effects of deadlocks in DOR are local, isolated to a given row or column within the network. The relatively simpler correlation of message dependency required for these deadlocks makes them more likely but, at the same time, less severe. In contrast, TFAR can lead to large multi-cycle deadlocks which have a more global effect upon the network. Hence, the higher degree of correlation of message resource dependency required for these deadlocks makes them less likely but more severe.

The results shown in Figure 6b confirm our hypothesis. Large multi-cycle deadlocks appear in TFAR with deadlock sets and resource sets that are 5 to 7 and 7 to 10 times larger than those of DOR, respectively. What's more, the knot cycle densities for TFAR deadlocks are greater by a factor of 10 to 0. Some of the larger deadlocks observed in TFAR involve as many as 5% of the messages within the network, occupy more than 40% of the channels, and involve hundreds of cycles, thus confirming their global nature. As a result, the residual effects of such large deadlocks are longer-term and widespread; just a few can greatly degrade performance. This is in contrast to the deadlocks in DOR, which have more localized, shorter-term effects, thereby making DOR's performance less affected by a large number of deadlocks.

The cyclic non-deadlocks in TFAR may also degrade performance. Duato [] has described situations where messages block cyclically faster than they can be drained and remain blocked for extended periods, leading to large message latencies. The large number of cycles we have observed even in the absence of deadlocks suggests that this may be occurring.
Hence, low throughput resulting from these cyclic non-deadlocks contributes to the higher normalized deadlock frequency for TFAR even though fewer actual deadlocks form. Given that TFAR with a single VC makes harmful deadlocks and cyclic non-deadlocks probable, recovery-based adaptive routing would benefit from additional VCs. Next we examine the effect of additional VCs on reducing the likelihood of deadlock formation.

3.3 Effect of Virtual Channels on Deadlocks

In investigating the effects of traffic flow on deadlock formation using multiple VCs per physical channel, we measure the deadlock frequency of DOR and TFAR in torus networks which allow the unrestricted use of 2, 3, and 4 VCs (all other parameters default). For experiments in which deadlock did not occur, we use network congestion and the resource dependency cycles formed as a measure of the likelihood of possible deadlock. Figures 7a and 7b show normalized deadlocks vs. load rate and number of cycles vs. percentage of blocked messages under uniform traffic. In Figure 7b, each curve is annotated with the load rate at which cycles first appear (first point) and the load rate at which the highest number of cycles was found (last point).

In Figure 7a, DOR with two VCs (DOR2) does not lead to deadlock prior to saturation; the 2nd VC more than doubles the load at which deadlocks begin to appear when using only 1 VC. At its saturation load rate, approximately 1 deadlock occurs for every 100 messages delivered. Deadlock frequency increases to 1 for every 5 messages delivered in deep saturation. Beyond saturation, the actual number of deadlocks for DOR1 and DOR2 is roughly the same. However, a larger reduction in throughput at loads after saturation makes the normalized deadlock measure slightly higher for DOR2 (as shown in the figure). With 3 or more VCs, DOR suffers no deadlocks. In contrast, 2 VCs are sufficient to discourage deadlocks in TFAR (DOR3, DOR4, TFAR2, TFAR3 and TFAR4 are not plotted as no deadlocks occurred).

A number of factors contribute to the elimination of deadlocks when additional VCs are introduced. The new VCs are resources that become available to messages which would otherwise block. The likelihood of the formation of cycles and knots decreases when fewer messages are blocked within the network. The new VCs also provide a higher number of routing options for those messages which still block within the network. As was illustrated in Section 2, additional routing options increase the deadlock set size, resource set size, and knot cycle density needed for deadlock, thereby requiring a higher degree of correlation of message dependency in order for deadlock to form. This greatly diminishes the likelihood of deadlocks. Note that TFAR amplifies the effects of additional VCs since adaptivity makes new routing options available in each dimension. This explains why TFAR is able to eliminate all deadlocks with a smaller number of VCs (two instead of three for DOR). The simpler correlation of message dependencies required for deadlock in DOR, combined with restrictions in the use of the new resources, makes 2 VCs insufficient to eliminate deadlock in DOR.

Figure 7b indicates that adding VCs reduces congestion and allows higher loads to be applied before a large number of cyclic non-deadlocks form. TFAR1 results in increasingly higher congestion and a larger number of cycles starting at saturation. TFAR2 eliminates the cycles encountered at low load rates in TFAR1, and substantially reduces the overall congestion (from over 70% of the messages being blocked down to as few as 1%). As TFAR2 reaches saturation, its congestion increases while the number of cycles grows rapidly. The third and the fourth VCs for TFAR and DOR show a similar effect on reducing congestion and eliminating cycles at loads prior to saturation, leading to rapid growth in cycles once saturation is reached.
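The way VCs and adaptivity multiply routing options can be made concrete with a small model of CWG fan-out. This is a simplified sketch under stated assumptions: every VC of a candidate physical channel may be requested (unrestricted VC use), minimal adaptive routing offers one physical channel per dimension in which the message still has hops to make, and DOR offers exactly one physical channel. The function name `fan_out` is hypothetical.

```python
def fan_out(productive_dims, vcs_per_channel, adaptive):
    """Number of VCs a blocked header may be waiting for (dashed arcs in the CWG)."""
    if productive_dims == 0:
        return 0  # the message has arrived; it requests no further VC
    physical_choices = productive_dims if adaptive else 1
    return physical_choices * vcs_per_channel

print(fan_out(1, 1, adaptive=True))   # adaptivity exhausted, 1 VC: a single option
print(fan_out(1, 2, adaptive=True))   # adaptivity exhausted, 2 VCs: two options
print(fan_out(2, 2, adaptive=True))   # two productive dimensions, 2 VCs: four options
print(fan_out(2, 3, adaptive=False))  # DOR with 3 VCs: three options
```

Under this model each added VC raises the fan-out of a TFAR header by one per productive dimension, but by only one in total for DOR, which is consistent with TFAR needing fewer VCs before knots stop forming.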
In summary, we observe that additional VCs are able to reduce the number of messages which block within the network, as expected. This, along with the higher degree of correlation of message dependencies required for deadlock in the presence of a larger number of routing options due to the additional VCs, greatly diminishes the likelihood of deadlock. We find the extent to which deadlocks are eliminated with so few VCs per physical channel to be surprising. However, as networks with multiple VCs reach saturation, a higher number of blocked messages along with the larger number of routing options increases the number of cycles exponentially. The fact that no knots exist even in the presence of hundreds of thousands of cycles suggests the formation of extremely large cyclic non-deadlocks at saturation loads. Operating below saturation avoids this performance degradation.

3.4 Effect of Buffer Depth on Deadlocks

We now investigate the effects of increasing the channel buffer size on deadlock formation. We measure the frequency of deadlocks in bidirectional tori with channel buffer depths of 2, 4, 6, 8, 16, and 32 flits. TFAR with one VC per physical channel is used. Using a buffer of the same depth as the message length corresponds to virtual cut-through switching [1]. Other buffer depths correspond to wormhole or buffered wormhole switching [1]. Figures 8a and 8b show normalized deadlocks vs. load rate and normalized deadlocks vs. the number of messages in the network.

Figure 8. (a) Normalized deadlocks vs. load rate. (b) Normalized deadlocks vs. messages in the network. (Legend: x buffer=2, . buffer=4, + buffer=6, ... buffer=8, o buffer=16, * buffer=32.)

As shown in Figure 8a, networks with buffer depths of 2, 4 and 6 flits all saturate at a similar load rate.
After saturation, these networks lead to a large number of deadlocks (15 to 5 deadlocks for every 10 messages delivered). The network with a buffer depth of 8 flits saturates at a 5% higher load rate, and leads to a similar deadlock frequency for load rates beyond saturation. Networks with buffer depths of 16 and 32 flits saturate at a 75% higher load than the smallest buffers, reflecting the larger capacity of these networks. A buffer depth of 16 flits leads to the highest number of deadlocks (15 to 5 deadlocks for every 10 messages delivered) while the virtual cut-through network (buffer depth of 32 flits) leads to the smallest number of deadlocks (1 deadlock for every messages delivered) at load rates beyond saturation.

The increase in saturation load as the buffer depth is increased confirms that each message requires the simultaneous use of fewer channels due to the higher capacity. This allows for message compaction. Below saturation, compaction leads to less resource contention and allows more messages to be serviced by the network. The similar saturation load for buffer depths of 2, 4, and 6 flits (6%, 12%, and 18% of the message size) indicates that the amount of compaction occurring for these buffer sizes is alike, and suggests that messages have blocked close to their source nodes, thereby neutralizing the effect of compaction. Increases in saturation loads are greater for larger increments in buffer sizes, suggesting effective compaction for these buffer sizes (8, 16 and 32 flits, which can accommodate 25%, 50%, and 100% of a message, respectively).

When normalized with respect to the number of messages in the network (Figure 8b), the networks with smaller capacity buffers lead to a substantially higher number of deadlocks. This is explained by the fact that in these networks, each message requires the simultaneous use of a larger number of resources, thereby leading to higher resource contention.
Although higher capacity buffers allow more messages to enter the network and, potentially, a larger number of messages to block at saturation, the degree of correlation of message dependency required for deadlocks also increases due to the message compaction, thereby making multi-cycle deadlocks with large deadlock and resource sets less probable.
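Deadlocks are characterized in this work as knots in a resource dependency graph, while cycles that do not form a knot are cyclic non-deadlocks. A minimal sketch of that distinction follows, using the standard graph definition of a knot: a set of vertices in which every member can reach exactly that set and nothing else. The vertex names, the adjacency-dict encoding, and the function names are illustrative assumptions, not the detection code of the authors' simulator.

```python
def reachable(graph, start):
    """Vertices reachable from start by following one or more edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen


def find_knots(graph):
    """Return every knot: a set K whose members each reach exactly K."""
    vertices = set(graph) | {w for succ in graph.values() for w in succ}
    reach = {v: reachable(graph, v) for v in vertices}
    knots = []
    for v in vertices:
        candidate = reach[v]
        # v must lie on a cycle, and no member may have an exit from the set.
        if v in candidate and all(reach[u] == candidate for u in candidate):
            if candidate not in knots:
                knots.append(candidate)
    return knots


# Hypothetical example: a cycle a -> b -> c -> a where b can also wait on a
# free resource d. The cycle exists, but d offers a way out, so no knot forms.
cyclic_non_deadlock = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": []}

# Same cycle without the alternative: every vertex reaches only {a, b, c}.
deadlock = {"a": ["b"], "b": ["c"], "c": ["a"]}

if __name__ == "__main__":
    print(find_knots(cyclic_non_deadlock))
    print(find_knots(deadlock))
```

The first graph illustrates why extra routing options (additional VCs, deeper buffers, higher node degree) raise the correlation needed for deadlock: a single available alternative anywhere in the reachable set is enough to prevent a knot.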

3.5 Effect of Network Node Degree on Deadlocks

To investigate the effects of node degree on the frequency of deadlocks, we measured deadlock frequency in a 16-ary 2-cube (2D) and a 4-ary 4-cube (4D) torus-connected network, both of which use TFAR routing with one VC. Load rate was normalized based on the total link bandwidth and average internode distance of the two networks. The 4D network resulted in relatively fewer deadlocks at loads prior to saturation (less than 1% of the deadlocks which occurred for the 2D network). Also, the 4D network achieved higher performance well beyond the saturation load of the 2D network, thereby leading to an even larger gap in the normalized deadlock frequency. The two main factors contributing to this are the additional network resources (physical channels) and the increased routing freedom (dimensions). As in the other experiments, additional links reduce resource contention, and the high node degree, along with adaptive routing, increases the correlation of message dependencies required for knots to form. The few deadlocks that did form in the 4D network were all single-cycle deadlocks, which suggests that the few messages in the deadlock sets were limited to restricted routing due to exhaustion of routing adaptivity towards the destination.

3.6 Effect of Non-Uniform Traffic on Deadlocks

The deadlock frequencies for non-uniform traffic patterns (bit-reversal, matrix-transpose, perfect-shuffle, and hot-spot) were similar to (in most cases, within 10% of) the deadlock frequencies for the uniform traffic patterns in the experiments described above. The characteristics of the deadlocks (deadlock set size, resource set size, and knot cycle density) were similar as well. The only exception to this was for DOR. Single-cycle deadlocks in DOR (as shown in Figure 1) require circular overlap of messages.
The source and destination pairs designated by some of these non-uniform traffic patterns are such that this overlap is not possible.

4 Related Work

Deadlock approximation schemes proposed previously [4, 5] have provided little insight into the frequency of true deadlocks. In contrast, our work presents frequencies of actual deadlocks as well as their characteristics as they relate to key network parameters. CWGs and similar constructs have previously been used to statically represent connections allowed by deadlock-avoidance based algorithms [, 8]. In contrast, we use these graphs to model dynamic resource allocation in unrestricted routing, and to precisely define and detect deadlocks. A summary of work characterizing deadlocks as knots in generalized resource graphs intended to describe deadlocks in operating systems is presented in [9]. Our work is a specialized application of this framework, intended for depicting deadlocks in interconnection networks.

5 Conclusions and Future Work

We characterize the causal effects of various network parameters on blocked messages, resource dependency cycles, and deadlocks to gain a greater understanding of the viability of deadlock recovery-based routing. Through simulation and analysis, we empirically show how deadlock probability is influenced by these factors when routing restrictions are not enforced so as to avoid deadlock. Our results for k-ary n-cube networks with n confirm that deadlock probability is less in bidirectional networks than in unidirectional networks, and that it decreases as node degree and adaptivity are increased. Localized deadlocks of limited harmful effect are more probable with dimension ordered routing whereas globally harmful deadlocks are probable with true fully adaptive routing. Deadlock probability is less in virtual cut-through networks than in buffered wormhole and wormhole networks, as expected.
Interestingly, however, deadlocks are highly improbable (none were detected) if as few as VCs are used with dimension ordered routing and only VCs are used with true fully adaptive routing in bidirectional wormhole networks. These results lead us to conclude that recovery-based routing is viable since the unrestricted use of only a few virtual channels is sufficient to make deadlock highly improbable. Providing greater routing flexibility and buffer capacity through increased routing adaptivity, number of virtual and physical channels (bidirectional), and buffer depth greatly increases the complexity of correlated resource dependencies required for deadlock to occur.

We will continue this characterization study by examining the effect of irregular network topology, hybrid message length, misrouting, etc., on deadlock. We also plan to characterize deadlock formation under hybrid non-uniform traffic loads using program-driven simulations.

References

[1] Andrew A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. In Proc. of the 19th Symposium on Computer Architecture, pp 68-77, May 199.

[2] L. Ni and C. Glass. The Turn Model for Adaptive Routing. In Proc. of the 19th International Symposium on Computer Architecture, IEEE Computer Society, pages 78-87, May 199.

[3] J. Duato. A New Theory of Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 4(1):10-11, December 199.

[4] J. Kim, Z. Liu, and A. Chien. Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing. In Proc. of the 1st International Symposium on Computer Architecture, pp 89-00, April

[5] Anjan K.V. and Timothy M. Pinkston. An Efficient, Fully Adaptive Deadlock Recovery Scheme: Disha. In Proc. of the nd International Symposium on Computer Architecture, pp 01-10, June

[6] Sugath Warnakulasuriya and Timothy Mark Pinkston. Implementation of Deadlock Detection in a Simulated Interconnection Network Environment. Technical Report CENG 97-01, University of Southern California, January

[7] J. Duato. A Necessary and Sufficient Condition for Deadlock-free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 6(10): , October

[8] Loren Schwiebert and D.N. Jayasimha. A Necessary and Sufficient Condition for Deadlock-Free Wormhole Routing. Journal of Parallel and Distributed Computing, (1996).

[9] Mamoru Maekawa, Arthur E. Oldehoft, and Rodney R. Oldehoft. Operating Systems: Advanced Concepts. Benjamin Cummings,

[10] William J. Dally and Hiromichi Aoki. Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels. IEEE Transactions on Parallel Distributed Systems, Vol. 4, No. 4, April, 199.

[11] Timothy Mark Pinkston and Sugath Warnakulasuriya. On Deadlock in Interconnection Networks. To appear in Proc. of the 4th International Symposium on Computer Architecture, June

[12] Parviz Kermani and Leonard Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, pages 67-86,

[13] C.B. Stunkle et al. The SP high-performance switch. IBM Systems Journal, vol. 4, no., pp , 1995.
