A New Adaptive Hardware Tree-Based Multicast Routing in K-Ary N-Cubes


IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 7, JULY 2001

A New Adaptive Hardware Tree-Based Multicast Routing in K-Ary N-Cubes

Dianne R. Kumar, Member, IEEE, Walid A. Najjar, and Pradip K. Srimani, Fellow, IEEE

Abstract: Multicast communication is a key issue in almost all applications that run on any parallel architecture and, hence, efficient implementation of multicast is critical to the performance of multiprocessor machines. Multicast is implemented in parallel architectures either via software or via hardware. Software-based approaches for implementing multicast can result in high message latencies, while hardware-based schemes can greatly improve performance. Deadlock freedom in multicast communication is much more difficult to achieve, resulting in more involved routing algorithms and higher startup delays. Hardware tree-based algorithms do not require these high startup delays, but do suffer from high probabilities of message blocking, leading to poor performance. In this paper, we propose a new hardware tree-based routing algorithm (HTA) for multicast communication under virtual cut-through switching in k-ary n-cubes that outperforms existing software and hardware path-based multicast routing schemes. Simulation results are compared against several commonly used multicast routing algorithms and show that HTA performs extremely well under many different conditions.

Index Terms: Multicast communication, path-based routing, tree-based routing, deterministic routing, adaptive routing, virtual cut-through switching.

1 INTRODUCTION

Efficient routing of multicast messages is extremely important to the performance of multiprocessors. Since most current multiprocessors support only unicast communication, multicast is implemented as multiple unicast messages, resulting in high message latencies. Hardware-based multicast support can greatly improve performance.
Among the proposed hardware-based schemes for multicast are path-based and tree-based routing algorithms ([3], [22], [9], [14], [19], [2], [12], [5], [4]). Each of these schemes uses multidestination messages, which are messages that have more than one header flit. The main difference between the two types of multicast schemes lies in how the header flits in a multidestination message are routed. In tree-based routing, a multidestination message is routed at each intermediate node along its path using all header flits in the message. In path-based routing, a multidestination message is routed through the network using only the first header flit in the message. Once the first header flit reaches its destination and is absorbed by the node, the next header flit is routed. In tree-based routing, no ordering of the destinations is required before the message is injected into the network and the shortest paths between the source node and all destinations are always taken. However, this type of routing suffers from a high probability of message blocking at intermediate nodes, leading to higher deadlock probability. Path-based routing does not suffer from this high probability of message blocking. However, it does require destinations to be ordered at the source and does not always provide the shortest path between the source node and each destination node.

D.R. Kumar is with the Department of Computer Science and Engineering, University of Colorado at Denver, Denver, CO. E-mail: dkumar@carbon.cudenver.edu.
W.A. Najjar is with the Department of Computer Science and Engineering, University of California Riverside, Riverside, CA. E-mail: najjar@cs.ucr.edu.
P.K. Srimani is with the Department of Computer Science, Clemson University, Clemson, SC. E-mail: srimani@cs.clemson.edu.
Manuscript received 1 Sept. 2000; accepted 20 Mar. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number.
In this paper, we propose a hardware tree-based routing algorithm (HTA) which attempts to reduce the probability of message blocking, resulting in low message latencies. The probability of a message blocking is kept low by using virtual cut-through switching, independent virtual channels (VCs) for unicast and multicast messages, several VCs per physical channel (PC), an efficient deadlock detection and recovery scheme, and delayed header flit routing. HTA is a fully adaptive and minimal tree-based routing scheme for multicast routing. The scheme is fully compatible with existing unicast routing schemes. Multicast routing is briefly described in Section 2. Section 3 describes our proposed HTA scheme, including its routing scheme, deadlock recovery mechanism, and router implementation. Experimental deadlock probabilities, as well as a comparison of HTA, Software Multicast, and Column-Path algorithms, are reported in Section 4. Section 5 discusses related work and concluding remarks are given in Section 6.

/01/$10.00 © 2001 IEEE

2 MULTICAST ROUTING ALGORITHMS

The interconnection network model considered in this study is a k-ary n-cube using input buffering and virtual cut-through switching [8]. Message advancement is similar to wormhole routing [17], except that the body of a message can continue to progress even while the message

head is blocked, and the entire message can be buffered at a single node. Note that a header flit can progress to the next node only if the whole message can fit in the destination buffer. For simplicity, all message lengths are equal.

2.1 Multicast Implementation in Software

Software implementations of multicast are currently in wide use since most current systems support only one-to-one communication. In software implementations of multicast, one or more unicast messages must be sent. The simplest implementation of multicast using unicast messages is to send one unicast message for every destination address in the multicast message. This scheme is referred to as Software Multicast in Section 4.

2.2 Multicast Implementation in Hardware

Among the proposed hardware-based schemes for multicast are path-based and tree-based routing algorithms.

2.2.1 Path-Based Routing

In path-based multicast, a multidestination message is routed through the network using only the first header flit in the message. Once this destination is reached, the first header flit is removed by the router and the next header flit is used to continue routing the message to the next destination. The data flits are simultaneously forwarded both to the destination node and to the input queue required for the next header flit. This continues until all destinations are reached and the message is completely consumed at the final destination node. To reduce the path length of a multicast message, the destination node set is divided into disjoint subsets. Each disjoint destination subset is then placed in a separate submulticast message and sent along a separate multicast path. By appropriately ordering the destinations within each set or subset at the source node, the path taken can be shortened and messages are routed more efficiently.
In path-based routing schemes, the probability of message blocking is low since at most two channels are requested per message (one regular channel and/or one sink channel if a message has reached a destination). However, path-based routing does require an ordered list of destination addresses for each copy of a message before it is injected into the network. This destination ordering can be computed at compile time if the destinations are statically determined. The subpath between the source node and one of the destination nodes in a multicast path is often not the shortest path. Some path-based routing schemes include Dual-Path [11], Multipath [11], Column-Path [3], [2], and Hierarchical Leader-Based Scheme (HL) [18]. The Dual-Path and Multipath algorithms have the disadvantage of being incompatible with commonly used unicast routing algorithms, such as the e-cube routing algorithm, and therefore will not be used in the simulations here.

2.2.2 Tree-Based Routing

In tree-based multicast routing, a multidestination message is routed at each intermediate node along its path using all the header flits within it. The multidestination message is routed along a path common to all header flits in the message for as far as possible. The header flits are then routed and moved onto different channels, each headed for a unique set of destination nodes. The data flits are simultaneously forwarded to each of the channels already allocated to the header flits. This branching continues as necessary until all destination nodes have been reached. Tree-based routing has the advantage that no ordering of the destinations is required before the message is injected into the network. The shortest path between the source node and all destination nodes is always taken. However, tree-based routing has been shown to be prone to large blocking probabilities at intermediate nodes, resulting in poor overall performance [10], [5], [4].
The probability of blocking is much greater than that for path-based routing schemes because all branch channels must be available for the whole multidestination message to continue. Because of this higher probability of message blocking, tree-based routing algorithms suffer from higher multicast message latency. Some tree-based algorithms include Double-Channel XY [10], Tree-Based Multicast with Branch Pruning [13], Resumable Multicast [5], [4], Restricted Branch Multicast [5], [4], and Quad-Branch Multicast (QBM) [24]. The Double-Channel XY algorithm requires double channels for deadlock freedom and the number of channels between every pair of nodes grows exponentially with the number of mesh dimensions. In addition, this algorithm has been shown to perform worse than path-based algorithms for wormhole switching. Tree-Based Multicast with Branch Pruning does perform well. However, the data size for each message must be very small. The Resumable Multicast and Restricted Branch Multicast algorithms do not perform well unless the number of fanouts at each intermediate node is reduced to two (which results in an algorithm similar to path-based routing). QBM conforms to double-XY routing and is more suitable for bulk multicasts.

3 HARDWARE TREE-BASED MULTICAST ROUTING ALGORITHM (HTA)

HTA is a routing scheme that combines two distinct routing algorithms, one for unicast communication and one for multicast communication. The well-known deadlock-free routing algorithm proposed in [6], [1] is used for unicast messages (briefly explained in Section 3.1). Multicast messages use the fully adaptive, tree-based routing algorithm explained in Sections 3.1 through 3.4. Both routing algorithms are implemented within the same network and each message is assigned to the appropriate routing algorithm when it is injected into the network. Although the algorithm proposed in [6], [1] is used here for the unicast routing in HTA, many existing unicast routing schemes are fully compatible with HTA.
The main characteristics of HTA are:

• Virtual cut-through switching is used with distinct virtual paths for unicast and multicast messages, each path using three VCs per dimension (total of six VCs per dimension).

• Unicast messages are routed using the deadlock-free routing algorithm proposed in [6], [1] (see Section 3.1).
• Multicast messages are routed using tree-based, fully adaptive routing along with a deadlock detection and recovery scheme (see Sections 3.1 and 3.2) and delayed header flit routing (see Section 3.1).

Each message is composed of all header flits followed by all data flits. Each header flit holds one destination address and destinations do not need to be ordered within a multicast message.

3.1 The Routing Scheme

Our proposed HTA scheme consists of two separate routing algorithms for unicast and multicast messages, which are described below.

3.1.1 Unicast Communication Routing Algorithm

The deadlock-free routing algorithm proposed in [6], [1] is used for unicast communication in HTA and is an adaptive routing algorithm based on dimension-order routing. In this adaptive routing scheme, a message is routed on any adaptive channel until it is blocked. Once blocked, a message is routed using dimension-order routing if possible. A message may return to the adaptive channels in subsequent routing decisions if the adaptive channels are available. When a message is routed using dimension-order routing, it is routed along decreasing dimensions, with a dimension decrease occurring only when zero hops remain in all higher dimensions. Because an order is assigned to the network dimensions, no cycle exists in the channel-dependency graph and the algorithm is deadlock-free. A minimum of three VCs per dimension is required for deadlock-free routing in k-ary n-cubes. Two VCs per dimension are used for dimension-order routing and all remaining VCs are used for adaptive routing.

3.1.2 Multicast Communication Routing Algorithm

Multicast messages are routed through the network using a tree-based, delayed header flit routing algorithm along with a deadlock detection and recovery scheme.
The first header flit of a multicast message is routed to any free channel using the following priority scheme: The message first requests any free channel in the dimension in which it has the greatest distance left to travel. If more than one dimension has the same distance left to travel, a dimension is randomly selected. If there are no free channels within the selected dimension, then any free channel in the dimension with the next furthest distance left to travel is requested. This type of requesting continues until a channel has been assigned to the header flit of this message or until no free channels have been found. If no free channels are found, the header flit at the top of the queue blocks. No other header flits can be routed until this header flit has been routed. After the header flit is routed at the current node, it is then moved to the neighboring node's queue. Because delayed header flit routing is used (explained in Section 3.2), the header flit just routed remains at the neighboring queue until all remaining header flits in this multicast message have been routed at the current node. After the first header flit is routed, all remaining header flits are routed in the same manner as the first header flit, with one exception. When each of the remaining header flits reaches the top of the queue, it is first routed (if possible) to any channel already allocated to this multicast message by any of the preceding header flits that have already been routed. If this header flit cannot be routed in any of the previously routed dimensions, then it is routed using the priority scheme described above for the first flit in the message. By trying to route the remaining header flits to already allocated channels for each multicast message, extra channels are only assigned to the multicast message when necessary. 
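The channel selection just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' router logic: the function and parameter names (`remaining_hops`, `route_header_flit`, `free`, `allocated`, `SINK`) are hypothetical, the network is assumed to be a unidirectional torus so that the distance in a dimension is `(dst - cur) mod k`, VCs are abstracted to "a free multicast channel exists in this dimension", and the choice among several already allocated channels (greatest remaining distance) is an assumption, since the text only says "any".

```python
import random

SINK = -1  # hypothetical marker: the flit has arrived and needs a sink channel


def remaining_hops(cur, dst, k):
    """Hops left in each dimension, assuming a unidirectional k-ary n-cube."""
    return [(d - c) % k for c, d in zip(cur, dst)]


def route_header_flit(cur, dst, k, free, allocated, rng=random):
    """Choose an output dimension for one header flit.

    free      -- dimensions that currently offer a free multicast channel
    allocated -- dimensions already allocated to this message at this node
                 (updated in place when a new dimension is granted)
    Returns a dimension index, SINK on arrival, or None if the flit blocks.
    """
    hops = remaining_hops(cur, dst, k)
    useful = [d for d, h in enumerate(hops) if h > 0]  # minimal routing only
    if not useful:
        return SINK

    # Later header flits first try channels the message already holds.
    # Assumption: among several, prefer the greatest remaining distance.
    reuse = [d for d in useful if d in allocated]
    if reuse:
        return max(reuse, key=lambda d: hops[d])

    # Otherwise request the free dimension with the greatest remaining
    # distance, breaking ties at random.
    candidates = [d for d in useful if d in free]
    if not candidates:
        return None  # no free channel: the flit at the head of the queue blocks
    best = max(hops[d] for d in candidates)
    choice = rng.choice([d for d in candidates if hops[d] == best])
    allocated.add(choice)
    return choice
```

For the first header flit of a message, `allocated` is simply an empty set, so the reuse step is skipped and the priority scheme applies directly.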
Reusing already allocated channels keeps other channels available for other multicast messages in the network and reduces the probability of blocking, since fewer channels are assigned per node for each multicast message. Once all header flits of a multicast message are routed, the data flits of the message are moved simultaneously to all channels allocated to it. HTA allows full adaptivity for multicast messages since there is no channel routing restriction. To deal with potentially deadlocked situations, the deadlock detection and recovery scheme described in Section 3.3 is used. The schematic of the HTA routing algorithm is shown in Fig. 1; the pseudocode is provided in the Appendix.

3.2 Header Flit Routing

The scheme most commonly used for routing header flits [10], [13], [4] is referred to here as immediate header flit routing. To increase performance, a new type of scheme is proposed, called delayed header flit routing.

3.2.1 Immediate Header Flit Routing

When a header flit is routed at the current node and moved to a neighboring queue, it is routed at this neighboring node without waiting for the remaining header flits in the message to be routed. Fig. 2 shows an example of immediate header flit routing. In this figure, header flit A:2 is blocked while header flit A:1 continues to be routed, holding VCs and causing message B to block.

3.2.2 Delayed Header Flit Routing

HTA uses delayed header flit routing to lower the probability of messages blocking and to increase performance. In delayed header flit routing, a header flit at a neighboring node is prevented from being routed until all header flits at the current node have been routed. Because, in tree-based routing, the remaining header flits may not be immediately routed, it is more advantageous to keep all header flits within close proximity of one another using delayed header flit routing.
This close proximity prevents header flits from being assigned to queues at downstream nodes before all flits in the message can use them. This keeps the downstream queues free so that they are available for other messages in the network that can use them immediately. This scheme only requires a small additional amount of control logic to detect when all

header flits at the current node have been routed and one extra control line per VC is used to notify the header flits at the neighboring nodes when routing can continue. Fig. 3 shows an example of delayed header flit routing. In this figure, header flit A:2 is blocked while header flit A:1 waits at the neighbor node until the last header flit in this message (flit A:2) is routed. Message B can now be routed. As the number of destinations grows, delayed header flit routing becomes increasingly important in reducing the probability of message blocking.

Fig. 1. HTA Routing Algorithm for a multicast message.

3.3 Deadlock Detection and Recovery Mechanism

In HTA, each node has a dedicated holding queue, called the deadlock queue. If a header flit currently under consideration cannot be routed to a channel within a predetermined amount of time (timeout delay), the header flit is considered to be in a potential deadlock situation and is routed to the deadlock queue at the current node. This timeout delay value is explored further in Section 4. Once one of the header flits in a message has been assigned to the deadlock queue, all remaining header flits in the message must be routed as soon as they reach the top of

the queue. If any remaining header flit cannot be immediately routed, it is also considered to be in a potential deadlock and is routed to the deadlock queue at the current node. Messages in the deadlock queue are reinjected into the network after a predetermined amount of time (reinjection delay). This reinjection delay is explored further in Section 4. When the deadlock queue is full and another message is potentially deadlocked, an interrupt is generated and the message is absorbed into the current node. When space is available in the router's deadlock queue, the message is prefetched from the local processing node and moved to the deadlock queue in the router. Messages in the deadlock queue have priority over messages that are newly generated at the same node. Because overflow messages can be stored in the local processing node, the deadlock queue is essentially infinite for all practical purposes; this causes no additional routing delay and eliminates the possibility of deadlock.

Fig. 2. Immediate header flit routing: Header flit A:2 is blocked while header flit A:1 continues to be routed, holding VCs and causing message B to block.

Fig. 3. Delayed header flit routing: Header flit A:2 is blocked while header flit A:1 waits to be routed at the neighbor node until the last header flit in this message (flit A:2) is routed; Message B can now be routed.

Fig. 4 shows an example of a potentially deadlocked situation. Flit A:1 has been routed to a neighboring queue in the X dimension. Flit A:2 has not been routed in T cycles (timeout delay) and has been routed to the deadlock queue at the current node. Since this flit is potentially deadlocked, all remaining header flits in this message (Flit A:3) must now be immediately routed.
Since Flit A:3 requests the queue already occupied by Message B, this flit cannot be immediately routed to a free channel and must also be routed to the deadlock queue. After a predetermined amount of time (reinjection delay), the message that was routed to the deadlock queue is moved to the source queue at the current node and reinjected into the network.

Fig. 4. Potentially deadlocked situation: Flit A:1 has been routed to a neighboring queue. Flit A:2 has timed out and been routed to the deadlock queue, requiring flit A:3 to be immediately routed. Since flit A:3 requests the queue occupied by Message B, it must also be routed to the deadlock queue behind flit A:2.
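The timeout, parking, and reinjection behavior of Section 3.3 can be modeled roughly as follows. This is a simplified sketch under stated assumptions rather than the hardware design: the class and function names are hypothetical, whole messages are parked instead of individual header flits, the constants take the values that Section 4 reports for the simulations (16 and 50 cycles), and a message prefetched from the local node is assumed to start its reinjection delay when it enters the router's queue, which the text does not specify.

```python
from collections import deque

TIMEOUT = 16   # cycles a blocked header flit may wait (value from Section 4)
REINJECT = 50  # cycles a message stays in the deadlock queue (Section 4)


def is_potentially_deadlocked(wait_cycles, message_flagged):
    """A blocked flit is potentially deadlocked once it has waited TIMEOUT
    cycles, or immediately if an earlier flit of the same message has
    already been moved to the deadlock queue."""
    return message_flagged or wait_cycles >= TIMEOUT


class DeadlockQueue:
    """Per-node deadlock queue with overflow into the local processing node."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.slots = deque()     # (message, cycle parked), held in the router
        self.overflow = deque()  # messages absorbed by the local node

    def park(self, message, now):
        if len(self.slots) < self.capacity:
            self.slots.append((message, now))
        else:
            # Queue full: interrupt, and the local node absorbs the message.
            self.overflow.append(message)

    def tick(self, now):
        """Return the messages to reinject at cycle `now`, then prefetch
        overflow messages into any router slots that were freed."""
        ready = []
        while self.slots and now - self.slots[0][1] >= REINJECT:
            ready.append(self.slots.popleft()[0])
        while self.overflow and len(self.slots) < self.capacity:
            self.slots.append((self.overflow.popleft(), now))
        return ready
```

A simulator would call `tick` once per cycle and place the returned messages ahead of newly generated ones in the source queue, reflecting the priority given to deadlock-queue messages.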

3.4 HTA Router Implementation

The HTA router implementation is shown in Fig. 5 and uses one unidirectional physical channel (PC) per dimension per node. Six VCs are multiplexed over one PC, with three of the VCs dedicated to unicast and three dedicated to multicast. Only one VC is required for the multicast communication algorithm. However, three VCs are used here in order to decrease message blocking probability and to increase performance. At least one sink channel is used for each node. Once a sink channel is assigned to a message, it is not released until the whole message has finished its transmission. Storage buffers are associated with input (IP) channels, requiring the routing decision to be made after buffering the message. A crossbar is used to connect input buffers to output buffers, allowing the transfer of data flits to multiple VCs.

Fig. 5. Schematic of 2D router for HTA.

4 PERFORMANCE EVALUATION BY SIMULATION

Extensive simulation experiments were carried out to compare the performance of our proposed HTA scheme with two representative existing multicast schemes: Software Multicast (one unicast message is sent for every destination address in the multicast message) and Column-Path [3], [2]. (In Column-Path, the destinations in a multicast message are placed into submulticast messages according to the column the destination is in. For example, in a unidirectional torus, at most k submulticast messages can be sent per multicast message, one submulticast message per column.) A discrete-time simulator was used for 8-ary 2-cube and 16-ary 2-cube networks. Message sizes varied from 16 to 64 flits and the number of destinations per multicast message was randomly chosen and varied from eight to 32. The buffer sizes used in the simulation are all equal to a single message length. All router implementations use six VCs per dimension.
The Software Multicast and Column-Path algorithms both use two VCs per dimension for deterministic routing and four for adaptive routing. The timeout and reinjection delays used for HTA, for all message sizes and numbers of destinations per message simulated here, are 16 and 50 clock cycles, respectively. Fifty cycles is a feasible delay because the deadlock detection and recovery scheme is not a software-based approach. Instead, the deadlock queue that holds potentially deadlocked messages in HTA is located in the router. The deadlock queue can hold one multicast message, with all overflow messages being absorbed by the local processing node. The communication startup time required for ordering the messages in the Column-Path algorithm is not included in the simulations. The time required for creating and placing the messages in the source queue is also not included for any of the routing algorithms simulated here. Steady state is determined using a stabilization threshold on the difference between traffic measured 1,000 clock cycles apart. Traffic was varied from 0.1 in increments of 0.1 until saturation was reached. Simulations were performed for traffic composed of only multicast communication, only unicast communication, and half unicast and half multicast communication. To reduce the probability of deadlock near saturation, injection limitation schemes are often used [12], [19], [14]. In

the simulations performed here, message injections were limited to three unicast messages and one multicast message in the source queue simultaneously. This back-pressure mechanism sometimes results in a fairly flat curve near saturation in some of the latency versus target traffic graphs. All implementations use 12 sink channels. Although this is an unusually high number of consumption channels, the Column-Path routing algorithm requires this many channels for deadlock freedom since the adaptive routing algorithm proposed in [6], [1] is used for the base routing conformed path (as opposed to e-cube routing, where fewer sink channels are required). For fairness, Software Multicast and HTA are also simulated with 12 sink channels, although both require only one sink channel.

Fig. 6. Deadlock probabilities.

Fig. 6 shows the probability of deadlock versus normalized applied load for HTA. Fig. 7 shows the message latencies as well as the accepted load versus offered load

plots of the three multicast routing algorithms for all multicast traffic for various message sizes and numbers of destination nodes. The remaining figures (Figs. 8, 9, and 10) show plots only for message latencies versus offered load since all other accepted traffic versus offered traffic graphs are similar to those in Fig. 7. When both unicast and multicast communication are simulated simultaneously, message latency includes both unicast and multicast latency. A discussion of these results is found in the following sections.

Fig. 7. All multicast communications for an 8-ary 2-cube network with message size = 64 flits.

Fig. 8. All multicast communication for a 16-ary 2-cube network with message size = 64 flits.

4.1 Deadlock Probability

Fig. 6 shows the probability of deadlock versus normalized applied load for HTA. The probability of deadlock is the total number of potentially deadlocked messages (PDM) divided by the number of messages that have reached their destinations. The probabilities are low except near saturation. Taking into account the differences in the simulations (e.g., timeout and reinjection delays, bidirectionality, switching type, message and network size), HTA's results are comparable to those reported for k-ary n-cubes under unicast traffic in [20], [23], [9], [7], [14].
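The deadlock probability defined in this section can be written compactly as a ratio. The symbols below are notation introduced here for illustration and do not appear in the original text: N_PDM is the total number of potentially deadlocked messages and N_delivered is the number of messages that have reached their destinations.

```latex
P_{\text{deadlock}} = \frac{N_{\text{PDM}}}{N_{\text{delivered}}}
```

Because a reinjected message may be counted as potentially deadlocked more than once on its way to its destinations, this ratio is best read as a rate of potential deadlocks per delivered message rather than a strict probability.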

Fig. 9. All unicast communication with message size = 64 flits.

In schemes such as DISHA, when a potential deadlock occurs, one of the messages in the deadlocked set is removed from the network, using additional buffers at each node to route the message directly and immediately to its destination, and, therefore, the PDM does not have another chance of deadlocking [22], [23], [20]. In HTA, the PDM is removed from the network at the current node and, after a given amount of time, is reinjected into the network at the same node at which it deadlocked. The reinjected PDM may potentially deadlock again along the path to its destination node. HTA does not require as much complex logic, is scalable, and does not have a single point of failure. Increasing the number of destinations per message results in a greater probability that a destination will block, increasing deadlock probability. The greater the message size, the greater the deadlock probability because more resources are occupied. Increasing network size results in a longer path between source and destination and also results in greater deadlock probability.

4.2 Multicast Latency

4.2.1 Message Latency

Fig. 7 shows that HTA performs best among all three algorithms at all utilizations and for all message sizes, even without including the time for destination ordering in the Column-Path algorithm. This is because the probability of message blocking has been kept low, the deadlock detection and recovery algorithm is efficient, and tree-based routing does not unnecessarily copy data flits when routing multicast messages.
As the number of destinations per message increases, the Column-Path algorithm performs better (although not as well as HTA) because more destinations can be grouped together, requiring a smaller number of submulticast messages to be sent per multicast message.

4.2.2 Saturation Point

HTA always has the highest saturation point since channels are only used and data flits are only copied when necessary. Traffic in the network is kept low, resulting in increased saturation points.

4.2.3 Effects of Network Size

HTA's performance increases as the 8-ary 2-cube network is increased in size to a 16-ary 2-cube network (Fig. 8). Its performance is always better than that of the other two algorithms for all message sizes. The Column-Path algorithm's performance suffers slightly because messages are less likely to fall in the same column, so more submulticast messages are used. HTA is much more flexible with respect to topology and its latency remains low.

4.2.4 Effects of Traffic Type (Unicast vs. Multicast)

When traffic is composed of all unicast messages (Fig. 9), the Column-Path and Software Multicast algorithms give similar performance because both these algorithms use the same unicast routing scheme and have the same number of

VCs devoted to unicast messages. The slight variances are simply due to the random generation of messages. At low utilization, HTA performs best because only three VCs per dimension are devoted to unicast communication, while the other two algorithms devote all six VCs. Having a smaller number of VCs per dimension means less message multiplexing, resulting in lower message latencies. At high utilization, the Column-Path and Software Multicast algorithms perform better because each of these algorithms has six VCs per dimension for unicast messages. Six VCs provide greater adaptivity for messages to be routed around blocked messages at high utilization, resulting in higher saturation points. When traffic is composed of half unicast and half multicast traffic (Fig. 10), unicast and multicast latency versus applied load graphs are shown. HTA performs comparably to or better than the other algorithms for all message sizes and utilizations.

Fig. 10. Fifty percent unicast and 50 percent multicast communications for an 8-ary 2-cube network with message size = 64 flits.

5 COMPARISON OF HTA WITH EXISTING SCHEMES

HTA differs in many important respects from the previously proposed tree-based routing algorithms [10], [13], [5], [4]. Below, we provide a detailed comparison of HTA with the existing schemes.

• In most tree-based routing algorithms, wormhole switching is implemented. Virtual cut-through switching can greatly decrease the probability of deadlock over wormhole switching when a fixed number of virtual channels are used [23]. Although cut-through switching is used in [5], [4], the switching is implemented using a common buffer pool and the buffer is located in the local processing node (not in the router itself). In HTA, a buffer is implemented at every VC and every buffer is located in the router. This greatly increases performance.
- In [5], [4], only one PC (with no VCs per PC) is used per dimension and only one sink channel is implemented. Increasing the number of VCs increases routing freedom, which in turn exponentially decreases the probability of deadlock [23], [20] and results in better performance.
- Although [5], [4] use a deadlock detection and recovery scheme, the HTA scheme is more efficient. When a header flit blocks for a predetermined amount of time in HTA, the header flit is routed to the deadlock queue. All remaining header flits are then immediately routed to, first, any previously routed channels, then to any available and applicable channel, and, finally, to the deadlock queue if no other routing option remains.
- The HTA scheme improves upon the deadlock recovery method in [5], [4], in which all header flits (including all those that have already been routed at the current node and at any of the downstream nodes) are aborted. In addition, in their deadlock recovery scheme, the entire message is always copied to the local processing node when a multicast is split (whether deadlock occurs or not). When an abort does happen, the message is already stored at the local node and is ready to be reinjected into the network after a given amount of time. However, this method wastes valuable channel bandwidth and causes contention in the network if two multicast channels request the split channel.
- Choices for timeout values for deadlock detection and recovery schemes include timeouts equivalent to the size of the message [9], four times the message length [14], 8-16 cycles [21], and 800-1,000 cycles [19]. Tree-based multicast communication timeout requirements are slightly different from those for unicast communication. In tree-based multicast, more than one header flit is usually routed at each node. If the timeout is too great for each header flit, message progress through the network will be very slow since data flits are not forwarded until all header flits are routed. If the timeout is too small, many false deadlocks will result. HTA uses 16 cycles for all message sizes and numbers of destinations per message.
- Reinjection delays for unicast communication on wormhole switching under k-ary n-cube networks are around 200 cycles [19], [14]. Deadlock detection and recovery techniques similar to DISHA [21] do not require reinjection delays because, when messages are potentially deadlocked, they are immediately routed using "floating buffers" to their destination. HTA uses a reinjection delay of 50 cycles.
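
As an illustration only, the header-flit handling described in the items above can be sketched in Python. This is not the authors' implementation. The names `HeaderFlit`, `NodeState`, and `route_header_flit` are hypothetical, each dimension is modeled as a single multicast channel, and the round-robin choice among previously routed channels is replaced by simply picking the lowest-numbered one. The 16-cycle timeout is the value quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

TIMEOUT_THRESHOLD = 16  # cycles; the timeout value the text reports for HTA


@dataclass
class HeaderFlit:
    # Hypothetical model: dimension index -> hops this flit still has to travel.
    remaining: Dict[int, int]
    timeout: int = 0


@dataclass
class NodeState:
    # Dimensions whose channel is already allocated to this message at this node.
    routed: Set[int] = field(default_factory=set)
    # Dimensions that currently have a free multicast channel.
    free: Set[int] = field(default_factory=set)
    # True once any header flit of this message has timed out at this node.
    timed_out: bool = False


def route_header_flit(flit: HeaderFlit, node: NodeState) -> Tuple[str, Optional[int]]:
    """Make one routing attempt (one cycle) for a multicast header flit."""
    useful = [d for d, hops in flit.remaining.items() if hops > 0]

    # 1. Prefer a channel already used by another header flit of this message.
    #    (Assumption: lowest-numbered dimension instead of round-robin.)
    shared = sorted(d for d in useful if d in node.routed)
    if shared:
        return ("reuse", shared[0])

    # 2. Otherwise allocate a free channel, trying the dimension with the
    #    greatest remaining distance first.
    for d in sorted(useful, key=lambda dim: flit.remaining[dim], reverse=True):
        if d in node.free:
            node.free.discard(d)
            node.routed.add(d)
            return ("allocate", d)

    # 3. If an earlier flit of this message timed out here, do not wait again.
    if node.timed_out:
        return ("deadlock_q", None)

    # 4. Otherwise wait one cycle; divert to the deadlock queue on timeout.
    flit.timeout += 1
    if flit.timeout > TIMEOUT_THRESHOLD:
        node.timed_out = True
        return ("deadlock_q", None)
    return ("blocked", None)
```

Calling `route_header_flit` once per cycle for a blocked flit returns `("blocked", None)` until the timeout counter exceeds the threshold, after which that flit, and any later blocked header flit of the same message at that node, is diverted to the deadlock queue without waiting.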

- It was shown in [23] that bidirectional networks have a lower deadlock probability than unidirectional networks. Deadlock recovery schemes using bidirectional channels include [9], [19], [23], [14]. Unidirectional channels are addressed in [23], [7]. HTA uses unidirectional channels.
- The QBM scheme [24] is a deadlock-free algorithm. This limits routing options to only those paths valid for deadlock freedom. The scheme also requires a startup delay for building a QBM tree at the beginning of a user-level multicast, making it more suitable for bulk multicasts. HTA has more routing flexibility since any available path is a valid routing option in its deadlock detection and recovery scheme, and HTA does not require any additional startup delay.
- Several machines already have some hardware support for multicast. The nCUBE-2 is a wormhole-switched hypercube which supports broadcast within each subcube [15]. However, deadlock is possible if multiple multicasts exist [16]. The NEC Cenju-3 supports broadcast within each continuous region, but deadlock is once again possible if multiple multicasts exist. Finally, the Thinking Machines Corporation (TMC) CM-5 supports one multicast at a time via the control network.

6 CONCLUSIONS

In this paper, we introduce a new fully adaptive minimal hardware tree-based routing algorithm (HTA) for multicast communication under virtual cut-through switching in k-ary n-cubes and present an experimental evaluation of its performance under different operating conditions. HTA is compatible with existing unicast routing algorithms and uses deadlock detection and recovery. Our experimental results demonstrate that the deadlock probability in the proposed scheme remains low except near saturation; the probabilities are comparable to those of other existing schemes and vary between 0 and 15 percent, except near saturation, where they go up to 30 percent.
HTA performs very well and can outperform both the Software Multicast and Column-Path algorithms. The superiority of the proposed multicast routing algorithm is due to its ability to keep low the probability of message blocking at each intermediate node along a multicast message's path.

APPENDIX
PSEUDOCODE OF THE HTA SCHEME

Note:
    i  = current node that flit is at
    i' = previous node that flit was at
    j  = current queue that flit is in
    j' = previous queue that flit was in
    (all other variable meanings should be obvious from the context of the pseudocode)

FUNCTION 1:

    function route_msg_at_node ()
        delayed_header_flit_condition = FALSE;
        for (flit=0; flit<num_flits_in_msg; flit++)
            if flit = header flit
                while delayed_header_flit_condition = FALSE
                    for (k=0; k<num_header_flits_in_msg; k++)
                        if node[i'].queue[j'].flit[k].routed = TRUE then
                            num_flits_routed++;
                    if num_flits_routed = num_header_flits_in_msg then
                        delayed_header_flit_condition = TRUE;
                route_header_flit_at_top_of_queue (flit);
            else
                forward data flit to all allocated paths for this msg at this node;
        if at least one header flit in current msg has been routed to deadlock q at current node then
            while node[i].deadlock_q.reinject_delay < reinject_threshold
                node[i].deadlock_q.reinject_delay++;
            Place message from deadlock q into source q;

FUNCTION 2:

    function route_header_flit_at_top_of_queue (header_flit)
        if header_flit = unicast then
            Route header_flit using unicast deadlock-free algorithm;
        else
            if header_flit can be routed to at least one channel already routed to by another flit in this msg then
                Route to a previously routed channel using a round-robin policy among all previously routed channels;
                exit;
            else if header_flit can be routed to one or more free multicast channels at this node then
                Prioritize dimensions for this current header flit so that the dimension with the
                    greatest distance this flit has to travel has the highest priority (priority n),
                    followed by the next greatest distance this flit has left to travel (with
                    priority n-1), and so on;
                for (p=n; p>0; p--)
                    if can route current header_flit to any free multicast channel in the dimension with priority p then
                        Route header flit on a free multicast channel in dimension with priority p;
                        exit;
            else if any previous flit in this message at this node timed out then
                Route header flit to deadlock q;
                exit;
            else
                header_flit.timeout++;
                if header_flit.timeout > threshold then
                    Route header_flit to deadlock q;
                    exit;

REFERENCES

[1] P. Berman, L. Gravano, G. Pifarre, and J. Sanz, "Adaptive Deadlock and Livelock Free Routing with All Minimal Paths in Torus Networks," Proc. Symp. Parallel Algorithms and Architectures, pp. 3-12.
[2] R. Boppana, S. Chalasani, and C. Raghavra, "On Multicast Wormhole Routing in Multicomputer Networks," Proc. Symp. Parallel and Distributed Processing, Oct.
[3] R. Boppana, S. Chalasani, and C. Raghavra, "Resource Deadlocks and Performance of Wormhole Multicast Routing Algorithms," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 6, June.
[4] G. Byrd, R. Nakano, and B. Delagi, "A Dynamic Cut-Through Communication Protocol with Multicast," Technical Report STAN-CS, Stanford Univ., Aug.
[5] G. Byrd, N. Saraiya, and B. Delagi, "Multicast Communication in Multiprocessor Systems," Proc. Int'l Conf. Parallel Processing, vol. 1, Aug.
[6] J. Duato, "A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks," Proc. Symp. Parallel and Distributed Processing.
[7] A. Folkestad and C. Roche, "Deadlock Probability in Unrestricted Wormhole Routing Networks," Proc. IEEE Int'l Conf. Comm., June.
[8] P. Kermani and L. Kleinrock, "Virtual Cut-Through: A New Computer Communication Switching Technique," Computer Networks, vol. 3.
[9] J. Kim, Z. Liu, and A. Chien, "Compressionless Routing," Proc. Int'l Symp. Computer Architecture (ISCA), Apr.
[10] X. Lin, P. McKinley, and L. Ni, "Performance Evaluation of Multicast Wormhole Routing in 2D Mesh Multicomputers," Proc. Int'l Conf. Parallel Processing.
[11] X. Lin and L.M. Ni, "Deadlock-Free Multicast Wormhole Routing in Multicomputer Networks," Proc. Int'l Symp. Computer Architecture.
[12] P. Lopez, J. Martinez, J. Duato, and F. Petrini, "On the Reduction of Deadlock Frequency by Limiting Message Injection in Wormhole Networks," Proc. Parallel Computing, Routing, and Comm. Workshop, June.
[13] M. Malumbres, J. Duato, and J. Torrellas, "An Efficient Implementation of Tree-Based Multicast Routing for Distributed Shared-Memory Multiprocessors," Proc. Eighth IEEE Symp. Parallel and Distributed Processing, Oct.
[14] J. Martinez, P. Lopez, J. Duato, and T. Pinkston, "Software-Based Deadlock Recovery Technique for True Fully Adaptive Routing in Wormhole Networks," Proc. Int'l Conf. Parallel Processing, Aug.
[15] NCUBE Co., NCUBE 6400 Processor Manual.
[16] L. Ni, "Should Scalable Parallel Computers Support Efficient Hardware Multicast," Proc. Int'l Conf. Parallel Processing.
[17] L.M. Ni and P.K. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," Computer.
[18] D.K. Panda, S. Singal, and P. Prabhakaran, "Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme," Proc. First Parallel Routing and Comm. Workshop.
[19] F. Petrini, J. Duato, P. Lopez, and J. Martinez, "LIFE: A Limited Injection, Fully Adaptive, Recovery-Based Routing Algorithm," Proc. Fourth Int'l Conf. High Performance Computing, Dec.
[20] T. Pinkston and S. Warnakulasuriya, "On Deadlocks in Interconnection Networks," Proc. Int'l Symp. Computer Architecture, June.
[21] V. Anjan and T. Pinkston, "An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA," Computer Architecture News, vol. 23, no. 2, May.
[22] K.V. Anjan and T. Pinkston, "An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA," Proc. Int'l Symp. Computer Architecture.
[23] S. Warnakulasuriya and T. Pinkston, "Characterization of Deadlocks in Interconnection Networks," Proc. Int'l Parallel Processing Symp., Apr.
[24] J. Yang and C. King, "Efficient Tree-Based Multicast in Wormhole-Routed 2D Meshes," Proc. Int'l Symp. Parallel Architectures, Algorithms, and Networks.

Dianne Kumar received the BS degree in applied physics from Xavier University in 1992 and the MS degree in electrical engineering and the PhD degree in computer science from Colorado State University in 1994 and 1999, respectively. She is currently an assistant professor in the Department of Computer Science and Engineering at the University of Colorado at Denver. Her current research interests include multiprocessor systems, interconnection networks, parallel and distributed computing, and networking. She is a member of the IEEE and of the ACM.

Walid A. Najjar received the BE degree in electrical engineering from the American University of Beirut in 1979 and the MS and PhD degrees in computer engineering from the University of Southern California in 1985 and 1988, respectively. He is an associate professor in the Department of Computer Science and Engineering at the University of California Riverside. He was on the faculty of the Department of Computer Science at Colorado State University from 1989 to 2000; before that, he was with the USC-Information Sciences Institute. His research interests include computer architecture, reconfigurable and embedded systems, parallel computing systems, and interconnection networks.

Pradip K. Srimani is a professor and chair of the Department of Computer Science at Clemson University. He has previously served on the faculty of the Indian Statistical Institute, Calcutta; Gesellschaft für Mathematik und Datenverarbeitung, Bonn, West Germany; Indian Institute of Management, Calcutta, India; Southern Illinois University, Carbondale, Illinois; Colorado State University, Ft. Collins, Colorado; and the Technical University of Compiegne, France. He was the editor-in-chief of the IEEE Computer Society Press and is an associate editor of the IEEE Transactions on Data and Knowledge Engineering and a contributing member of IEEE Software. His research interests include mobile computing, distributed computing, parallel algorithms, networks, and graph theory applications. He is a co-editor of two books on software reliability and distributed mutual exclusion algorithms by IEEE CS Press. He has guest-edited special issues for Computer, IEEE Software, VLSI Design, Journal of Systems & Software, Journal of Computer & Software Engineering, IEEE Transactions on Software Engineering, Parallel Computing, and International Journal of Systems Science. He is a member of the ACM/IEEECS Steering Committee on Curricula. He is a fellow of the IEEE and a member of the ACM.
