Deadlock- and Livelock-Free Routing Protocols for Wave Switching


José Duato, Pedro López
Facultad de Informática, Universidad Politécnica de Valencia
P.O.B. 22012, 46071 - Valencia, SPAIN
E-mail: jduato@gap.upv.es

Sudhakar Yalamanchili
Computer Systems Research Laboratory
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, Georgia 30332-0250
E-mail: sudha@ee.gatech.edu

Abstract

Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, while also reducing latency for communications that use pre-established physical circuits. In this paper, we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free.

1. Introduction

Distributed memory multiprocessors rely on an interconnection network to exchange information between nodes. In multicomputers [2], each processor has a local address space and the interconnection network is used for message passing between processors.
In distributed shared-memory multiprocessors (DSMs), the interconnection network is used either to access remote memory locations [16] or to support a cache coherence protocol [17]. Nowadays, these architectures use similar hardware support to implement the interconnection network [14, 16, 17]. State-of-the-art interconnection networks use low-dimensional topologies and wormhole switching [5]. (This work was supported by the Spanish CICYT under Grant TIC94-0510-C02-01.)

Multicomputers usually send messages by calling a system function. This system call has a considerable overhead due to buffer allocation at source and destination nodes, message copying between user and kernel space, packetization, in-order delivery and end-to-end flow control. Even for a very efficient messaging layer based on active messages [20], software overhead accounts for 50-70% of the total cost [15]. Therefore, reducing the network hardware latency has a minimal impact on performance. On the other hand, messages are directly sent by the hardware in DSMs, as a consequence of remote memory accesses or coherence commands. Reducing the network hardware latency and increasing network throughput is crucial to improve the performance of DSMs.

Satisfying the requirements of multicomputers and DSMs is not a trivial task. Wormhole routers are simple and fast. The main limitation of wormhole switching is the contention produced by blocked messages. Those messages remain in the network, preventing the use of the channels they occupy and wasting channel bandwidth. Virtual channels can increase throughput considerably by dynamically sharing the physical bandwidth among several messages [7]. Another approach to reduce contention and improve channel utilization consists of using adaptive routing [11]. Adaptive routing algorithms must be carefully designed to avoid deadlocks [8, 9]. Virtual channels can be combined with adaptive routing to maximize throughput.
Unfortunately, virtual channels and adaptive routing make the router more complex, increasing node delay [4]. As a consequence, latency may increase. Latency can also be reduced by using an appropriate mapping of processes to processors, exploiting spatial locality in communications. In many cases, this locality is not only spatial but also temporal. If the router architecture supports circuit switching, the compiler could generate instructions that instruct the router to set up a path or circuit that will be heavily used during a certain period of time. Also, once a circuit has been established there is no contention for the messages using that circuit.

It is possible to design a router architecture such that a circuit is set up and left open for future transmission of data items. As far as we know, this technique was first proposed in [3] for systolic communication. It has also been proposed in [13] for message passing. The underlying idea behind pre-established circuits is similar to the use of cache memory in a processor. A common limitation of both caches and networks is that they are limited resources. If a new circuit requires the use of some channels belonging to previously established circuits, they must be torn down. Another common aspect of both caches and networks is that they may require compiler support to work more efficiently. Prefetching is an efficient technique to hide memory latency in case of cache misses [18]. Similarly, when two nodes are going to exchange several messages, the compiler could generate instructions that instruct the router to set up a circuit between those nodes before that circuit is needed. When data are available, the circuit has already been established. Thus, message transfer will be faster because header routing time and contention have been eliminated.

There is an important difference between caches and circuits. Caches are much faster than core memory. However, circuits offer the same bandwidth as regular channels. If circuits were able to use faster channels, network performance would increase considerably because bandwidth would be allocated to messages that really need that bandwidth.

In a previous paper, we proposed a new hybrid switching technique called wave switching as well as a router architecture supporting it [10]. Wave switching implements wormhole switching and circuit switching concurrently. By combining circuit switching, pre-established physical circuits^1 and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, while also reducing latency for communications that use pre-established physical circuits.
As shown in [10], wave switching is able to reduce latency and increase throughput by a factor higher than three if messages are long enough (128 flits or longer), even if circuits are not reused. For short messages, wave switching can only improve performance if circuits are reused. In this paper, we propose two routing protocols to establish and tear down circuits. The first protocol handles the network as a cache of circuits. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are deadlock- and livelock-free.

The router architecture for wave switching proposed in [10] is described in section 2. Section 3 proposes two routing protocols for wave switching. Those protocols are shown to be deadlock-free and livelock-free in section 4. Finally, the paper presents some concluding remarks and directions for future work.

^1 A physical circuit is a circuit made of physical channels. A virtual circuit is a circuit made of virtual channels. Where it is irrelevant to the discussion, we will refer to them simply as circuits.

[Figure 1. Typical architecture of a wormhole router: input channels feeding input queues (virtual channels), a switch, a routing control unit, output queues (virtual channels) multiplexed onto output channels, and a connection from/to the local processor.]

2. Router Architecture for Wave Switching

When the network topology has more dimensions than the number of physical dimensions used to implement it, some channels require long wires. In this case, channel delay has a major impact on clock frequency. Some researchers have proposed the use of pipelined channels to reduce the impact of wire length on clock frequency [19]. Pipelined channels use wave pipelining: new data are injected into a channel before previously injected data have reached the other end of the channel. Propagation speed is only limited by wire capacitance. At any given time, several data items are propagating along the channel.
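The amount of data in flight on a pipelined channel follows directly from the wire delay and the clock period; the sketch below is ours (the function name and the choice of nanoseconds as units are assumptions, not part of the paper):

```python
def flits_in_flight(wire_delay_ns: float, clock_period_ns: float) -> int:
    """With wave pipelining, a new flit enters the wire every clock cycle
    before the previous one has reached the far end, so the number of flits
    simultaneously propagating is the wire delay measured in clock periods."""
    return int(wire_delay_ns // clock_period_ns)
```

For instance, a 4 ns wire clocked with a 1 ns period carries four flits at once, which is why long wires no longer bound the clock frequency.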
As wires have some capacitance, the wave front is not sharp, limiting the maximum frequency at which data can be pipelined. The use of pipelined channels allows the designer to compute clock frequency independently of wire delay. Although long wires do not affect the speed of the router, clock frequency is still selected by considering routing delay and switch delay.

In this section, we present a router architecture that allows higher clock frequencies, increasing bandwidth considerably. It is based on the use of wave pipelining across both switches and channels. Assume that the router supports circuit switching and that a circuit has been established. Let us analyze the requirements to support wave pipelining. If we consider the router architecture shown in Figure 1, we can see that it is possible to pipeline flits across the switch as fast as flit buffers are able to deliver flits. Similarly, flits can be pipelined into physical channels at the same speed. However, using a higher clock frequency implies that the round-trip delay for sending flits across channels and receiving acknowledgments requires more clock cycles. As a consequence, a windowing protocol with a longer window should be used. A longer window also requires

deeper buffers. Finally, deeper buffers require a longer delay to reach the first empty buffer in the worst case, therefore increasing latency and limiting the increase in clock frequency. Therefore, a different approach is required if we want to take full advantage of wave pipelining across switches and channels.

Taking into account that the circuit has been previously established, flits will not find any busy channel in their way towards the destination node. If there is no contention with other messages, then there is no need for flow control, unless virtual channels are used. Assume that physical channels are not split into virtual channels. In this case, circuits use physical channels and flow control can be removed. As a consequence, flit buffers are no longer required and every switch transmits information directly to the switch in the next node. However, a careful design is required to minimize the skew between wires in a parallel data path. Synchronizers are required at each delivery channel. Synchronizers may also be required at each switch input to reduce the skew. This switching technique is known as circuit switching.

With circuit switching, it is possible to use wave pipelining across switches and physical channels, clocking at a very high frequency. Basically, clock frequency is limited by memory bandwidth, by signal skew and by latch setup time. With a proper router and memory design, this frequency can be much higher than the one used in current routers, increasing channel bandwidth and network throughput accordingly. As shown in [10], circuit simulations using Spice indicated that clock frequency could be up to four times higher than in a wormhole router using the same technology. Note that latency for sending information in pre-established physical circuits is also considerably reduced because flit buffers have been removed.
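The interaction between clock frequency, window length and buffer depth described above can be made concrete; the helper below is our own sketch (names and units are assumptions), not part of the router design:

```python
import math

def window_and_buffer_depth(round_trip_ns: float, clock_period_ns: float):
    """A windowing protocol must tolerate w unacknowledged flits, where w
    covers the acknowledgment round trip measured in clock cycles; the
    receiver needs at least that many flit buffers to absorb a full window
    without overflow."""
    w = math.ceil(round_trip_ns / clock_period_ns)
    return w, w  # (window length, minimum buffer depth in flits)
```

Halving the clock period doubles both values: window_and_buffer_depth(20, 2) gives (10, 10) while window_and_buffer_depth(20, 1) gives (20, 20), which is exactly why a faster clock forces a longer window and deeper buffers.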
However, pre-established physical circuits require the use of dedicated physical channels to allow multiple circuits between routers. Also, some mechanism is required to set up and tear down circuits. This can be solved by using a hybrid router architecture as shown in Figure 2. This architecture implements wormhole switching and circuit switching concurrently. Circuits are built using physical channels. However, physical circuits are set up and torn down by using a set of dedicated virtual channels, which will be referred to as control channels. Wormhole switching uses another set of dedicated virtual channels. This router architecture has several switches S_0, S_1, ..., S_k and two routing control units. One of them, together with switch S_0, implements wormhole switching. The second routing control unit implements pipelined circuit switching (PCS) [12] as follows: the remaining switches S_1, ..., S_k implement circuit switching on pre-established physical circuits using wave pipelining. As circuit switching does not provide flow control at the link level, physical channels are split into narrower physical channels. Although there is no flow control at the link level, end-to-end flow control is required between the injection buffer at the source node and the delivery buffer at the destination node.

[Figure 2. New router architecture: input channels and pipelined input channels with synchronizers, switches S_0 through S_k, control channels, the WH and PCS routing control units, a connection from/to the local processor, and multiplexers driving the output channels and pipelined output channels.]

Taking into account the round-trip delay for control signals (i.e., acknowledgments), a windowing protocol is implemented. This protocol requires deep delivery buffers to prevent buffer overflow while acknowledgments are transmitted towards the source node of the message. Each physical channel in switch S_0 is split into k + w virtual channels.
Among them, k channels are the control channels associated with the corresponding physical channels in switches S_1, ..., S_k. Control channels are only used to set up and tear down physical circuits. These channels have capacity for a single flit because they only transmit control flits. Control channels are handled by the PCS routing control unit. The remaining w virtual channels are used to transmit messages using wormhole switching and require deeper buffers. They are handled by the wormhole routing control unit, as mentioned above. The hybrid switching technique

implemented by this router architecture will be referred to as wave switching.

The proposed router architecture allows sending messages using wormhole switching. Messages are routed using either a deterministic or an adaptive routing algorithm, blocking if necessary on busy channels. Thus, the routing algorithm must be deadlock-free. The router also allows establishing physical circuits on switches S_1, ..., S_k. These circuits are established by sending a probe through the control channels. In order to maximize the probability of establishing a circuit, a misrouting backtracking protocol with a maximum of m misroutes (MB-m) is used [12]. Once the physical circuit has been reserved, an acknowledgment is returned. The only difference with respect to PCS as proposed in [12] is that the path being reserved is formed by a different set of channels, i.e., those using switches S_1, ..., S_k. Therefore, the probe only reserves a fragment of a physical circuit if the corresponding channels (a bidirectional control channel and the associated physical channel in switch S_i, i ∈ [1..k]) are free. Both of them are reserved at the same time. Therefore, status registers are only needed for control channels in the PCS routing control unit. The control channel is needed to backtrack if the need arises, to return the acknowledgment, and to release the physical circuit. Once a circuit is no longer required, it is torn down by sending a control flit through the same path using the control channels.

Figure 3 shows the status registers associated with the PCS routing control unit. The Channel Status registers indicate whether the corresponding channel is free or busy. They can easily be extended to handle faulty channels. These registers are associated with each output control channel. The direct and reverse mappings between input and output channels are stored in the Direct Channel Mappings and Reverse Channel Mappings registers, respectively.
As mentioned above, the reverse path is required to return acknowledgments. The History Store keeps track of the output links that have already been searched by the probe. Together, the History Store registers of all the nodes in the network keep track of the paths already searched by each probe, therefore avoiding the repeated search of the same path. By storing information about searched links in the History Store register, the probe is kept small. Finally, a control bit associated with each output control channel (Ack Returned) indicates whether the acknowledgment for path setup has been returned through that channel.

[Figure 4. Format of a routing probe: Header | Backtrack | Misroute | Force | X1-offset | ... | Xn-offset.]

The format of a routing probe is shown in Figure 4. The Header bit identifies the flit as a probe. The Backtrack bit indicates whether the probe is progressing or backtracking. The Misroute field indicates the number of misrouting operations performed by the probe. The Force bit is used by one of the routing protocols to force channel release. The remaining fields are offsets from the destination node.

The circuits starting at each node are recorded in a special set of registers denoted as the Circuit Cache. Those registers are located in the network interface of every node. Figure 5 shows the structure of those registers. When a circuit uses a switch S_i, i ∈ [1..k] at the source node, it uses the same switch S_i at every intermediate node. The Switch field indicates the switch being searched by the probe, or the switch used by the circuit once the path has been set up. The Channel field indicates the output channel used by the circuit at the source node. The Dest field indicates the destination node of the circuit. If a probe does not succeed in establishing a circuit across some switch, it may try other switches depending on the routing protocol.
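As an illustration only, the probe fields just described could be packed into a single control flit as follows; the field widths (1/1/4/1 bits plus 8 bits per offset) are our assumption, since the paper does not fix them:

```python
def pack_probe(backtrack: int, misroute: int, force: int, offsets: list) -> int:
    """Pack the routing-probe fields into one control flit.
    Assumed widths: Header 1 bit, Backtrack 1 bit, Misroute 4 bits,
    Force 1 bit, and one 8-bit signed-magnitude offset per dimension."""
    flit = 1                        # Header bit: marks the flit as a probe
    flit = (flit << 1) | backtrack  # progressing (0) or backtracking (1)
    flit = (flit << 4) | misroute   # misrouting operations performed so far
    flit = (flit << 1) | force      # Force bit: tear circuits down if needed
    for off in offsets:             # per-dimension offsets from destination
        flit = (flit << 8) | (off & 0xFF)
    return flit
```

A single-flit probe is what keeps circuit setup cheap: the path history lives in the per-node History Store registers, not in the probe itself.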
The Initial Switch field records the first switch tried, so as to avoid repeating the search. The Ack Returned field indicates that the acknowledgment of path setup has been returned and the circuit is ready to be used. The remaining fields are used for replacement purposes. The In-use bit is set when there is a message in transit, thus preventing the circuit from being released until message transmission has finished. This bit is reset when the source node receives the acknowledgment for the last fragment of the message. The Replace field stores accounting information regarding the use of the circuit. The meaning of this field depends on the replacement algorithm.

[Figure 3. Status registers of the PCS routing control unit.]

[Figure 5. Structure of the Circuit Cache registers: one row per circuit, with fields Initial Switch, Switch, Channel, Dest., Ack Returned, In-use and Replace.]

Depending on the requirements of the applications running on the machine, the interconnection network may be designed with a single or several wave-pipelined switches. Note that if a single wave-pipelined switch per node is used, each pair of adjacent routers can only have one link capable of wave pipelining between them. On the other hand, splitting physical channels into narrower physical channels shares bandwidth in a very inflexible way. Additionally, a few control signals must be provided for each individual physical channel. Thus, it is not recommended to split each channel into many narrow physical channels.

In addition to a higher network bandwidth and lower latency for messages using pre-established physical circuits, the proposed router architecture has some interesting advantages. Circuits are established by sending a probe. The probe uses the MB-m protocol, being allowed to backtrack if it cannot proceed forward. This protocol is very resilient to static faults in the network, as indicated in [12]. Also, once a circuit has been established between two nodes, in-order delivery is guaranteed for all the messages transmitted between those nodes. Additionally, for message passing, the software overhead associated with message transmission can be considerably reduced if message buffers are allocated at both ends when the circuit is established. Those buffers will be reused by all the messages using the circuit. If the circuit is explicitly established by the programmer and/or the compiler for a set of messages, buffer size is determined by the longest message of the set. On the other hand, if the circuit is automatically established the first time a node sends a message to another node, the size of the longest message using that circuit is not known at that time. A reasonably large buffer can be allocated; in this case, buffers may have to be re-allocated for longer messages. Also, note that channel width in low-dimensional topologies is limited by pin count. More wires are allowed across the network bisection [1, 6]. It can be seen in Figure 2 that only a few control signals connect the PCS routing control unit with each switch S_i, i ∈ [1..k].
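The Circuit Cache entry described above can be modeled as a record; this is a sketch with assumed Python types, not the hardware register layout:

```python
from dataclasses import dataclass

@dataclass
class CircuitCacheEntry:
    # Fields follow the Circuit Cache registers described in the text;
    # the class itself and the integer encodings are illustrative.
    switch: int          # wave-pipelined switch S_i used by the circuit
    channel: int         # output channel used by the circuit at the source node
    dest: int            # destination node of the circuit
    initial_switch: int  # first switch tried, to avoid repeating the search
    ack_returned: bool   # path-setup acknowledgment received; circuit usable
    in_use: bool         # message in transit; circuit must not be torn down
    replace: int         # accounting information for the replacement algorithm
```

One such entry exists per circuit originating at the node, held in the network interface rather than in the router itself.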
For very high performance, several switches per node can be used, each one being implemented in its own chip. In this case, channel bandwidth does not decrease when the number of switches increases, assuming that this number is small. As a consequence, scalability is excellent because the number of switches (chips) per node can increase as network size increases, thus compensating for the higher average distance traveled by messages. Such an architecture follows a multi-chip implementation approach similar to the Cray T3D routers, wherein each dimensional crossbar is a separate chip. This effectively removes from consideration the pin-out limitations between adjacent nodes, since the pin requirements between these chips are quite small. The interesting design question then becomes how best to use the bisection bandwidth resource that is determined by the packaging technology.

Finally, the proposed router architecture is very flexible. It can be tailored to different requirements. Several parameters can be adjusted, including the number of fast switches, the number of virtual channels for wormhole switching, and the routing protocols for wormhole switching and PCS. The simplest version of the wave router is obtained by setting k = 1 and w = 0. In this case, all the messages use PCS.

3. Routing Protocols

In this section, we propose two routing protocols for wave switching. The first routing protocol automatically establishes a circuit when a node sends a message to another node and no circuit exists between them. When a circuit is being established and all the requested channels have been previously reserved by other circuits, a replacement algorithm selects a circuit. This circuit is torn down, releasing the channels forming it, and allowing the establishment of the new circuit. This protocol works like cache protocols. A cache line is brought from main memory every time a miss occurs.
When a line is required and the cache is full, a replacement algorithm selects a line to be removed from the cache. This protocol will be referred to as the Cache-Like Routing Protocol (CLRP).

The second protocol relies on the programmer and/or the compiler to determine when a circuit should be established or torn down. Circuits are only established when there is enough temporal communication locality so that it is worth establishing a circuit. When communication between two nodes is not frequent enough, messages are sent using wormhole switching. This protocol will be referred to as the Compiler Aided Routing Protocol (CARP). A circuit should only be requested if there are enough free physical channels to build it. However, the routing protocol should consider the case where all the requested channels are busy. This protocol is similar to prefetching for caches. When a circuit is going to be heavily used, it is established in advance. The main difference between caches and networks regarding this protocol is that a circuit should be explicitly torn down when it is no longer needed.

We believe that the CARP protocol is able to achieve higher performance because a circuit is only established when there is enough temporal communication locality. By doing so, every message or set of messages uses the best-suited switching technique. In particular, the CARP protocol does not establish circuits for individual short messages. Moreover, when buffers are allocated for transmitting a set of messages, buffer size is large enough to store the longest message, therefore avoiding buffer re-allocation. Additionally, channels are a scarce resource. The CARP protocol allows the use of global optimization algorithms. However, developing suitable compiler support for the CARP protocol may take several years. On the other hand, the CLRP protocol does not need any compiler or programmer support.
Therefore, it can be used in the first generation of multiprocessors implementing wave switching in their interconnection network.

3.1. Cache-Like Routing Protocol

The CLRP protocol for message transmission uses the Force bit. When the Force bit is not set, the probe backtracks if it does not find a free valid channel. If the Force bit is set, the probe does not backtrack, tearing circuits down to obtain the required channels. The most general form of this protocol is as follows: When a node sends a message to another node, the source node reads the Circuit Cache register to see if a circuit exists for the requested destination. If it exists, the circuit is used. Otherwise, a circuit is established in several phases, possibly tearing down other circuits, and the corresponding entry is stored in the Circuit Cache. In the first phase, a switch S_i, i ∈ [1..k], with a free output channel is selected at the source node, and a probe with the Force bit reset is sent to establish a physical circuit. The selected switch is recorded in the Initial Switch field of the corresponding entry in the Circuit Cache register. It is convenient for neighboring nodes to try different initial switches. For example, in a 2D mesh, node (x, y) can first try switch 1 + (x + y) mod k. At each intermediate node, a free output channel from switch S_i is selected. As mentioned above, the probe uses the MB-m protocol, being allowed to misroute up to m times and to backtrack if it cannot proceed forward. If the circuit is successfully reserved, an acknowledgment is returned. If it is not possible to establish a circuit and the probe backtracks all the way to the source node, the next switch modulo k is tried. The current switch is recorded in the Switch field. The Initial Switch field prevents the probe from searching the same circuit twice. If it is not possible to establish a circuit across any switch, the Force bit is set in the probe, which is sent again across switch Initial Switch. This is the second phase of the protocol. If the probe cannot proceed forward at some node (including the source node), it selects a circuit from the Circuit Cache that uses one of the requested channels and has the Ack Returned bit set. This circuit starts at the current node. Therefore, it can be torn down without interrupting any message in transit.
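As a rough sketch, the switch-selection loop spanning the first two phases (with wormhole switching through S_0 as the final fallback) can be modeled as follows. This is an illustrative sketch, not the paper's implementation; `clrp_establish` and `try_probe` are hypothetical names, and in the real protocol this logic is distributed between the source node and the MB-m probe.

```python
def clrp_establish(initial_switch, k, try_probe):
    """Return (switch, forced) on success, or None when both phases fail
    and the message must fall back to wormhole switching through S_0.

    try_probe(switch, force) models sending an MB-m probe through the
    given fast switch; it returns True if a circuit was reserved."""
    for force in (False, True):        # phase one, then phase two (Force bit set)
        for i in range(k):             # on failure, try the next switch modulo k
            s = 1 + (initial_switch - 1 + i) % k
            if try_probe(s, force):
                return s, force
    return None                        # phase three: wormhole switching

# Toy usage: circuits can only be reserved through switch 2, and only by force.
attempts = []
def probe(s, force):
    attempts.append((s, force))
    return force and s == 2

assert clrp_establish(initial_switch=1, k=3, try_probe=probe) == (2, True)
assert attempts == [(1, False), (2, False), (3, False), (1, True), (2, True)]
```

The simplified variants mentioned in the text correspond to shrinking the loops: setting the Force bit after the first failure amounts to trying only one switch in phase one, and skipping phase one amounts to dropping the `force=False` iteration entirely.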
Once the message currently using that circuit (if any) has been sent, the In-use bit is reset and the circuit is torn down, thus releasing the requested channel. It may happen that all the requested channels belong to circuits being set up, or to circuits that cross the current node but start at different nodes. In that case, a circuit crossing the node is selected among those that returned the acknowledgment flit (those that have the Ack Returned bit set in the PCS routing control unit). A control flit is sent towards the source node of that circuit, requesting that node to release it. Once the message currently using that circuit (if any) has been sent, the circuit is torn down, thus releasing the requested channel. The probe can then proceed to reserve the circuit. In the very unlikely case that all the outgoing channels of a node belong to circuits currently being established, the probe backtracks even if the Force bit is set. If it is not possible to establish a circuit and the probe backtracks all the way to the source node, the next switch modulo k is tried. If it is not possible to establish a circuit across any switch, the protocol enters the third phase. In this phase, the message is transmitted using wormhole switching through S_0 switches. The CLRP protocol can be simplified in several ways. First, when a circuit cannot be established using Initial Switch, the Force bit can be set without trying the remaining switches. Similarly, the second phase may try a single switch. Second, the Force bit can be set when the probe is first sent to establish the circuit, thereby skipping phase one. The optimal protocol depends on the number of physical switches per node and on the applications. It can only be tuned by using traces from real applications; this falls outside the scope of this paper.

3.2. Compiler Aided Routing Protocol

The CARP protocol is much simpler.
However, it requires that the compiler and/or the programmer generate instructions to set up and tear down a circuit. It works as follows: The compiler and/or the programmer decide whether a physical circuit should be established for a set of messages. For those messages not requiring pre-established circuits, wormhole switching is used across S_0 switches. When a physical circuit is requested, a switch S_i, i ∈ [1..k], is selected and a probe is sent to establish it. As mentioned above, it is convenient for neighboring nodes to try different initial switches. At each intermediate node, a free output channel from switch S_i is selected. As mentioned above, the probe uses the MB-m protocol, being allowed to misroute up to m times and to backtrack if it cannot proceed forward. If the circuit is successfully reserved, it will be used for the corresponding messages. When the circuit is no longer required, it is explicitly torn down by generating the appropriate instructions. If it is not possible to establish a circuit, the partially reserved path is torn down and the next switch modulo k is tried. If it is not possible to establish a circuit across any switch, messages requesting that circuit will have to use wormhole switching through S_0 switches.

4. Deadlock and Livelock Avoidance

For the proof of deadlock freedom, we rely on the properties of the routing protocols for PCS and wormhole switching. In [12], it was shown that the MB-m protocol is deadlock-free. The basic idea is that a probe can always backtrack up to the source, thereby releasing all the resources it previously reserved and allowing other probes to advance. Also, in [5, 8, 9], it was shown how to design deadlock-free routing algorithms for wormhole switching. Basically, for deterministic routing algorithms there should be no cyclic dependencies between channels.
For adaptive routing, cyclic dependencies are allowed provided that there exists a subset of channels without cyclic dependencies between them. The following theorems prove that the proposed protocols are deadlock-free.

Theorem 1. The CLRP protocol is deadlock-free.

Proof: The proof proceeds by analyzing all the possible cases. The use of previously reserved circuits does not produce deadlock because messages do not request more resources. No deadlock can arise while establishing a circuit
with the Force bit reset in phase one because the misrouting backtracking protocol MB-m is deadlock-free. When the Force bit is set (phase two), the probe may block at a node waiting for a previously established circuit to be torn down. If that circuit starts at the current node, it is immediately torn down unless there is a message in transit. If there is a message in transit, it will take a finite amount of time to end the transmission, because messages have a finite length and the destination node will accept the message, since the circuit was previously established. If the circuit does not start at the current node, an already established circuit crossing that node is selected. The selection is done in finite time. After selecting a circuit, a control flit is sent towards the source node of the circuit using control channels. Those control channels are free because, once the acknowledgment for path setup is returned, no other traffic crosses those channels towards the source node. When the control flit reaches the source node, the circuit will be released in finite time, just after the transmission of the current message (if any) ends. It may happen that the circuit is being released while the control flit advances towards the source node of the circuit. In this case, the control flit is discarded at some intermediate node and the circuit is released in finite time. Also, it may happen that two different nodes send control flits requesting the same circuit to be released. The first control flit initiates the release of the circuit; the second control flit is discarded, as indicated above. Both probes requesting channels from the circuit at different nodes will be able to reserve those channels in finite time. If all the requested channels at a given node belong to circuits being established, the probe does not block, avoiding deadlock by backtracking to the previous node.
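The case analysis above condenses into a selection rule for a blocked forced probe: tear down a circuit starting at the current node, otherwise send a control flit to the source of an acknowledged circuit crossing it, otherwise backtrack because every requested channel belongs to a circuit still being established. A minimal sketch, assuming a hypothetical dictionary encoding of the Circuit Cache state (`select_victim` and the field names are ours):

```python
def select_victim(circuits, requested, here):
    """Pick a circuit to release when a forced probe blocks at node `here`.

    Each circuit is a dict with keys 'channels', 'start' and 'ack_returned'
    (an illustrative encoding of Circuit Cache / PCS control-unit state).
    Returns ('local', c) when a circuit starting here can be torn down
    directly, ('remote', c) when a control flit must ask the circuit's
    source node to release it, or ('backtrack', None) when every requested
    channel belongs to a circuit still being established."""
    usable = [c for c in circuits
              if c['ack_returned'] and set(c['channels']) & set(requested)]
    local = [c for c in usable if c['start'] == here]
    if local:
        return 'local', local[0]
    if usable:
        return 'remote', usable[0]
    return 'backtrack', None

# Toy usage: three circuits known at node 'A'.
circuits = [
    {'channels': ['c1'], 'start': 'B', 'ack_returned': True},
    {'channels': ['c2'], 'start': 'A', 'ack_returned': True},
    {'channels': ['c3'], 'start': 'B', 'ack_returned': False},  # still being set up
]
assert select_victim(circuits, ['c2', 'c3'], here='A')[0] == 'local'
assert select_victim(circuits, ['c1'], here='A')[0] == 'remote'
assert select_victim(circuits, ['c3'], here='A') == ('backtrack', None)
```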
If the probe is not able to establish any circuit, it uses wormhole switching, blocking if necessary on busy channels. This does not produce deadlock because the routing algorithm used for wormhole switching is deadlock-free. Additionally, PCS and wormhole switching do not interact: each switching technique uses its own set of resources (routing control unit, switches, and channels). Therefore, the CLRP protocol is deadlock-free. □

It could be thought that a probe might simply block at a node when all the requested channels belong to circuits being established. However, waiting on busy channels while keeping previously reserved channels produces channel dependencies. As probes use the MB-m protocol, there would be cyclic dependencies between channels. Therefore, each probe could be waiting for a channel that will never be released, producing a deadlock. Thus, deadlock is avoided by backtracking.

Theorem 2. The CARP protocol is deadlock-free.

Proof: Establishing a circuit cannot produce deadlock because the misrouting backtracking protocol MB-m is deadlock-free. If the probe is not able to establish any circuit, the message uses wormhole switching, blocking if necessary on busy channels. This does not produce deadlock because the routing algorithm used for wormhole switching is deadlock-free. Additionally, PCS and wormhole switching do not interact: each switching technique uses its own set of resources. Therefore, the CARP protocol is deadlock-free. □

Guaranteeing the absence of deadlock is not enough, because the proposed routing protocols use misrouting and backtracking. Therefore, a probe could be trying to establish a circuit forever, never blocking but never reaching its destination. As above, we rely on the properties of the routing protocols for PCS and wormhole switching. The misrouting backtracking protocol MB-m is livelock-free because misrouting is limited to m misroutes.
Also, when a probe backtracks, it does not search the same path again, because the History Store in the PCS routing control unit keeps track of the paths already searched. As the number of paths in a network is finite, MB-m is livelock-free. Also, minimal routing algorithms for wormhole switching are livelock-free. The following theorems prove that the proposed protocols are livelock-free.

Theorem 3. The CLRP protocol is livelock-free.

Proof: The proof proceeds by analyzing all the possible cases. As MB-m is livelock-free, a probe with the Force bit reset will either succeed in reserving a path or will return to the source node after exhausting the search of paths using switch S_i. The number of switches is finite, and the Initial Switch field prevents a probe from using the same switch twice. Therefore, a probe cannot keep trying to establish circuits forever in phase one. When the Force bit is set and there exists a previously established circuit starting at the current node or crossing it, that circuit will be released, and the probe will be able to make progress toward its destination. If all the output channels at a node belong to circuits currently being established, the probe backtracks. Livelock is avoided by using the History Store, which prevents the probe from visiting a previously visited node. If the probe backtracks up to the source node with the Force bit set after exhausting the search of all the paths, the protocol enters the third phase and minimal routing is used through S_0 switches. As minimal routing is livelock-free, the CLRP protocol is livelock-free. □

Theorem 4. The CARP protocol is livelock-free.

Proof: As MB-m is livelock-free, a probe will either succeed in reserving a path or will return to the source node after exhausting the search of paths using switch S_i. The number of switches is finite, and the Initial Switch field prevents a probe from using the same switch twice. Therefore, a probe cannot keep trying to establish circuits forever.
If the probe backtracks up to the source node after exhausting the search of all the paths, minimal routing is used through S_0 switches. As minimal routing is livelock-free, the CARP protocol is livelock-free. □
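To illustrate why a bounded misrouting budget rules out livelock, the following toy model of an MB-m-style probe counts non-minimal steps against a limit m and never revisits a node on the current path (a drastic simplification of the History Store). `mb_m_probe` is our own illustrative name; the real protocol reserves physical channels along the path rather than searching a graph.

```python
def mb_m_probe(graph, src, dst, m, free):
    """Toy backtracking probe: depth-first search that may take at most
    m misrouting steps (steps that do not reduce the distance to dst)
    and never revisits a node on the current path. `free(channel)` tells
    whether a channel (a node pair) can be reserved. Returns the reserved
    path or None, in which case the message falls back to wormhole switching."""

    def dist(a, b):  # hop distance on the toy graph, via BFS
        frontier, seen, d = {a}, {a}, 0
        while frontier:
            if b in frontier:
                return d
            frontier = {n for v in frontier for n in graph[v]} - seen
            seen |= frontier
            d += 1
        return float('inf')

    def search(node, path, misroutes):
        if node == dst:
            return path
        for nxt in graph[node]:
            if nxt in path or not free((node, nxt)):
                continue
            extra = 0 if dist(nxt, dst) < dist(node, dst) else 1
            if misroutes + extra > m:
                continue               # misrouting budget exhausted: livelock bound
            found = search(nxt, path + [nxt], misroutes + extra)
            if found:
                return found
        return None                    # backtrack, releasing reserved channels

    return search(src, [src], 0)

# Toy usage on a 5-node ring A-B-C-D-E-A with the channel A-E blocked:
# reaching D then requires one misroute (A -> B does not reduce the distance).
cycle = {'A': ['B', 'E'], 'B': ['A', 'C'], 'C': ['B', 'D'],
         'D': ['C', 'E'], 'E': ['A', 'D']}
blocked = {('A', 'E'), ('E', 'A')}
free = lambda ch: ch not in blocked
assert mb_m_probe(cycle, 'A', 'D', 0, free) is None
assert mb_m_probe(cycle, 'A', 'D', 1, free) == ['A', 'B', 'C', 'D']
```

Because every probe either reaches the destination or exhausts its finite search space and returns to the source, the wave router always has the wormhole fallback available, which is exactly the structure the two livelock proofs rely on.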

5. Conclusions

Wave switching is a new hybrid switching technique that exploits communication locality by combining circuit switching and wormhole switching. A wave router has two or more switches. One of the switches uses standard wormhole switching. The remaining switches implement circuit switching. These switches achieve a higher bandwidth by using wave pipelining across switches and channel wires. As shown in [10], wave switching is able to reduce latency and increase throughput by a factor higher than three if messages are long enough (128 flits or more), even if circuits are not reused. The wormhole switch can be used to transmit individual messages for which circuit switching is not efficient. The new switching technique is aimed at reducing latency and increasing throughput by taking advantage of spatial and temporal communication locality. Instead of optimizing the transmission of individual messages, the new switching technique optimizes the overall communication between pairs of processors by allowing the construction of high-bandwidth physical circuits. The new switching technique also makes it possible to reduce the overhead of the software messaging layer in multicomputers by offering better hardware support. In particular, message buffers can be allocated at both ends when the physical circuit is established. Those buffers will be reused by all the messages using the physical circuit. Additionally, in-order delivery and tolerance to static faults in the network are guaranteed for all messages using physical circuits. In this paper, we have proposed two routing protocols for wave switching. The first routing protocol (Cache-Like Routing Protocol, CLRP) automatically establishes a circuit when a node sends a message to another node and no circuit exists between them. When a circuit is being established and all the requested channels have been previously reserved by other circuits, a replacement algorithm selects a circuit.
This circuit is torn down, releasing the channels forming it and allowing the establishment of the new circuit. The second protocol (Compiler Aided Routing Protocol, CARP) relies on the programmer and/or the compiler to determine when a circuit should be established or torn down. Circuits are only established when there is enough temporal communication locality to make establishing a circuit worthwhile. When communication between two nodes is not frequent enough, messages are sent using wormhole switching. Additionally, we have shown that the proposed protocols are deadlock-free and livelock-free, thereby guaranteeing that every message will reach its destination in finite time.

References

[1] A. Agarwal, Limits on interconnection network performance, IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 4, pp. 398–412, October 1991.
[2] W.C. Athas and C.L. Seitz, Multicomputers: Message-passing concurrent computers, IEEE Computer, vol. 21, no. 8, pp. 9–24, August 1988.
[3] S. Borkar et al., iWarp: An integrated solution to high-speed parallel computing, in Proc. Supercomputing '88, November 1988.
[4] A.A. Chien, A cost and speed model for k-ary n-cube wormhole routers, in Proc. Hot Interconnects '93, August 1993.
[5] W.J. Dally and C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Computers, vol. C-36, no. 5, pp. 547–553, May 1987.
[6] W.J. Dally, Express cubes: Improving the performance of k-ary n-cube interconnection networks, IEEE Trans. Computers, vol. C-40, no. 9, pp. 1016–1023, September 1991.
[7] W.J. Dally, Virtual-channel flow control, IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 194–205, March 1992.
[8] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 12, pp. 1320–1331, December 1993.
[9] J. Duato, A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 10, pp. 1055–1067, October 1995.
[10] J. Duato, P. López, F. Silla and S. Yalamanchili, A high performance router architecture for interconnection networks, in Proc. 1996 Int. Conf. Parallel Processing, August 1996.
[11] P.T. Gaughan and S. Yalamanchili, Adaptive routing protocols for hypercube interconnection networks, IEEE Computer, vol. 26, no. 5, pp. 12–23, May 1993.
[12] P.T. Gaughan and S. Yalamanchili, A family of fault-tolerant routing protocols for direct multiprocessor networks, IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 5, pp. 482–497, May 1995.
[13] J.-M. Hsu and P. Banerjee, Hardware support for message routing in a distributed memory multicomputer, in Proc. 1990 Int. Conf. Parallel Processing, August 1990.
[14] Intel Scalable Systems Division, Intel Paragon Systems Manual, Intel Corporation.
[15] V. Karamcheti and A.A. Chien, Do faster routers imply faster communication?, in Parallel Computer Routing and Communication, K. Bolding and L. Snyder (ed.), Springer-Verlag, pp. 1–15, 1994.
[16] R.E. Kessler and J.L. Schwarzmeier, CRAY T3D: A new dimension for Cray Research, in Proc. Compcon, pp. 176–182, Spring 1993.
[17] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam, The Stanford DASH multiprocessor, IEEE Computer, vol. 25, no. 3, pp. 63–79, March 1992.
[18] T. Mowry, M. Lam and A. Gupta, Design and evaluation of a compiler algorithm for prefetching, in Proc. 5th Int. Conf. Architectural Support for Programming Languages and Operating Systems, October 1992.
[19] S.L. Scott and J.R. Goodman, The impact of pipelined channels on k-ary n-cube networks, IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 1, pp. 2–16, January 1994.
[20] T. von Eicken, D.E. Culler, S.C. Goldstein and K.E. Schauser, Active messages: A mechanism for integrated communication and computation, in Proc. 19th Int. Symp. Computer Architecture, June 1992.