Deadlock- and Livelock-Free Routing Protocols for Wave Switching


José Duato, Pedro López
Facultad de Informática
Universidad Politécnica de Valencia
P.O.B Valencia, SPAIN
jduato@gap.upv.es

Sudhakar Yalamanchili
Computer Systems Research Laboratory
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia
sudha@ee.gatech.edu

Abstract

Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. In this paper, we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free.

1. Introduction

Distributed memory multiprocessors rely on an interconnection network to exchange information between nodes. In multicomputers [2], each processor has a local address space and the interconnection network is used for message passing between processors.
In distributed shared-memory multiprocessors (DSMs), the interconnection network is used either to access remote memory locations [16] or to support a cache coherence protocol [17]. Nowadays, these architectures use a similar hardware support to implement the interconnection network [14, 16, 17]. State-of-the-art interconnection networks use low dimensional topologies and wormhole switching [5].

(Supported by the Spanish CICYT under Grant TIC C02 01.)

Multicomputers usually send messages by calling a system function. This system call has a considerable overhead due to buffer allocation at source and destination nodes, message copying between user and kernel space, packetization, in-order delivery and end-to-end flow control. Even for a very efficient messaging layer based on active messages [20], software overhead accounts for 50 to 70% of the total cost [15]. Therefore, reducing the network hardware latency has a minimal impact on performance. On the other hand, messages are directly sent by the hardware in DSMs, as a consequence of remote memory accesses or coherence commands. Reducing the network hardware latency and increasing network throughput is crucial to improve the performance of DSMs.

Satisfying the requirements of multicomputers and DSMs is not a trivial task. Wormhole routers are simple and fast. The main limitation of wormhole switching is the contention produced by blocked messages. Those messages remain in the network, preventing the use of the channels they occupy and wasting channel bandwidth. Virtual channels can increase throughput considerably by dynamically sharing the physical bandwidth among several messages [7]. Another approach to reduce contention and improve channel utilization consists of using adaptive routing [11]. Adaptive routing algorithms must be carefully designed to avoid deadlocks [8, 9]. Virtual channels can be combined with adaptive routing to maximize throughput.
Unfortunately, virtual channels and adaptive routing make the router more complex, increasing node delay [4]. As a consequence, latency may increase. Latency can also be reduced by using an appropriate mapping of processes to processors, exploiting spatial locality in communications. In many cases, this locality is not only spatial but also temporal. If the router architecture supports circuit switching, the compiler could generate instructions that instruct the router to set up a path or circuit that will be heavily used during a certain period of time. Also, once a circuit has been established there is no contention for the messages using that circuit.

It is possible to design a router architecture such that a circuit is set up and left open for future transmission of data items. As far as we know, this technique was first proposed in [3] for systolic communication. It has also been proposed in [13] for message passing. The underlying idea behind pre-established circuits is similar to the use of cache memory in a processor. A limitation common to both caches and networks is that they are limited resources. If a new circuit requires the use of some channels belonging to previously established circuits, those circuits must be torn down. Another aspect common to both caches and networks is that they may require compiler support to work more efficiently. Prefetching is an efficient technique to hide memory latency in case of cache misses [18]. Similarly, when two nodes are going to exchange several messages, the compiler could generate instructions that direct the router to set up a circuit between those nodes before that circuit is needed. When data are available, the circuit has already been established. Thus, message transfer will be faster because header routing time and contention have been eliminated.

There is an important difference between caches and circuits. Caches are much faster than core memory. However, circuits offer the same bandwidth as regular channels. If circuits were able to use faster channels, network performance would increase considerably because bandwidth would be allocated to messages that really need that bandwidth. In a previous paper, we proposed a new hybrid switching technique called wave switching as well as a router architecture supporting it [10]. Wave switching implements wormhole switching and circuit switching concurrently. By combining circuit switching, pre-established physical circuits (see footnote 1) and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits.
As shown in [10], wave switching is able to reduce latency and increase throughput by a factor higher than three if messages are long enough (≥ 128 flits), even if circuits are not reused. For short messages, wave switching can only improve performance if circuits are reused. In this paper, we propose two routing protocols to establish and tear down circuits. The first protocol handles the network as a cache of circuits. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are deadlock- and livelock-free.

The router architecture for wave switching proposed in [10] is described in section 2. Section 3 proposes two routing protocols for wave switching. Those protocols are shown to be deadlock-free and livelock-free in section 4. Finally, the paper presents some concluding remarks and directions for future work.

Footnote 1: A physical circuit is a circuit made of physical channels. A virtual circuit is a circuit made of virtual channels. Where it is irrelevant to the discussion, we will refer to them simply as circuits.

[Figure 1. Typical architecture of a wormhole router: input channels, input queues (virtual channels), switch, routing control unit, output queues (virtual channels), and the link from/to the local processor.]

2. Router Architecture for Wave Switching

When the network topology has more dimensions than the number of physical dimensions used to implement it, some channels require long wires. In this case, channel delay has a major impact on clock frequency. Some researchers have proposed the use of pipelined channels to reduce the impact of wire length on clock frequency [19]. Pipelined channels use wave pipelining. New data are injected into a channel before previously injected data have reached the other end of the channel. Propagation speed is only limited by wire capacitance. At a given time, several data are propagating along a channel.
As wires have some capacitance, the wave front is not sharp, which limits the maximum frequency at which data can be pipelined. The use of pipelined channels allows the designer to compute clock frequency independently of wire delay. Although long wires do not affect the speed of the router, clock frequency is still selected by considering routing delay and switch delay. In this section, we present a router architecture that allows higher clock frequencies, increasing bandwidth considerably. It is based on the use of wave pipelining across both switches and channels.

Assume that the router supports circuit switching and that a circuit has been established. Let us analyze the requirements to support wave pipelining. If we consider the router architecture shown in Figure 1, we can see that it is possible to pipeline flits across the switch as fast as flit buffers are able to deliver flits. Similarly, flits can be pipelined into physical channels at the same speed. However, using a higher clock frequency implies that the round-trip delay for sending flits across channels and receiving acknowledgments requires more clock cycles. As a consequence, a windowing protocol with a longer window should be used. A longer window also requires deeper buffers. Finally, deeper buffers require a longer delay to reach the first empty buffer in the worst case, therefore increasing latency and limiting the increase in clock frequency. Therefore, a different approach is required if we want to take full advantage of wave pipelining across switches and channels.

Taking into account that the circuit has been previously established, flits will not find any busy channel on their way towards the destination node. If there is no contention with other messages then there is no need for flow control, unless virtual channels are used. Assume that physical channels are not split into virtual channels. In this case, circuits use physical channels and flow control can be removed. As a consequence, flit buffers are no longer required and every switch transmits information directly to the switch in the next node. However, a careful design is required to minimize the skew between wires in a parallel data path. Synchronizers are required at each delivery channel. Synchronizers may also be required at each switch input to reduce the skew. This switching technique is known as circuit switching.

With circuit switching, it is possible to use wave pipelining across switches and physical channels, clocking at a very high frequency. Basically, clock frequency is limited by memory bandwidth, by signal skew and by latch setup time. With a proper router and memory design, this frequency can be much higher than the one used in current routers, increasing channel bandwidth and network throughput accordingly. As shown in [10], circuit simulations using Spice indicated that clock frequency could be up to four times higher than in a wormhole router using the same technology. Note that latency for sending information in pre-established physical circuits is also considerably reduced because flit buffers have been removed.
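The window-sizing argument made earlier in this section (a higher clock frequency means more clock cycles per round trip, hence a longer window and deeper buffers) can be illustrated with a small calculation. This is only a sketch: it assumes the window must cover at least one round trip measured in clock cycles, and the delay and clock values below are hypothetical, not figures from [10].

```python
def window_size(round_trip_ps: int, clock_period_ps: int) -> int:
    """Flits sent before the first acknowledgment can come back.

    Assumption (not from the paper): one flit is sent per clock cycle, so the
    window must cover the round-trip delay expressed in clock cycles.
    """
    if round_trip_ps <= 0 or clock_period_ps <= 0:
        raise ValueError("delays must be positive")
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-round_trip_ps // clock_period_ps)


# Hypothetical numbers: the same 20 ns round trip seen by two clock periods.
slow_clock_window = window_size(20_000, 10_000)  # 10 ns clock period
fast_clock_window = window_size(20_000, 2_500)   # 2.5 ns clock period
print(slow_clock_window, fast_clock_window)
```

Under these assumed numbers, a clock four times faster needs a window (and therefore buffers) four times deeper for the same wire delay, which is the cost that removing link-level flow control avoids.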
However, pre-established physical circuits require the use of dedicated physical channels to allow multiple circuits between routers. Also, some mechanism is required to set up and tear down circuits. This can be solved by using a hybrid router architecture as shown in Figure 2. This architecture implements wormhole switching and circuit switching concurrently. Circuits are built using physical channels. However, physical circuits are set up and torn down by using a set of dedicated virtual channels, which will be referred to as control channels. Wormhole switching uses another set of dedicated virtual channels.

This router architecture has several switches $S_0, S_1, \ldots, S_k$ and two routing control units. One of them, together with switch $S_0$, implements wormhole switching. The second routing control unit implements pipelined circuit switching (PCS) [12] as follows: the remaining switches $S_1, \ldots, S_k$ implement circuit switching on pre-established physical circuits using wave pipelining. As circuit switching does not provide flow control at the link level, physical channels are split into narrower physical channels. Although there is no flow control at the link level, end-to-end flow control is required between the injection buffer at the source node and the delivery buffer at the destination node. Taking into account the round-trip delay for control signals (i.e., acknowledgments), a windowing protocol is implemented. This protocol requires deep delivery buffers to prevent buffer overflow while acknowledgments are transmitted towards the source node of the message.

[Figure 2. New router architecture: input channels, pipelined input channels with synchronizers, switches $S_0, S_1, \ldots, S_k$, control channels, WH routing control unit, PCS routing control unit, multiplexers, pipelined output channels, output channels, and the link from/to the local processor.]

Each physical channel in switch $S_0$ is split into $k + w$ virtual channels.
Among them, $k$ channels are the control channels associated with the corresponding physical channels in switches $S_1, \ldots, S_k$. Control channels are only used to set up and tear down physical circuits. These channels have capacity for a single flit because they only transmit control flits. Control channels are handled by the PCS routing control unit. The remaining $w$ virtual channels are used to transmit messages using wormhole switching and require deeper buffers. They are handled by the wormhole routing control unit, as mentioned above. The hybrid switching technique implemented by this router architecture will be referred to as wave switching.

The proposed router architecture allows sending messages using wormhole switching. Messages are routed using either a deterministic or an adaptive routing algorithm, blocking if necessary on busy channels. Thus, the routing algorithm must be deadlock-free. The router also allows establishing physical circuits on switches $S_1, \ldots, S_k$. These circuits are established by sending a probe through the control channels. In order to maximize the probability of establishing a circuit, a misrouting backtracking protocol with a maximum of $m$ misroutes is used (MB-m) [12]. Once the physical circuit has been reserved, an acknowledgment is returned. The only difference with respect to PCS proposed in [12] is that the path being reserved is formed by a different set of channels, i.e., those using switches $S_1, \ldots, S_k$. Therefore, the probe only reserves a fragment of a physical circuit if the corresponding channels (a bidirectional control channel and the associated physical channel in switch $S_i$, $i \in [1..k]$) are free. Both of them are reserved at the same time. Therefore, status registers are only needed for control channels in the PCS routing control unit. The control channel is needed to backtrack if the need arises, to return the acknowledgment, and to release the physical circuit. Once a circuit is no longer required, it is torn down by sending a control flit through the same path using the control channels.

Figure 3 shows the status registers associated with the PCS routing control unit. The Channel Status registers indicate whether the corresponding channel is free or busy. They can be easily extended to handle faulty channels. These registers are associated with each output control channel. The direct and reverse mappings between input and output channels are stored in the Direct Channel Mappings and Reverse Channel Mappings registers, respectively.
As mentioned above, the reverse path is required to return acknowledgments. The History Store keeps track of the output links that have already been searched by the probe. Together, the History Store registers across the whole network keep track of the paths already searched by each probe, therefore avoiding the repeated search of the same path. By storing information about searched links in the History Store register, the probe is kept small. Finally, a control bit associated with each output control channel (Ack Returned) indicates whether the acknowledgment for path setup has been returned through that channel.

[Figure 3 register labels: Channel Status, Direct Channel Mappings, Reverse Channel Mappings, History Store, Ack Returned.]

[Figure 4. Format of a routing probe: Header, Backtrack, Misroute, Force, X1-offset, ..., Xn-offset.]

The format of a routing probe is shown in Figure 4. The Header bit identifies the flit as a probe. The Backtrack bit indicates whether the probe is progressing or backtracking. The Misroute field indicates the number of misrouting operations performed by the probe. The Force bit is used by one of the routing protocols to force channel release. The remaining fields are offsets from the destination node.

The circuits starting at each node are recorded in a special set of registers denoted as Circuit Cache. Those registers are located in the network interface of every node. Figure 5 shows the structure of those registers. When a circuit uses a switch $S_i$, $i \in [1..k]$, at the source node, it uses the same switch $S_i$ at every intermediate node. The Switch field indicates the switch being searched by the probe or the switch used by the circuit once the path has been set up. The Channel field indicates the output channel used by the circuit at the source node. The Dest field indicates the destination node of the circuit. If a probe does not succeed in establishing a circuit across some switch, it may try other switches depending on the routing protocol.
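As an illustration of the probe format of Figure 4, the following sketch packs the listed fields into a single integer and unpacks them again. The field order follows the figure, but the field widths (`MISROUTE_BITS`, `OFFSET_BITS`), the two's-complement encoding of offsets, and the names `Probe`, `pack` and `unpack` are assumptions made for this example; the paper does not specify them.

```python
from dataclasses import dataclass
from typing import Tuple

MISROUTE_BITS = 2  # assumption: enough to count up to m = 3 misroutes
OFFSET_BITS = 5    # assumption: signed per-dimension offsets in [-16, 15]


@dataclass(frozen=True)
class Probe:
    backtrack: bool           # progressing (False) or backtracking (True)
    misroutes: int            # misrouting operations performed so far
    force: bool               # forces channel release when set
    offsets: Tuple[int, ...]  # X1-offset ... Xn-offset from the destination
    header: bool = True       # identifies the flit as a probe


def pack(probe: Probe) -> int:
    """Concatenate the fields, most significant first, in Figure 4 order."""
    assert 0 <= probe.misroutes < (1 << MISROUTE_BITS)
    word = int(probe.header)
    word = (word << 1) | int(probe.backtrack)
    word = (word << MISROUTE_BITS) | probe.misroutes
    word = (word << 1) | int(probe.force)
    for off in probe.offsets:
        assert -(1 << (OFFSET_BITS - 1)) <= off < (1 << (OFFSET_BITS - 1))
        word = (word << OFFSET_BITS) | (off & ((1 << OFFSET_BITS) - 1))
    return word


def unpack(word: int, n_dims: int) -> Probe:
    """Inverse of pack for a probe carrying n_dims offsets."""
    offsets = []
    for _ in range(n_dims):
        raw = word & ((1 << OFFSET_BITS) - 1)
        word >>= OFFSET_BITS
        if raw >= 1 << (OFFSET_BITS - 1):  # restore the sign
            raw -= 1 << OFFSET_BITS
        offsets.append(raw)
    offsets.reverse()  # fields were read last-to-first
    force = bool(word & 1)
    word >>= 1
    misroutes = word & ((1 << MISROUTE_BITS) - 1)
    word >>= MISROUTE_BITS
    backtrack = bool(word & 1)
    word >>= 1
    header = bool(word & 1)
    return Probe(backtrack, misroutes, force, tuple(offsets), header)


probe = Probe(backtrack=False, misroutes=2, force=True, offsets=(3, -4))
assert unpack(pack(probe), n_dims=2) == probe
```

Because the searched links are kept in the History Store rather than in the probe, the probe in this sketch needs only a few bits beyond the destination offsets.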
The Initial Switch field records the first switch tried, to avoid repeating the search. The Ack Returned field indicates that the acknowledgment of path setup has been returned and the circuit is ready to be used. The remaining fields are used for replacement purposes. The In-use bit is set when there is a message in transit, thus preventing the circuit from being released until message transmission has finished. This bit is reset when the source node receives the acknowledgment for the last fragment of the message. The Replace field stores accounting information regarding the use of the circuit. The meaning of this field depends on the replacement algorithm.

[Figure 3. Status registers of the PCS routing control unit. Figure 5. Structure of the Circuit Cache registers: each entry holds Initial Switch, Switch, Channel, Dest., Ack Returned, In-use, and Replace.]

Depending on the requirements of the applications running on the machine, the interconnection network may be designed with a single or several wave pipelined switches. Note that if a single wave pipelined switch per node is used, each pair of adjacent routers can only have one link capable of wave pipelining between them. On the other hand, splitting physical channels into narrower physical channels shares bandwidth in a very inflexible way. Additionally, a few control signals must be provided for each individual physical channel. Thus, it is not recommended to split each channel into many narrow physical channels.

In addition to a higher network bandwidth and lower latency for messages using pre-established physical circuits, the proposed router architecture has some interesting advantages. Circuits are established by sending a probe. The probe uses the MB-m protocol, being allowed to backtrack if it cannot proceed forward. This protocol is very resilient to static faults in the network, as indicated in [12]. Also, once a circuit has been established between two nodes, in-order delivery is guaranteed for all the messages transmitted between those nodes. Additionally, for message passing, software overhead associated with message transmission can be considerably reduced if message buffers are allocated at both ends when the circuit is established. Those buffers will be reused by all the messages using the circuit. If the circuit is explicitly established by the programmer and/or the compiler for a set of messages, buffer size is determined by the longest message of the set. On the other hand, if the circuit is automatically established the first time a node sends a message to another node, the size of the longest message using that circuit is not known at that time. A reasonably large buffer can be allocated. In this case, buffers may have to be re-allocated for longer messages.

Also, note that channel width in low dimensional topologies is limited by pin count. More wires are allowed across the network bisection [1, 6]. It can be seen in Figure 2 that only a few control signals connect the PCS routing control unit with each switch $S_i$, $i \in [1..k]$.
For very high performance, several switches per node can be used, each one being implemented in its own chip. In this case, channel bandwidth does not decrease when the number of switches increases, assuming that this number is small. As a consequence, scalability is excellent because the number of switches (chips) per node can increase as network size increases, thus compensating for the higher average distance traveled by messages. Such an architecture design follows a multi-chip implementation approach similar to the Cray T3D routers, wherein each dimensional crossbar is a separate chip. This effectively removes from consideration the pin-out limitations between adjacent nodes, since the pin requirements between these chips are quite small. The interesting design question then becomes how best to use the bisection bandwidth resource that is determined by the packaging technology.

Finally, the proposed router architecture is very flexible. It can be tailored to different requirements. Several parameters can be adjusted, including the number of fast switches, the number of virtual channels for wormhole switching, and the routing protocols for wormhole switching and PCS. The simplest version of the wave router is obtained by setting $k = 1$ and $w = 0$. In this case, all the messages use PCS.

3. Routing Protocols

In this section, we propose two routing protocols for wave switching. The first routing protocol automatically establishes a circuit when a node sends a message to another node and no circuit exists between them. When a circuit is being established and all the requested channels have been previously reserved by other circuits, a replacement algorithm selects a circuit. This circuit is torn down, releasing the channels forming it, and allowing the establishment of the new circuit. This protocol works like cache protocols. A cache line is brought from main memory every time a miss occurs.
When a line is required and the cache is full, a replacement algorithm selects a line to be removed from the cache. This protocol will be referred to as Cache-Like Routing Protocol (CLRP). The second protocol relies on the programmer and/or the compiler to determine when a circuit should be established or torn down. Circuits are only established when there is enough temporal communication locality so that it is worth establishing a circuit. When communication between two nodes is not frequent enough, messages are sent using wormhole switching. This protocol will be referred to as Compiler Aided Routing Protocol (CARP). A circuit should only be requested if there are enough free physical channels to build it. However, the routing protocol should consider the case where all the requested channels are busy. This protocol is similar to prefetching for caches. When a circuit is going to be heavily used, it is established in advance. The main difference between caches and networks regarding this protocol is that a circuit should be explicitly torn down when it is no longer needed. We believe that the CARP protocol is able to achieve a higher performance because a circuit is only established when there is enough temporal communication locality. By doing so, every message or set of messages uses the best suited switching technique. In particular, the CARP protocol does not establish circuits for individual short messages. Moreover, when buffers are allocated for transmitting a set of messages, buffer size is large enough to store the longest message, therefore avoiding buffer re-allocation. Additionally, channels are a scarce resource. The CARP protocol allows the use of global optimization algorithms. However, developing a suitable compiler support for the CARP protocol may take several years. On the other hand, the CLRP protocol does not need any compiler or programmer support. 
Therefore, it can be used for the first generation of multiprocessors implementing wave switching in their interconnection network.

Cache-Like Routing Protocol

The CLRP protocol for message transmission uses the Force bit. When the Force bit is not set, the probe backtracks if it does not find a free valid channel. If the Force bit is set, the probe does not backtrack, tearing circuits down to obtain the required channels. The most general form of this protocol is as follows:

When a node sends a message to another node, the source node reads the Circuit Cache register to see if a circuit exists for the requested destination. If it exists, the circuit is used. Otherwise, a circuit is established in several phases, possibly tearing down other circuits, and storing the corresponding entry in the Circuit Cache.

In the first phase, a switch $S_i$, $i \in [1..k]$, with a free output channel is selected at the source node, and a probe with the Force bit reset is sent to establish a physical circuit. The selected switch is recorded in the Initial Switch field of the corresponding entry in the Circuit Cache register. It is convenient that neighboring nodes try to use different initial switches. For example, in a 2D-mesh, node $(x, y)$ can first try switch $1 + ((x + y) \bmod k)$. At each intermediate node, a free output channel from switch $S_i$ is selected. As mentioned above, the probe uses the MB-m protocol, being allowed to misroute up to $m$ times, and to backtrack if it cannot proceed forward. If the circuit is successfully reserved, an acknowledgment is returned. If it is not possible to establish a circuit, and the probe backtracks up to the source node, the next switch modulo $k$ is tried. The current switch is recorded in the Switch field. The Initial Switch field prevents the probe from searching the same circuit twice.

If it is not possible to establish a circuit across any switch, the Force bit is set in the probe, sending it again across switch Initial Switch. This is the second phase of the protocol. If the probe cannot proceed forward at some node (including the source node), it selects a circuit from the Circuit Cache such that it uses one of the requested channels and has the Ack Returned bit set. This circuit starts at the current node. Therefore, it can be torn down without interrupting any message in transit.
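The switch-selection rule of phase one can be sketched as follows. The formula for the initial switch is the 2D-mesh example given above; the helper names `initial_switch` and `switch_order`, and the reading of "next switch modulo k" as wrapping from $S_k$ back to $S_1$, are assumptions of this sketch.

```python
def initial_switch(x: int, y: int, k: int) -> int:
    """First switch tried by node (x, y) in a 2D mesh: 1 + ((x + y) mod k)."""
    return 1 + (x + y) % k


def switch_order(initial: int, k: int) -> list:
    """Switches S_1..S_k in the order they are tried, starting at `initial`.

    Assumption: "next switch modulo k" wraps from S_k back to S_1, and the
    Initial Switch field stops the search after k distinct switches.
    """
    return [1 + (initial - 1 + step) % k for step in range(k)]


k = 4
print(initial_switch(2, 3, k))                   # node (2, 3)
print(switch_order(initial_switch(2, 3, k), k))  # full search order
```

With $k \ge 2$, two nodes that are adjacent in the mesh differ by one in $x + y$, so this rule always gives them different initial switches.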
Once the message currently using that circuit (if any) has been sent, the In-use bit is reset and the circuit is torn down, thus releasing the requested channel. It may happen that all the requested channels belong to circuits being set up or circuits that cross the current node but start at different nodes. In that case, a circuit crossing the node is selected among those that returned the acknowledgment flit (those that have the Ack Returned bit set in the PCS routing control unit). A control flit is sent towards the source node of that circuit, requesting its source node to release it. Once the message currently using that circuit (if any) has been sent, the circuit is torn down, thus releasing the requested channel. Therefore, the probe can proceed reserving the circuit. In the very unlikely case that all the outgoing channels of a node belong to circuits currently being established, the probe backtracks even if the Force bit is set. If it is not possible to establish a circuit, and the probe backtracks up to the source node, the next switch modulo $k$ is tried.

If it is not possible to establish a circuit across any switch, the protocol enters the third phase. In this phase, the message is transmitted using wormhole switching through $S_0$ switches.

The CLRP protocol can be simplified in several ways. First, when a circuit cannot be established by using Initial Switch, the Force bit can be set without trying the remaining switches. Similarly, the second phase may try a single switch. Second, the Force bit can be set when the probe is first sent to establish the circuit, therefore skipping phase one. The optimal protocol depends on the number of physical switches per node, and on the applications. It can only be tuned by using traces from real applications. This falls outside the scope of this paper.

Compiler Aided Routing Protocol

The CARP protocol is much simpler.
However, it requires that the compiler and/or the programmer generate instructions to set up and tear down a circuit. It works as follows:

The compiler and/or the programmer decide whether a physical circuit should be established for a set of messages. For those messages not requiring pre-established circuits, wormhole switching is used across $S_0$ switches.

When a physical circuit is requested, a switch $S_i$, $i \in [1..k]$, is selected and a probe is sent to establish it. As mentioned above, it is convenient that neighboring nodes try to use different initial switches. At each intermediate node, a free output channel from switch $S_i$ is selected. As mentioned above, the probe uses the MB-m protocol, being allowed to misroute up to $m$ times, and to backtrack if it cannot proceed forward. If the circuit is successfully reserved, it will be used for the corresponding messages. When the circuit is no longer required, it is explicitly torn down by generating the appropriate instructions.

If it is not possible to establish a circuit, the partially reserved path is torn down and the next switch modulo $k$ is tried. If it is not possible to establish a circuit across any switch, messages requesting that circuit will have to use wormhole switching through $S_0$ switches.

4. Deadlock and Livelock Avoidance

For the proof of deadlock freedom, we rely on the properties of routing protocols for PCS and wormhole switching. In [12], it was shown that the MB-m protocol is deadlock-free. The basic idea is that a probe can always backtrack up to the source, therefore releasing all the resources it previously reserved, and allowing other probes to advance. Also, in [5, 8, 9], it was shown how to design deadlock-free routing algorithms for wormhole switching. Basically, for deterministic routing algorithms there should be no cyclic dependencies between channels.
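The acyclicity condition for deterministic wormhole routing can be checked mechanically on small networks. The sketch below builds a channel dependency graph and searches it for cycles. Dimension-order (XY) routing on a 3x3 mesh and clockwise routing on a 4-node unidirectional ring are familiar textbook examples chosen for this illustration; they are not routing algorithms prescribed by this paper, and all function names are hypothetical.

```python
from itertools import product


def xy_route(src, dst):
    """Channels, as (from_node, to_node), used by XY routing on a 2D mesh."""
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:  # correct the X offset first
        nx = x + (1 if dx > x else -1)
        path.append(((x, y), (nx, y)))
        x = nx
    while y != dy:  # then the Y offset
        ny = y + (1 if dy > y else -1)
        path.append(((x, y), (x, ny)))
        y = ny
    return path


def ring_route(src, dst, n):
    """Clockwise-only routing on an n-node unidirectional ring."""
    path, node = [], src
    while node != dst:
        nxt = (node + 1) % n
        path.append((node, nxt))
        node = nxt
    return path


def dependency_graph(routes):
    """Edge c1 -> c2 when some route requests c2 while holding c1."""
    graph = {}
    for path in routes:
        for channel in path:
            graph.setdefault(channel, set())
        for c1, c2 in zip(path, path[1:]):
            graph[c1].add(c2)
    return graph


def has_cycle(graph):
    """Iterative three-colour depth-first search for a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in graph}
    for start in graph:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if colour[child] == GREY:  # back edge: a cycle exists
                    return True
                if colour[child] == WHITE:
                    colour[child] = GREY
                    stack.append((child, iter(graph[child])))
                    break
            else:  # all children explored
                colour[node] = BLACK
                stack.pop()
    return False


nodes = list(product(range(3), range(3)))
mesh_routes = [xy_route(s, d) for s in nodes for d in nodes if s != d]
ring_routes = [ring_route(s, d, 4) for s in range(4) for d in range(4) if s != d]
mesh_cyclic = has_cycle(dependency_graph(mesh_routes))
ring_cyclic = has_cycle(dependency_graph(ring_routes))
print(mesh_cyclic, ring_cyclic)
```

In this sketch the mesh with XY routing yields an acyclic graph, while the ring without virtual channels yields a cycle, which is the situation the condition above rules out.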
For adaptive routing, cyclic dependencies are allowed provided that there exists a subset of channels without cyclic dependencies between them. The following theorems prove that the proposed protocols are deadlock-free.

Theorem 1. The CLRP protocol is deadlock-free.

Proof: The proof proceeds by analyzing all the possible cases. The use of previously reserved circuits does not produce deadlock because messages do not request more resources. No deadlock can arise while establishing a circuit with the Force bit reset in phase one because the misrouting backtracking protocol MB-m is deadlock-free. When the Force bit is set (phase two), the probe may block at a node waiting on a previously established circuit to be torn down. If that circuit starts at the current node, it is immediately torn down unless there is a message in transit. If there is a message in transit, it will take a finite amount of time to end the transmission because messages have a finite length, and the destination node will accept the message because the circuit was previously established. If the circuit does not start at the current node, an already established circuit crossing that node is selected. The selection is done in finite time. After selecting a circuit, a control flit is sent towards the source node of the circuit using control channels. Those control channels are free because once the acknowledgment for path setup is returned, no other traffic crosses those channels towards the source node. When the control flit reaches the source node, the circuit will be released in finite time, just after ending the transmission of the current message (if any). It may happen that the circuit is being released while the control flit advances towards the source node of the circuit. In this case, the control flit is discarded at some intermediate node and the circuit is released in finite time. Also, it may happen that two different nodes send control flits requesting the same circuit to be released. The first control flit will initiate circuit releasing. The second control flit will be discarded, as indicated above. Both probes requesting channels from the circuit at different nodes will be able to reserve those channels in finite time. If all the requested channels at a given node belong to circuits being established, the probe does not block, avoiding deadlock by backtracking to the previous node.
If the probe is not able to establish any circuit, it uses wormhole switching, blocking if necessary on busy channels. This does not produce deadlock because the routing algorithm used for wormhole switching is deadlock-free. Additionally, PCS and wormhole switching do not interact. Each switching technique uses its own set of resources (routing control unit, switches and channels). Therefore, the CLRP protocol is deadlock-free. □

It could be thought that a probe may block at a node if all the requested channels belong to circuits being established. However, waiting on busy channels while keeping previously reserved channels produces channel dependencies. As probes use the MB-m protocol, there would be cyclic dependencies between channels. Therefore, each probe could be waiting for a channel that will never be released, producing a deadlock. Thus, deadlock is avoided by backtracking.

Theorem 2. The CARP protocol is deadlock-free.

Proof: Establishing a circuit cannot produce deadlock because the misrouting backtracking protocol MB-m is deadlock-free. If the probe is not able to establish any circuit, the message uses wormhole switching, blocking if necessary on busy channels. This does not produce deadlock because the routing algorithm used for wormhole switching is deadlock-free. Additionally, PCS and wormhole switching do not interact. Each switching technique uses its own set of resources. Therefore, the CARP protocol is deadlock-free. □

Guaranteeing the absence of deadlock is not enough because the proposed routing protocols use misrouting and backtracking. Therefore, a probe could be trying to establish a circuit forever, never blocking but never reaching its destination. As above, we rely on the properties of routing protocols for PCS and wormhole switching. The misrouting backtracking protocol MB-m is livelock-free because misrouting is limited to m misroutes. Also, when a probe backtracks, it does not search again the same path because the History Store in the PCS routing control unit keeps track of the paths already searched. As the number of paths in a network is finite, MB-m is livelock-free. Also, minimal routing algorithms for wormhole switching are livelock-free. The following theorems prove that the proposed protocols are livelock-free.

Theorem 3. The CLRP protocol is livelock-free.

Proof: The proof proceeds by analyzing all the possible cases. As MB-m is livelock-free, a probe with the Force bit reset will either succeed reserving a path or will return to the source node after exhausting the search of paths using switch S_i. The number of switches is finite and the Initial Switch field prevents a probe from using the same switch twice. Therefore, a probe cannot be trying to establish circuits forever in phase one. When the Force bit is set and there exists a previously established circuit starting at the current node or crossing it, that circuit will be released, and the probe will be able to make progress toward its destination. If all the output channels at a node belong to circuits currently being established, the probe backtracks. Livelock is avoided by using the History Store, therefore preventing the probe from visiting a previously visited node. If the probe backtracks up to the source node with the Force bit set after exhausting the search of all the paths, the protocol enters the third phase and minimal routing is used through S_0 switches. As minimal routing is livelock-free, the CLRP protocol is livelock-free. □

Theorem 4. The CARP protocol is livelock-free.

Proof: As MB-m is livelock-free, a probe will either succeed reserving a path or will return to the source node after exhausting the search of paths using switch S_i. The number of switches is finite and the Initial Switch field prevents a probe from using the same switch twice. Therefore, a probe cannot be trying to establish circuits forever. If the probe backtracks up to the source node after exhausting the search of all the paths, minimal routing is used through S_0 switches. As minimal routing is livelock-free, the CARP protocol is livelock-free. □
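The livelock argument rests on two facts: misrouting is bounded by m, and a probe never searches the same path twice. The Python sketch below illustrates why such a search terminates. It is a simplified software model under stated assumptions, not the MB-m hardware: the 2D mesh, the Manhattan distance, the static `busy` set of reserved links, and all function names are hypothetical, and `history` only loosely mirrors the role of the History Store.

```python
def neighbors(node, width, height):
    """Yield the mesh neighbors of `node` (assumed 2D mesh topology)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def distance(a, b):
    """Manhattan distance between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def probe_search(src, dst, busy, width, height, m):
    """Search for a free path using at most `m` misroutes.

    Returns the list of nodes on the reserved path, or None once every
    candidate path has been tried (the probe is back at the source).
    """
    path = [src]
    used = [0]              # misroutes consumed to reach each path node
    history = {src: set()}  # links already tried from each node on the path
    while path:
        node = path[-1]
        if node == dst:
            return path
        chosen = None
        # Try profitable links (those that reduce the distance) first.
        for nxt in sorted(neighbors(node, width, height),
                          key=lambda n: distance(n, dst)):
            if nxt in history[node] or nxt in history or (node, nxt) in busy:
                continue    # already searched, on the path, or reserved
            cost = 0 if distance(nxt, dst) < distance(node, dst) else 1
            if used[-1] + cost > m:
                continue    # would exceed the misroute limit
            chosen, chosen_cost = nxt, cost
            break
        if chosen is None:
            # Backtrack one hop; the misroute budget is restored implicitly.
            path.pop()
            used.pop()
            del history[node]
            continue
        history[node].add(chosen)
        path.append(chosen)
        used.append(used[-1] + chosen_cost)
        history[chosen] = set()
    return None
```

Each iteration either extends the path over a link not yet tried from the current prefix or backtracks, so the loop enumerates every loop-free path with at most m misroutes at most once and then stops. For example, on a 3x3 mesh with the link from (0, 0) to (1, 0) reserved, a probe from (0, 0) to (2, 0) fails with m = 0 and succeeds with m = 1 by misrouting once through (0, 1).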

5. Conclusions

Wave switching is a new hybrid switching technique that exploits communication locality by combining circuit switching and wormhole switching. A wave router has two or more switches. One of the switches uses standard wormhole switching. The remaining switches implement circuit switching. These switches achieve a higher bandwidth by using wave pipelining across switches and channel wires. As shown in [10], wave switching is able to reduce latency and increase throughput by a factor higher than three if messages are long enough (128 flits), even if circuits are not reused. The wormhole switch can be used to transmit individual messages for which circuit switching is not efficient.

The new switching technique is aimed at reducing latency and increasing throughput by taking advantage of spatial and temporal communication locality. Instead of optimizing the transmission of individual messages, it optimizes the overall communication between pairs of processors by allowing the construction of high-bandwidth physical circuits. It also reduces the overhead of the software messaging layer in multicomputers by offering better hardware support. In particular, message buffers can be allocated at both ends when the physical circuit is established. Those buffers will be reused by all the messages using the physical circuit. Additionally, in-order delivery and tolerance to static faults in the network are guaranteed for all the messages using physical circuits.

In this paper, we have proposed two routing protocols for wave switching. The first routing protocol (Cache-Like Routing Protocol, CLRP) automatically establishes a circuit when a node sends a message to another node and no circuit exists between them. When a circuit is being established and all the requested channels have been previously reserved by other circuits, a replacement algorithm selects a circuit.
This circuit is torn down, releasing the channels forming it, and allowing the establishment of the new circuit. The second protocol (Compiler Aided Routing Protocol, CARP) relies on the programmer and/or the compiler to determine when a circuit should be established or torn down. Circuits are only established when there is enough temporal communication locality so that it is worth establishing a circuit. When communication between two nodes is not frequent enough, messages are sent using wormhole switching. Additionally, we have shown that the proposed protocols are deadlock-free and livelock-free, therefore guaranteeing that every message will reach its destination in finite time.

References

[1] A. Agarwal, Limits on interconnection network performance, IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 4, pp , October
[2] W.C. Athas and C.L. Seitz, Multicomputers: Message-passing concurrent computers, IEEE Computer, vol. 21, no. 8, pp. 9-24, August
[3] S. Borkar et al., iWarp: An integrated solution to high-speed parallel computing, in Proc. Supercomputing 88, November
[4] A.A. Chien, A cost and speed model for k-ary n-cube wormhole routers, in Proc. Hot Interconnects 93, August
[5] W.J. Dally and C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Computers, vol. C-36, no. 5, pp , May
[6] W.J. Dally, Express cubes: Improving the performance of k-ary n-cube interconnection networks, IEEE Trans. Computers, vol. C-40, no. 9, pp , September
[7] W.J. Dally, Virtual-channel flow control, IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp , March
[8] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 12, pp , December
[9] J. Duato, A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 10, pp , October
[10] J. Duato, P. López, F. Silla and S. Yalamanchili, A high performance router architecture for interconnection networks, in Proc Int. Conf. Parallel Processing, August
[11] P.T. Gaughan and S. Yalamanchili, Adaptive routing protocols for hypercube interconnection networks, IEEE Computer, vol. 26, no. 5, pp , May
[12] P.T. Gaughan and S. Yalamanchili, A family of fault tolerant routing protocols for direct multiprocessor networks, IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 5, pp , May
[13] J.-M. Hsu and P. Banerjee, Hardware support for message routing in a distributed memory multicomputer, in Proc Int. Conf. Parallel Processing, August
[14] Intel Scalable Systems Division, Intel Paragon Systems Manual, Intel Corporation.
[15] V. Karamcheti and A.A. Chien, Do faster routers imply faster communication?, in Parallel Computer Routing and Communication, K. Bolding and L. Snyder (eds.), Springer-Verlag, pp. 1-15,
[16] R.E. Kessler and J.L. Schwarzmeier, CRAY T3D: A new dimension for Cray Research, in Compcon, pp , Spring
[17] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam, The Stanford DASH multiprocessor, IEEE Computer, vol. 25, no. 3, pp , March
[18] T. Mowry, M. Lam and A. Gupta, Design and evaluation of a compiler algorithm for prefetching, in Proc. 5th Int. Conf. Architectural Support for Programming Languages and Operating Systems, October
[19] S.L. Scott and J.R. Goodman, The impact of pipelined channels on k-ary n-cube networks, IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 1, pp. 2-16, January
[20] T. von Eicken, D.E. Culler, S.C. Goldstein and K.E. Schauser, Active messages: a mechanism for integrated communication and computation, Proc. 19th Int. Symp. Computer Architecture, June 1992.


More information

The final publication is available at

The final publication is available at Document downloaded from: http://hdl.handle.net/10251/82062 This paper must be cited as: Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The

More information

Evaluation of NOC Using Tightly Coupled Router Architecture

Evaluation of NOC Using Tightly Coupled Router Architecture IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 01-05 www.iosrjournals.org Evaluation of NOC Using Tightly Coupled Router

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

MESH-CONNECTED networks have been widely used in

MESH-CONNECTED networks have been widely used in 620 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 5, MAY 2009 Practical Deadlock-Free Fault-Tolerant Routing in Meshes Based on the Planar Network Fault Model Dong Xiang, Senior Member, IEEE, Yueli Zhang,

More information

NOW Handout Page 1. Outline. Networks: Routing and Design. Routing. Routing Mechanism. Routing Mechanism (cont) Properties of Routing Algorithms

NOW Handout Page 1. Outline. Networks: Routing and Design. Routing. Routing Mechanism. Routing Mechanism (cont) Properties of Routing Algorithms Outline Networks: Routing and Design Routing Switch Design Case Studies CS 5, Spring 99 David E. Culler Computer Science Division U.C. Berkeley 3/3/99 CS5 S99 Routing Recall: routing algorithm determines

More information

Interconnect Technology and Computational Speed

Interconnect Technology and Computational Speed Interconnect Technology and Computational Speed From Chapter 1 of B. Wilkinson et al., PARAL- LEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers, augmented

More information

EE382C Lecture 1. Bill Dally 3/29/11. EE 382C - S11 - Lecture 1 1

EE382C Lecture 1. Bill Dally 3/29/11. EE 382C - S11 - Lecture 1 1 EE382C Lecture 1 Bill Dally 3/29/11 EE 382C - S11 - Lecture 1 1 Logistics Handouts Course policy sheet Course schedule Assignments Homework Research Paper Project Midterm EE 382C - S11 - Lecture 1 2 What

More information

A closer look at network structure:

A closer look at network structure: T1: Introduction 1.1 What is computer network? Examples of computer network The Internet Network structure: edge and core 1.2 Why computer networks 1.3 The way networks work 1.4 Performance metrics: Delay,

More information

Deadlock: Part II. Reading Assignment. Deadlock: A Closer Look. Types of Deadlock

Deadlock: Part II. Reading Assignment. Deadlock: A Closer Look. Types of Deadlock Reading Assignment T. M. Pinkston, Deadlock Characterization and Resolution in Interconnection Networks, Chapter 13 in Deadlock Resolution in Computer Integrated Systems, CRC Press 2004 Deadlock: Part

More information

A First Implementation of In-Transit Buffers on Myrinet GM Software Λ

A First Implementation of In-Transit Buffers on Myrinet GM Software Λ A First Implementation of In-Transit Buffers on Myrinet GM Software Λ S. Coll, J. Flich, M. P. Malumbres, P. López, J. Duato and F.J. Mora Universidad Politécnica de Valencia Camino de Vera, 14, 46071

More information

Multiprocessor Interconnection Networks

Multiprocessor Interconnection Networks Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical

More information

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock

More information

A Survey of Routing Techniques in Store-and-Forward and Wormhole Interconnects

A Survey of Routing Techniques in Store-and-Forward and Wormhole Interconnects SANDIA REPORT SAND2008-0068 Unlimited Release Printed January 2008 A Survey of Routing Techniques in Store-and-Forward and Wormhole Interconnects David M. Holman and David S. Lee Prepared by Sandia National

More information

Network on Chip Architecture: An Overview

Network on Chip Architecture: An Overview Network on Chip Architecture: An Overview Md Shahriar Shamim & Naseef Mansoor 12/5/2014 1 Overview Introduction Multi core chip Challenges Network on Chip Architecture Regular Topology Irregular Topology

More information

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip Nasibeh Teimouri

More information

NOC: Networks on Chip SoC Interconnection Structures

NOC: Networks on Chip SoC Interconnection Structures NOC: Networks on Chip SoC Interconnection Structures COE838: Systems-on-Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering

More information

Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201

Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201 Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201 Yoshiko Yasuda, Hiroaki Fujii, Hideya Akashi, Yasuhiro Inagami, Teruo Tanaka*,

More information

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Nauman Jalil, Adnan Qureshi, Furqan Khan, and Sohaib Ayyaz Qazi Abstract

More information

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: ABCs of Networks Starting Point: Send bits between 2 computers Queue

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

The Cray T3E Network:

The Cray T3E Network: The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus Steven L. Scott and Gregory M. Thorson Cray Research, Inc. {sls,gmt}@cray.com Abstract This paper describes the interconnection network

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 21 Routing Outline Routing Switch Design Flow Control Case Studies Routing Routing algorithm determines which of the possible paths are used as routes how

More information

Interprocessor Communication. Basics of Network Routing

Interprocessor Communication. Basics of Network Routing Interprocessor Communication There are two main differences between sequential computers and parallel computers -- multiple processors and the hardware to connect them together. That hardware is the most

More information

Interconnection Networks

Interconnection Networks Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact

More information

Deadlock-free Routing in InfiniBand TM through Destination Renaming Λ

Deadlock-free Routing in InfiniBand TM through Destination Renaming Λ Deadlock-free Routing in InfiniBand TM through Destination Renaming Λ P. López, J. Flich and J. Duato Dept. of Computing Engineering (DISCA) Universidad Politécnica de Valencia, Valencia, Spain plopez@gap.upv.es

More information

On Constructing the Minimum Orthogonal Convex Polygon in 2-D Faulty Meshes

On Constructing the Minimum Orthogonal Convex Polygon in 2-D Faulty Meshes On Constructing the Minimum Orthogonal Convex Polygon in 2-D Faulty Meshes Jie Wu Department of Computer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 E-mail: jie@cse.fau.edu

More information

Deadlock-Free Connection-Based Adaptive Routing with Dynamic Virtual Circuits

Deadlock-Free Connection-Based Adaptive Routing with Dynamic Virtual Circuits Computer Science Department Technical Report #TR050021 University of California, Los Angeles, June 2005 Deadlock-Free Connection-Based Adaptive Routing with Dynamic Virtual Circuits Yoshio Turner and Yuval

More information

Design and Evaluation of a Fault-Tolerant Adaptive Router for Parallel Computers

Design and Evaluation of a Fault-Tolerant Adaptive Router for Parallel Computers Design and Evaluation of a Fault-Tolerant Adaptive Router for Parallel Computers Tsutomu YOSHINAGA, Hiroyuki HOSOGOSHI, Masahiro SOWA Graduate School of Information Systems, University of Electro-Communications,

More information

A Literature Review of on-chip Network Design using an Agent-based Management Method

A Literature Review of on-chip Network Design using an Agent-based Management Method A Literature Review of on-chip Network Design using an Agent-based Management Method Mr. Kendaganna Swamy S Dr. Anand Jatti Dr. Uma B V Instrumentation Instrumentation Communication Bangalore, India Bangalore,

More information

Rajendra V. Boppana. Computer Science Division. for example, [23, 25] and the references therein) exploit the

Rajendra V. Boppana. Computer Science Division. for example, [23, 25] and the references therein) exploit the Fault-Tolerance with Multimodule Routers Suresh Chalasani ECE Department University of Wisconsin Madison, WI 53706-1691 suresh@ece.wisc.edu Rajendra V. Boppana Computer Science Division The Univ. of Texas

More information

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual

More information

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 18: Communication Models and Architectures: Interconnection Networks Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi

More information

The Adaptive Bubble Router 1

The Adaptive Bubble Router 1 The Adaptive Bubble Router 1 V. Puente, C. Izu y, R. Beivide, J.A. Gregorio, F. Vallejo and J.M. Prellezo Universidad de Cantabria, 395 Santander, Spain y University of Adelaide, SA 55 Australia The design

More information

MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs

MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs MMR: A High-Performance Multimedia Router - Architecture and Design Trade-Offs Jose Duato 1, Sudhakar Yalamanchili 2, M. Blanca Caminero 3, Damon Love 2, Francisco J. Quiles 3 Abstract This paper presents

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information