A Hybrid Interconnection Network for Integrated Communication Services


Yi-long Chen
Northern Telecom, Inc., Richardson, TX 7583

Jyh-Charn Liu
Department of Computer Science, Texas A&M Univ., College Station, TX

Abstract

This paper presents an interconnection network architecture to support integrated communication services for multicomputer-based database and multimedia systems. Our study shows that existing wormhole routing networks are inefficient in transferring long files. We demonstrate the feasibility of integrating different network techniques based on virtual channels and flexible routing mechanisms.

1. Introduction

Parallel computing systems based on high-performance interconnection networks are being used in on-line multimedia and database applications. In a video server, for example, a large number of disks can be connected through an interconnection network to telecommunication ports which are linked to customers. A commercial example of such systems is Oracle's Media Server on the nCUBE supercomputer [1]. Unlike conventional scientific computing applications, such systems are communication and data intensive, and require efficient transmission of messages of heterogeneous types to support integrated service types.

Packet switching, circuit switching and virtual cut-through are the major switching mechanisms for interconnection networks [2, 3]. Packet switching transfers packets in a store-and-forward manner, so the network latency is proportional to the distance between the source and destination nodes. In circuit switching, a path from the source to the destination must first be established, and data need not be stored at the intermediate nodes during the transmission. Virtual cut-through improves on packet switching by not buffering messages that are able to proceed immediately on the next channel.
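The difference between these switching mechanisms can be illustrated with a first-order, contention-free latency model of the kind commonly used in surveys of the area. The sketch below is only an approximation under stated assumptions: the function and parameter names are hypothetical, all links have the same bandwidth, there is no blocking, and the circuit-switching model ignores the time for the acknowledgment to return to the source.

```python
def store_and_forward_latency(length, bandwidth, distance):
    """Packet switching: the whole packet is received at every node
    before it is forwarded, so latency grows with length * distance."""
    return (length / bandwidth) * (distance + 1)


def cut_through_latency(length, header_len, bandwidth, distance):
    """Virtual cut-through / wormhole (no blocking): only the header pays
    the per-hop cost; the rest of the message is pipelined behind it."""
    return (header_len / bandwidth) * distance + length / bandwidth


def circuit_switching_latency(length, probe_len, bandwidth, distance):
    """Circuit switching: a small probe sets up the path hop by hop,
    then the data flows without intermediate buffering.
    Assumption: acknowledgment return time is ignored."""
    return (probe_len / bandwidth) * distance + length / bandwidth


if __name__ == "__main__":
    # Hypothetical numbers: a 1000-unit message over 5 hops, 1 unit/cycle.
    print(store_and_forward_latency(1000, 1, 5))
    print(cut_through_latency(1000, 1, 1, 5))
    print(circuit_switching_latency(1000, 5, 1, 5))
```

Under this model the distance term multiplies the full message length only for store-and-forward, which is why the pipelined mechanisms are much less sensitive to distance when there is no contention.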
Among these techniques, wormhole routing, a special case of virtual cut-through, is the most widely adopted for current interconnection networks. In wormhole routing, a message is broken into small flits, and the data flits follow the header flit toward the destination in a pipelined fashion. Messages can be routed either deterministically or dynamically. A deterministic routing algorithm routes messages along fixed paths which are independent of current network conditions. Adaptive routing algorithms allow alternative paths to be used in message routing, but may need additional resources to preserve deadlock and livelock freedom. A common approach to avoiding deadlock and increasing the sharing of network resources is to use virtual channels [4, 5, 6]. Virtual channels are time-multiplexed over physical channels, with bandwidth allocated to each virtual channel as needed.

Self-routed, wormhole-switched networks are commonly used in number-crunching applications to handle short messages. Such networks, however, may suffer performance degradation in the transfer of long files [3]. Worse yet, when short and long messages share the network, delay-sensitive short messages may suffer from starvation and long messages may not be evenly routed over all available paths. Although adaptive routing schemes can reduce the network delay when most messages are short, the performance of these algorithms in a heterogeneous message environment has not been fully discussed. The circuit-switched mechanism is generally most effective for long messages, but may become quite inefficient in the transfer of short messages due to the overhead of establishing routing paths.

In this paper, we present an interconnection network architecture to support heterogeneous messages for integrated communication services. For simplicity, we assume that a service mainly requires either short or long messages.
Our scheme integrates self-routed wormhole routing and circuit-switched techniques based on virtual channels. We divide virtual channels into short-message and long-message channels to serve messages of different types. The short-message channels are used to support interactive data users and system management functions, while the long-message channels are designed especially for the transfer of bulk files such as video and image files. Each directed physical transmission link can have one or more short-message and long-message channels which are operated on a time-sharing basis. The short messages are routed deterministically for low overhead, while the routing paths of long messages are determined globally through the exchange of control messages to optimize the distribution of network traffic.

We compare our scheme with a representative deterministic routing scheme, the e-cube [7], and an adaptive routing scheme called star-channel [5] in a hypercube network. The e-cube scheme is widely used in commercial systems, and the star-channel is shown to have the best performance among the existing adaptive routing schemes. Simulation results indicate that the existing wormhole routing mechanism is not best suited to networks with heterogeneous messages, and that the proposed architecture can effectively and efficiently transmit both short and long messages.

2. The System Architecture

We assume that each node has a local processor, a router for communication, and a dedicated physical communication link in each direction. Each physical link is time-multiplexed between a short-message channel and a long-message channel. Messages shorter than a threshold L are routed by a deterministic, self-routed mechanism such as e-cube, while messages longer than L are routed by a pipelined circuit switching mechanism. (The idea of using pipelined circuit switching can also be found in [8] for fault-tolerant routing.) Messages are assumed to be divided into fixed-length flits for transmission, as defined in wormhole routing. A short message may consist of a header flit, additional address flits if necessary, and data flits, and its routing is controlled by the header flit. A long message consists of only data flits but needs one or more control messages (short messages) to establish the routing path. A global channel allocation algorithm, which is implemented by a control message exchange protocol, is used to optimize the routing paths for long messages.
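A minimal sketch of time-multiplexing one physical link between a short-message channel and a long-message channel follows. It is an illustration under stated assumptions rather than the router design itself: slot ownership is assumed to alternate strictly between the two types, a channel may send only when it has a flit queued and the receiver has buffer space for that type, and a slot whose owner cannot send is given to the other channel. The names `pick_channel`, `run_link`, `out_buf` and `in_free` are hypothetical.

```python
from collections import deque

SHORT, LONG = "short", "long"


def eligible(ch, out_buf, in_free):
    """A channel may use the link only if its output buffer is not empty
    and the matching input buffer on the receiving side is not full."""
    return len(out_buf[ch]) > 0 and in_free[ch] > 0


def pick_channel(turn, out_buf, in_free):
    """Prefer the channel that owns this slot; otherwise let the other
    channel use the slot so that no bandwidth is wasted."""
    other = LONG if turn == SHORT else SHORT
    for ch in (turn, other):
        if eligible(ch, out_buf, in_free):
            return ch
    return None


def run_link(out_buf, in_free, slots):
    """Simulate `slots` link cycles; return the channel used in each slot.
    Assumption: the receiver does not drain its buffers during the run."""
    sent = []
    turn = SHORT
    for _ in range(slots):
        ch = pick_channel(turn, out_buf, in_free)
        if ch is not None:
            out_buf[ch].popleft()   # one flit crosses the link
            in_free[ch] -= 1        # and occupies one receiver slot
        sent.append(ch)
        turn = LONG if turn == SHORT else SHORT
    return sent


if __name__ == "__main__":
    buffers = {SHORT: deque([1, 2]), LONG: deque([10, 11, 12, 13])}
    free = {SHORT: 8, LONG: 8}
    print(run_link(buffers, free, 6))
```

Once the short-message buffer empties, the long-message channel takes every remaining slot, which is the sense in which one channel can use the full link bandwidth while the other is idle.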
We use the pipelined routing mechanism to illustrate our model, where pipelining of the flits of a message is done asynchronously using low-level handshaking signals [3]. In this architecture, each short-message and long-message channel has its own flit buffer, routing control mechanism, and data paths to the physical links. The bandwidth of a physical link is dynamically shared by its short-message and long-message channels, such that one channel can use the full bandwidth when the other is idle. At each router, time slots are switched between the short-message and long-message transmission types. The sending side multiplexes data from the short-message and long-message buffers over the physical link. Only a channel whose output buffer is not empty and whose input buffer is not full may use the physical link. The receiving side is responsible for placing each received message in the corresponding message buffer based on the message type. The decision on which message will be transmitted in the current time slot is based on the current transmission type and the buffer states on both the sending and receiving sides.

3. Global Message Routing

In this section, we discuss a global routing scheme that establishes shortest paths for long messages based on the depth-first search method. To avoid deadlock and livelock of the control messages, a partial ordering relation on nodes has to be defined, which serves as the basis for routing control messages and determining routing paths [7, 9]. In this relation, some neighbors of a node are called its ancestors and others its descendants, such that the control messages for probing routing paths can only be routed from ancestor to descendant. We define the partial ordering relation based on the notion of broadcast addresses, which are relative addresses with respect to a source-destination pair.
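For a hypercube, the ancestor-descendant relation and the depth-first probing it supports can be sketched as follows. This is a single-process illustration, not the distributed control-message protocol: it assumes node addresses are integers, takes the broadcast address of a node to be the XOR of its address with the source address, and treats a neighbor as a descendant when it corrects one more of the dimensions in which source and destination differ. The callback `is_idle(node, dim)`, which reports whether the long-message channel leaving `node` on dimension `dim` is free, is hypothetical, as are the function names.

```python
def broadcast_address(b_s, b_x):
    """Relative address of node x with respect to source s (bitwise XOR)."""
    return b_s ^ b_x


def on_shortest_path(b_s, b_d, b_x):
    """True if x differs from s only in dimensions where s and d differ."""
    return ((b_s ^ b_x) & ~(b_s ^ b_d)) == 0


def is_ancestor(b_s, b_d, b_x, b_y):
    """True if x is an ancestor of its neighbor y for the pair (s, d)."""
    diff = b_x ^ b_y
    if diff == 0 or diff & (diff - 1):      # x and y must differ in one bit
        return False
    if not (on_shortest_path(b_s, b_d, b_x) and on_shortest_path(b_s, b_d, b_y)):
        return False
    # Bit l of x's broadcast address must be 0 and that of y must be 1.
    return (broadcast_address(b_s, b_x) & diff) == 0 and \
           (broadcast_address(b_s, b_y) & diff) != 0


def probe_path(b_s, b_d, n, is_idle):
    """Depth-first probe from s to d in an n-cube over idle channels.
    Returns the list of nodes on the path found, or None."""
    path = [b_s]
    d_out = {b_s: -1}                       # last dimension tried per node
    while path:
        node = path[-1]
        if node == b_d:
            return path
        remaining = node ^ b_d              # dimensions still to correct
        nxt = None
        for dim in range(d_out[node] + 1, n):
            if (remaining >> dim) & 1 and is_idle(node, dim):
                nxt = dim                   # smallest idle dimension > d_out
                break
        if nxt is None:
            path.pop()                      # backtrack toward the ancestor
            continue
        d_out[node] = nxt
        child = node ^ (1 << nxt)
        d_out[child] = -1                   # fresh probe arriving at child
        path.append(child)
    return None


if __name__ == "__main__":
    busy = {(1, 1), (1, 2)}                 # hypothetical busy channels
    print(probe_path(0b000, 0b111, 3, lambda node, dim: (node, dim) not in busy))
```

Because every forward step corrects one more differing dimension and dimensions at a node are tried in increasing order, the search can only move from ancestors to descendants or backtrack, so it always terminates.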
In an n-dimensional hypercube network, for example, a node $N_x$ is represented by its n-bit binary address $B_x$. Let $N_s$ and $N_d$ denote the source and the destination nodes of a long message, and assume that $B_s$ and $B_d$ differ at $k$ bit positions (dimensions), $\{l_1, l_2, \ldots, l_k\}$. Since a shortest path from $N_s$ to $N_d$ consists of channels only on these dimensions, $N_x$ is in the ancestor-descendant relationship related to $N_s$ and $N_d$ if and only if $B_x$ and $B_s$ (or $B_d$) are the same at all bit positions except for those in $\{l_1, l_2, \ldots, l_k\}$. Let $B^{sd}_x$ denote the broadcast address of $N_x$; then $B^{sd}_x = B_s \oplus B_x$, where $\oplus$ is the XOR operation. Let $N_x$ and $N_y$ be two adjacent nodes whose addresses differ at bit position $l$; then $N_x$ is an ancestor of $N_y$ with respect to $N_s$ and $N_d$ if (1) both $N_x$ and $N_y$ are in the ancestor-descendant relationship related to $N_s$ and $N_d$, and (2) the $l$th bit value of $B^{sd}_x$ is less than that of $B^{sd}_y$.

To send a long message, the source node first initiates a probing control message to establish a routing path. Probing messages are routed based on the ancestor-descendant relation, which can be identified from the source-destination information stored in the messages. When a node receives a probing message from one of its ancestors, it can only send the message to one of its descendants based on the depth-first rule. To keep track of the paths that have been probed, each intermediate node maintains two variables, $d_{in}$ and $d_{out}$, which store the dimensions the probing message is received from and sent to, respectively. $d_{out}$ is updated whenever a probing message is sent out, and the idle long-message channel at the smallest dimension greater than $d_{out}$ is chosen as the next channel to route the probing message. If no such long-message channel is available, there is no path available through this node; backtracking then occurs after the node sends an unsuccessful-acknowledgment control message along $d_{in}$. This simple depth-first rule guarantees deadlock freedom.

A long-message path is found when the destination receives a probing control message. The destination then sends a success acknowledging message (a short message) back to the source along the path traversed by the probing control message, to indicate that a routing path has been found. Each intermediate node allocates the long-message channel after it receives the acknowledging message, and the path is completely allocated when the source receives the acknowledging message. For simplicity, concurrent requests competing for the same long-message channels are assumed to be resolved by an FCFS discipline, and the allocation is aborted if the acknowledging message fails in the competition. That is, if a node finds that the long-message channel requested by a message has already been allocated to another message, it stops relaying the acknowledging message and instead generates an aborting message to abort the channel allocation related to the message. The back-off strategies used in the CSMA/CD protocol may be applicable here. To avoid the starvation problem, messages can be assigned higher priority if they have been blocked for a long time, so that an acknowledging message with lower priority cannot allocate a long-message channel which is also requested by a higher-priority message. To avoid excessive contention between these messages, we can restrict the maximum number of pending long messages in a node.

Neither deadlock nor livelock can happen in the proposed scheme, because short messages are routed deterministically and a waiting long message does not hold any channels until all channels on the routing path are allocated. Livelock of control messages cannot happen since dimensions are tried at each intermediate node in a fixed order. Since only one control message is in transmission for each long message at a time, only a small number of control messages are needed if network contention is moderate.

4. Performance Evaluation and Discussion

In this section, we compare the proposed scheme with e-cube routing and the star-channel algorithm through a simulation study. The e-cube is a dimension-ordered deterministic algorithm in which the dimensions that a message needs to correct to reach its destination have to be chosen in an increasing or decreasing order. The star-channel scheme needs four virtual channels per bidirectional link for hypercubes. In this scheme, the two virtual channels in one directed link are assigned to be the star and nonstar channels. The header of a message can use nonstar channels arbitrarily, but can use only the star channel whose dimension is the most significant of the dimensions that the message has to correct. To be fair in comparison, we implement these three schemes using four virtual channels. The classic e-cube implementation only uses two virtual channels, but it has been shown that using an extra pair of virtual channels can greatly increase the throughput of e-cube [4].

We simulate the time-step operations at the unit (flit) level in a 1-dimensional hypercube. The network performance is evaluated by the average communication latency of messages, which is defined as the average elapsed time from when the messages are injected into the network at their source nodes until the whole messages reach their destinations. Message latency is measured in link cycles, where during each link cycle a unit of a message can be sent over a unidirectional link. We assume that long and short messages can be dynamically generated at any node, following a Poisson distribution with average generation rates of $\lambda_l$ and $\lambda_s$, respectively. The lengths of these two types of messages are normally distributed with averages of $L_l$ and $L_s$ units, respectively, and the length of control messages is assumed to be 5 units. We use the offered link utilization, $U$, to describe the system workload.
Since the total network traffic is $2^n \lambda_l h_l L_l + 2^n \lambda_s h_s L_s$, where $n$ is the hypercube dimension and $h_l$ and $h_s$ are the average transmission distances of long and short messages, respectively, and the total number of links is $n 2^n$, we compute $U$ as

$$U = \frac{2^n \lambda_l h_l L_l + 2^n \lambda_s h_s L_s}{n 2^n}.$$

Let $\lambda_s = K \lambda_l$; then $\lambda_s$ can be described as

$$\lambda_s = \frac{K U n 2^n}{2^n h_l L_l + K 2^n h_s L_s}.$$

The effect of the message length on network performance is the major concern of this simulation study. Two traffic patterns are used to describe different traffic characteristics. For the random pattern, we choose the uniform distribution, under which each node has an equal probability of becoming the destination of a source. A commonly used nonuniform traffic pattern is the fixed permutation, in which a permutation is defined in advance and applied to generate the destination address from the source address. We simulate the following permutations:

Complement: source $x_{n-1} x_{n-2} \cdots x_1 x_0 \Rightarrow$ destination $\bar{x}_{n-1} \bar{x}_{n-2} \cdots \bar{x}_1 \bar{x}_0$;
Transpose: source $x_{n-1} x_{n-2} \cdots x_1 x_0 \Rightarrow$ destination $x_{n/2-1} \cdots x_0 x_{n-1} \cdots x_{n/2}$,

where $\bar{x}_i$ is the complement of $x_i$, and $n$ is assumed to be even.

We first compare the different schemes under various traffic patterns and message length distributions with a buffer size of 2. Figure 1 plots the communication latency versus link utilization under the uniform traffic pattern, where the average lengths of long and short messages are 1 and 2 ($\lambda_s / \lambda_l = 5$). In the proposed scheme, the back-off time is 2 cycles and the maximum number of pending long messages is 1. It is shown that e-cube can only sustain about 2% of link utilization. The first observation on the star-channel routing scheme is that the short message latency increases sharply when the link utilization reaches 35%, while the long message latency remains stable. This is because under light traffic, short messages can effectively detour around long messages due to the adaptive routing capability. However, the possibility of short messages being blocked by long messages increases rapidly with the increase of network traffic. The performance of short messages suffers if they are blocked for a long time. The proposed scheme can support effective transmission of short messages even under high network loads. It is noted that the proposed scheme is slightly worse than the star-channel method on long messages when the system is lightly loaded. This is because in the proposed scheme long messages may be affected by the short messages that use the same physical links, due to the deterministic nature of short message routing. When the network traffic is moderate, the interference between long messages becomes the dominant factor in long message performance. The proposed scheme can distribute long messages more evenly over the network and thus has better performance.

The performance of the different routing schemes under the complement traffic pattern is depicted in Figure 2, assuming the same system environment as above. Similar performance trends are observed, in which the network soon saturates for short messages with e-cube and the star-channel routing when the network load is increased. The proposed routing performs stably and better than the others in this case. For the transpose traffic pattern, all the schemes can only sustain less than 1% link utilization. Our results conform with those obtained in [5], namely that the star-channel method outperforms the others in this pattern. The proposed scheme performs as poorly as e-cube because it uses e-cube for short messages.
However, the adaptive routing method also fails to reach higher utilization.

The performance effect of the message length is further illustrated in Figure 3, where the average length of long messages is increased to 2 units. The performance of e-cube is not shown in this figure since it saturates even under the 5% load. It can be seen that the performance of e-cube and the star-channel routing scheme is more sensitive to the lengths of long messages. The proposed scheme also performs better than the others for long messages in this case because it reduces the contention among long messages more effectively.

We also study the performance impact of various system parameters for the proposed scheme. When using different back-off times from to 1, only a minor performance difference is observed due to the low control message overhead, so a small back-off time is suggested for a lightly or moderately loaded network. It is also noticed that control messages only consume about .6% to 1% of the bandwidth used by data messages. The buffer size affects the performance significantly in all the schemes compared. The network performance is upper bounded at 4% of link utilization in the proposed scheme when a single buffer is used. It is also noticed that when the buffer size is larger than the average length of short messages, further increasing the buffer size yields only a minor performance improvement. This might be because a short message can then reside completely at a single node, so its residence time depends only on the status of the next node. Since the comparison results of the different schemes under various buffer sizes are consistent with what we have demonstrated above, we omit them in this article.

Figure 1. Communication latency versus link utilization under uniform traffic pattern. (Plot label: average long message latency.)

Figure 2. Communication latency versus link utilization under complement traffic pattern. (Plot label: average long message latency.)

Figure 3. Communication latency versus link utilization under the uniform traffic pattern.

5. Conclusion

In this paper, we discussed an interconnection network architecture for communication-intensive applications. We proposed an alternative approach to using virtual channels that is suited to applications with integrated communication service types, and demonstrated the necessity and the feasibility of integrating different network technologies.

References

[1] R. Buck, "The Oracle media server for nCUBE massively parallel systems," Proc. of the 8th Int'l Parallel Processing Symp., April.
[2] S.A. Felperin, L. Gravano, G.D. Pifarre, and J.L.C. Sanz, "Routing techniques for massively parallel communications," Proceedings of the IEEE, vol. 79, April.
[3] L. Ni and P. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, no. 2, Feb.
[4] P.T. Gaughan and S. Yalamanchili, "Adaptive routing protocols for hypercube interconnection networks," IEEE Computer, vol. 26, no. 5, May.
[5] G. Pifarre, L. Gravano, S. Felperin, and J. Sanz, "Fully adaptive minimal deadlock-free packet routing in hypercubes, meshes, and other networks: Algorithms and simulations," IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 3.
[6] C. Glass and L. Ni, "The turn model for adaptive routing," Proc. of the 19th Annual Int'l Symposium on Computer Architecture, May.
[7] W.J. Dally and C.L. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Trans. on Computers, vol. 36, May.
[8] P.T. Gaughan and S. Yalamanchili, "A family of fault-tolerant routing protocols for direct multiprocessor networks," IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 5, July.
[9] Y.-L. Chen and J.-C. Liu, "A fault-tolerant distributed subcube management scheme for hypercube multicomputers," IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 7, July.
