PERFORMANCE AND IMPLEMENTATION OF 4x4 SWITCHING NODES IN AN INTERCONNECTION NETWORK FOR PASM


Robert J. McMillen, George B. Adams III, and Howard Jay Siegel
School of Electrical Engineering, Purdue University
West Lafayette, IN

Abstract

Design issues for the multistage Generalized Cube network are discussed in this paper. The merits of 2-input/2-output interchange boxes versus 4-input/4-output crossbars for interconnection network implementation are analyzed. The cost and performance of the network for each of the two switching node alternatives are examined, and the suitability of each approach for VLSI implementation is discussed. It is shown that in a packet switching environment, 4x4 crossbars outperform, and are less expensive to implement than, the four interchange boxes they replace.

I. INTRODUCTION

The choice of interconnection network is a central issue in the design of large-scale, multimicroprocessor-based distributed and parallel systems. The Ballistic Missile Defense (BMD) Agency is designing a test bed for evaluating such systems as they may apply to BMD tasks [8]. PASM is a multimicroprocessor system being designed at Purdue University for a variety of image processing and pattern recognition problems [16]. In both cases a highly flexible network is needed for communication among processors and memories.

The Generalized Cube network has a cube-type topology and is constructed from 2-input/2-output crossbars, or interchange boxes [17]. A more general form of interchange box is an a-input/a-output (a x a) switching node. A relative of the Generalized Cube network can be constructed from a x a switching nodes using cube-type connections between stages. Many papers in the literature discuss using interchange boxes larger than 2x2 for implementing multistage cube-type networks [2, 7, 10, 11, 12, 18]. In the following, design options for 4x4 switching nodes are considered. The performance of two designs is evaluated, and their implementation in discrete logic (e.g., TTL) and VLSI is considered. It will be shown that a 4x4 crossbar performs better and costs less than four 2x2 crossbars in a packet switching environment.

The logical structure of the Generalized Cube network is defined in Section II to provide a framework for discussing modifications. In Section III, the performance of two network implementations is compared. Implementation considerations are presented in Section IV. For further details of all this material see [14].

This work was supported by the Ballistic Missile Defense Agency under grant number DASG60-80-C-0022 and the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under AFOSR. The United States Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The views, opinions, and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy, or decision, unless so designated by other official documentation.

/81/0000/0229$ IEEE

II. DEFINITIONS

A partitionable SIMD/MIMD system is a parallel processing system which can be structured as one or more independent SIMD and/or MIMD machines [4] of varying sizes. PASM is a partitionable SIMD/MIMD system for image processing and pattern recognition [16]. The BMD test bed should have the flexibility to perform as a partitionable SIMD/MIMD machine. The cube network described here can function efficiently in such an environment.

The Generalized Cube network (Fig. 1) is a multistage cube-type network topology which was introduced in [17]. It has been shown that this topology is equivalent to that used by the omega [7], indirect binary n-cube [11], STARAN [1], and SW-banyan (F=S=2) [6] networks [17, 20]. An N input/output Generalized Cube topology has $m = \log_2 N$ stages, where each stage consists of a set of N lines connected to N/2 interchange boxes. Each interchange box is a 2-input/2-output device. The labels of the input/output (I/O) lines entering the upper and lower inputs of an interchange box are used as the labels for the upper and lower outputs, respectively. The labels are the integers from 0 to N-1. Each interchange box can be set to one of four states as shown in Fig. 1. The connections in this network are based on the cube interconnection functions [13]. Stage i of the Generalized Cube topology pairs I/O lines that differ only in the i-th bit position. The name cube network will be used to refer to the network consisting of the Generalized Cube topology and four-state interchange boxes. Each interchange box will be controlled independently through the use of routing tags [7, 15].

Figure 1: (a) Generalized Cube topology for N=8. (b) Four states of an interchange box.
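The stage pairing and tag-controlled routing of the cube network described in Section II can be illustrated with a minimal Python sketch. The function names (`cube`, `stage_pairs`, `route`) are hypothetical, and the sketch assumes that stages are traversed from m-1 down to 0, that the upper line of each box is the one with a 0 in bit i, and that a tag bit of 0 selects the upper output; it ignores conflicts and the other box states.

```python
def cube(i: int, line: int) -> int:
    """Cube interconnection function: complement bit i of a line label."""
    return line ^ (1 << i)


def stage_pairs(n_lines: int, i: int) -> list[tuple[int, int]]:
    """(upper, lower) line labels sharing an interchange box at stage i.

    Assumption: the upper line is the one whose bit i is 0.
    """
    return [(x, cube(i, x)) for x in range(n_lines) if not (x >> i) & 1]


def route(source: int, dest: int, m: int) -> list[int]:
    """Line labels visited when a destination tag steers one message.

    Assumption: stages are visited in the order m-1, ..., 1, 0 and the
    box at stage i looks only at bit i of the destination address.
    """
    line = source
    path = [line]
    for i in range(m - 1, -1, -1):
        bit = (dest >> i) & 1
        # Tag bit 0 selects the upper output (bit i = 0), 1 the lower.
        line = (line & ~(1 << i)) | (bit << i)
        path.append(line)
    return path


if __name__ == "__main__":
    print(stage_pairs(8, 2))  # [(0, 4), (1, 5), (2, 6), (3, 7)]
    print(route(5, 2, 3))     # [5, 1, 3, 2]
```

For N=8, the message entering on line 5 with destination 2 leaves stage 2 on line 1, stage 1 on line 3, and stage 0 on line 2, so the path ends on the destination regardless of the source.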


It is assumed that processors and memories are paired to form processing elements (PEs). The network is configured such that PE i is connected to input i and output i, $0 \le i < N$.

The packet switching mode, in which packets move from stage to stage in the network as paths between stages become available, is assumed. Packets do not require that their entire path be established prior to entering the network. A packet consists of a routing tag and a number of data items. Packet switching in multistage networks has been discussed in [3, 19].

The primary goal here is to investigate the cost-effectiveness of constructing multistage cube networks from 4x4 crossbars versus 2x2 crossbars (interchange boxes). Since a single 2x2 interchange box is not functionally comparable to a 4x4 crossbar (i.e., it can only handle two items at a time instead of four), the 4x4 crossbar is compared with a 4x4 composition of four 2x2 interchange boxes. This configuration is called a composite node and is shown in Fig. 2. A network constructed from properly connected (to be specified later) composite nodes is identical to a cube network constructed from 2x2 interchange boxes. The external connections of the crossbar (Fig. 3) are identical to those of the composite node, so it can be directly substituted for a 4x4 composite node.

Many options for the implementation of 2x2 interchange boxes were discussed in [9]. To avoid repetition, one of the configurations discussed in that paper will be assumed here. It is assumed that packet switching is implemented and that an entire packet is transferred between adjacent stages during one network clock cycle. Furthermore, the size of each input queue in a switching node is assumed to be an integral multiple of the packet size. The packet size is thus not restricted to be any particular number of words.

III. PERFORMANCE ANALYSIS

The 4x4 crossbar node and composite node will be compared in their performance at both a local and global level.
On a local level blocking within a node is examined. On the global level, the permuting ability of two networks constructed from the respective 4x4 switching nodes is compared.

Consider the local level. Let level 1 of a composite node be the two interchange boxes connected to the inputs of the node and level 2 be the two interchange boxes connected to the outputs. The composite node can perform 16 permutation connections (each box either straight or exchange) and the crossbar node can perform all 4! possible permutation connections. For those permutations where there is no conflict in either node, the messages traverse the composite node in twice the time required by those in the crossbar node, due to the two levels of interchange boxes. When conflicts occur in the crossbar node, the delay due to waiting diminishes the speedup achieved.

Consider situations where there are conflicts in a switching node. For this analysis it is assumed that the destination of any message is a uniformly distributed random variable. Also, it is assumed that each message has only one destination (i.e., no broadcasting). Both the composite node and the 4x4 crossbar node have four inputs and four outputs, so there are $4^4 = 256$ distinct patterns in which messages may need to be routed through the boxes. Since the destinations are assumed to be random and uniformly distributed, the distinct routing patterns are all equally likely. Assuming four simultaneous inputs is somewhat of a worst case, since in MIMD mode this would be considered heavy loading and in SIMD mode destinations are not random but structured and chosen to avoid conflicts. The node is assumed initially empty.

Figure 2: A 4x4 composite node constructed from four 2x2 interchange boxes.

Figure 3: A 4x4 crossbar node.

Consider the 4x4 crossbar node. Let r be the maximum number of messages desiring any given output of the 4x4 crossbar node. The total time required for all four messages to pass through the node is r. $P(r=1) = 24/256$, $P(r=2) = 180/256$, $P(r=3) = 48/256$, and $P(r=4) = 4/256$. The expected time to pass all four messages through the crossbar node is given by:

$\sum_{i=1}^{4} i \, P(r=i) = 2.125$ network clock cycles.

That is, given that four messages arrive at an empty crossbar node simultaneously, on the average it will take 2.125 network clock cycles for the node to empty.

Now consider the composite node. The following notation will be used in the ensuing equations, where i = 1 or 2: $P(iU)$ = P(no conflict level i, upper box) = 1/2; $P(iL)$ = P(no conflict level i, lower box) = 1/2; $P(iX) = 1/2$, where X = U or L; and $P(i)$ = P(no conflict in level i) = 1/4.

Now consider the probabilities of different amounts of time, t, to pass four input messages through the composite node. The minimum time possible is 2 network clock cycles because there are two levels.

$P(t=2) = P(1U) \, P(1L) \, P(2U) \, P(2L) = 1/16$.

For a total time of 3 network clock cycles there are 5 cases to consider. First assume no conflicts occur in level 1.

$P(t=3, \text{case 1}) = P(1) \, (1-P(2)) = 3/16$.

Next, assume exactly one level 1 interchange box has a conflict.

$P(t=3, \text{case 2}) = [(1-P(1U)) \, P(1L) + P(1U) \, (1-P(1L))] \, P(2X) = 1/4$.

For case 3, there is one conflict at each level, but the maximum delay is 3 cycles.

$P(t=3, \text{case 3}) = [(1-P(1U)) \, P(1L) + P(1U) \, (1-P(1L))] \, (1-P(2X)) \, (1/2) \, P(2X) = 1/16$.

The first factor is the probability that exactly one box at level 1 has a conflict. The next factor is the probability that the first message from the level 1 box which had a conflict, call this message M, also has a conflict at level 2. The (1/2) is the probability that M will be chosen to pass through the level 2 box first.
The last factor is the probability that the two delayed messages do not conflict.

Case 4 assumes that there is a conflict in both level 1 boxes and that both level 2 boxes receive messages (this happens half the time there are two conflicts in level 1).

$P(t=3, \text{case 4}) = (1/2) \, (1-P(1U)) \, (1-P(1L)) = 1/8$.

Finally, assume conflict in both level 1 boxes but only one level 2 box receives messages and there is no conflict for either pair that passes through:

$P(t=3, \text{case 5}) = (1/2) \, (1-P(1U)) \, (1-P(1L)) \, P(2X) \, P(2X) = 1/32$.

The probability that all messages pass through the composite node in 3 network clock cycles is

$P(t=3) = 3/16 + 1/4 + 1/16 + 1/8 + 1/32 = 21/32$.

For a time of 4, there are four cases to consider. The first case is where there is one conflict at each level. There are two ways to obtain a time of 4 from this situation: (1) the delayed message enters a non-empty queue in level 2 and (2) the delayed message enters an empty queue but conflicts with the other remaining message:

$P(t=4, \text{case 1}) = [(1-P(1U)) \, P(1L) + P(1U) \, (1-P(1L))] \, [(1/2) \, (1-P(2X)) + (1/2) \, (1-P(2X)) \, (1-P(2X))] = 3/16$.

Now assume conflict in both level 1 boxes and that only one level 2 box receives messages (this happens half the time there are two conflicts in level 1). Given this occurs, there are three ways (cases 2, 3, and 4) a time of 4 occurs. In case 2, the first two messages reaching the box in level 2 conflict, but there are no subsequent conflicts:

$P(t=4, \text{case 2}) = (1/2) \, (1-P(1U)) \, (1-P(1L)) \, (1-P(2X)) \, P(2X) = 1/32$.

In case 3, the first pair of messages do not conflict but the second pair do:

$P(t=4, \text{case 3}) = (1/2) \, (1-P(1U)) \, (1-P(1L)) \, P(2X) \, (1-P(2X)) = 1/32$.

In case 4, the first and second pair of messages conflict. When the second pair conflicts, one queue will contain two messages. For a time of 4 the queue with two items must be selected to resolve the second conflict and a third conflict must not occur.

$P(t=4, \text{case 4}) = (1/2) \, (1-P(1U)) \, (1-P(1L)) \, (1-P(2X)) \, (1-P(2X)) \, (1/2) \, P(2X) = 1/128$.
The probability of a time of 4 is:

$P(t=4) = 3/16 + 1/32 + 1/32 + 1/128 = 33/128$.

A time of 5 occurs when either of the two conditions of case 4 for a time of 4 is not met.

$P(t=5) = (1/2) \, (1-P(1U)) \, (1-P(1L)) \, (1-P(2X)) \, [(1/2) \, (1-P(2X)) + (1/2) \, (1-P(2X)) \, (1-P(2X))] = 3/128$.

The expected time for all four messages to pass through the composite node is:

$\sum_{t=2}^{5} t \, P(t) = 415/128 \approx 3.24$ network clock cycles.

This time is 53% longer than the 2.125 network clock cycles expected with the crossbar node.

Consider the global level. To construct a network from m/2 stages of N/4 4x4 switching nodes, assume all connection lines in the network are labeled in base 4 and that the stages are numbered (m/2)-1, ..., 1, 0 (from input to output). At stage i, the four input lines to a node are those that differ only in the i-th position of their base 4 representation. The line with a 0 in the i-th position connects to the top input, 2 to the next input, 1 to the next input, and 3 to the bottom input. The output lines of the 4x4 switching nodes have the same labels as the input lines, but in increasing order, i.e., the top output line label has a 0 in the i-th position, next 1, next 2, and the bottom 3. When composite nodes are used, making connections in the above manner creates a cube network. When crossbar nodes are used, a network is created whose capabilities are a superset of those of the cube network.

A composite node network consists of Nm/2 interchange boxes, allowing $2^{Nm/2}$ permutations. Assuming m is even, a 4x4 crossbar node network consists of Nm/8 nodes, permitting $(4!)^{Nm/8}$ permutations. If m is odd and one stage is constructed from 4x4 crossbar nodes limited to act as a 2x2 ...
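The local-level figures in this section can be cross-checked with a short Python sketch. It enumerates all 256 equally likely destination patterns for the crossbar node and takes the composite node probabilities directly from the case analysis in the text rather than simulating the queues; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def crossbar_time_distribution(ports: int = 4) -> dict[int, Fraction]:
    """Exact distribution of r, the largest number of messages wanting
    the same output, over all ports**ports destination patterns."""
    counts: Counter[int] = Counter()
    for dests in product(range(ports), repeat=ports):
        counts[max(Counter(dests).values())] += 1
    total = ports ** ports
    return {r: Fraction(c, total) for r, c in sorted(counts.items())}


def expected(dist: dict[int, Fraction]) -> Fraction:
    """Expected value of a discrete distribution {value: probability}."""
    return sum((t * p for t, p in dist.items()), Fraction(0))


crossbar = crossbar_time_distribution()

# Composite node: probabilities as stated in the text. P(t=3) is taken
# as the sum of the five case probabilities listed for a time of 3.
composite = {
    2: Fraction(1, 16),
    3: sum([Fraction(3, 16), Fraction(1, 4), Fraction(1, 16),
            Fraction(1, 8), Fraction(1, 32)], Fraction(0)),
    4: Fraction(33, 128),
    5: Fraction(3, 128),
}

if __name__ == "__main__":
    print(float(expected(crossbar)))   # 2.125
    print(float(expected(composite)))  # 3.2421875
    print(float(expected(composite) / expected(crossbar)))
```

The composite probabilities sum to 1, which is a useful consistency check on the case analysis, and the ratio of the two expected times is 415/272, about 1.53, in line with the 53% figure quoted above.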


IV. IMPLEMENTATION

To control the network, the destination tags defined in [7] are used. Let the destination address D be represented in binary as $d_{m-1} \ldots d_1 d_0$. A switching node in stage i examines bits $d_{2i+1}$ and $d_{2i}$. For the composite node, the first level interchange boxes examine only bit $d_{2i+1}$ and the second level interchange boxes examine only bit $d_{2i}$. If the bit examined is 0, the upper output link of the interchange box is selected and if the bit is 1, the lower link is selected. For the crossbar node, both bits are examined simultaneously. Together they are considered a single base four digit which corresponds to one of the outputs labeled 0 through 3.

To add a broadcast capability, an m-bit broadcast mask is appended [15]. Let the mask B be represented in binary as $b_{m-1} \ldots b_1 b_0$. A switching node in stage i now examines $b_{2i+1}$, $b_{2i}$, $d_{2i+1}$, and $d_{2i}$. For the composite node, first level interchange boxes examine bits with index 2i+1 and second level boxes examine bits with index 2i. If the broadcast mask bit is 0, the destination tag bit is interpreted as before. If the mask bit is 1, the destination bit is ignored and both output links of the interchange box are selected. For the crossbar node the four bits are all examined simultaneously. They are interpreted so as to establish the same connections as those that would be obtained in the composite node. Five kinds of broadcasts are defined for either type of 4x4 switching node.

Hardware Without Broadcast Capability

For simplicity, designs for the composite node and the crossbar node initially will be developed assuming no broadcast capability. Then, those portions of the designs affected by inclusion of a broadcast capability will be modified and compared. In the following analysis, hardware complexity is measured in terms of logic gate count and chip count. The gate counts are used as a first approximation to compare VLSI implementations. Designs using this technology must also consider wiring complexity [5].
The chip counts are used to compare discrete logic (e.g., TTL) implementations, assuming standard gate-per-chip packaging.

Examining Figs. 2 and 3, the first difference noted is that the crossbar node requires half as many queues as the composite node. Depending on the actual queue size, a considerable savings in logic may be realized in the implementation of the crossbar node. To compare multiplexer requirements, typical implementations of 2-to-1 and 4-to-1 multiplexers were examined [14]. Eight 2-to-1 multiplexers require 20% more gates (regardless of path width) than four 4-to-1 multiplexers. The chip counts are equal. Since the number of external connections for data and control lines is the same for both designs, any buffering/signal conditioning logic will be comparable. In a VLSI design, this implies identical pin counts.

Thus far the crossbar node appears to be the better choice. It is, however, decidedly more complicated to arbitrate the requests of four packets simultaneously (as opposed to two) while assuring each packet equal access to each output link on the average. To determine whether one 4x4 control unit is actually more complex than four 2x2 control units, the functional components of the control units are considered. The control unit of a 2x2 interchange box contains two sets of queue control logic, input request arbitration (IRA) logic, output request arbitration (ORA) logic, and timing. The control unit for a 4x4 crossbar node contains four sets of queue control logic. The remaining components are the functional equivalents of those for the 2x2 interchange box.

The most obvious difference between the two designs is that four 2x2 control units contain twice as many sets of queue control logic as one 4x4 control unit. One set of queue control logic contains two registers which store pointers, one to the front and one to the back of its associated queue. If the queue is Q words long, $\log_2 Q$ bits are required for each register. The IRA logic is quite simple.
If a request is made for the i-th input (i = 0, 1 for the 2x2; i = 0, 1, 2, 3 for the 4x4), it will be granted if the i-th queue is not full. Once again, four 2x2 control units require twice as much IRA logic as one 4x4 control unit. The timing logic is identical in both cases. Three clock phases are generated. A request/grant/transfer protocol is implemented (see [9]). None of the logic discussed thus far is affected by the inclusion of a broadcast capability. Thus, its analysis is equally applicable to the next subsection, which includes broadcast capabilities.

The most important and by far the most complex component of the control unit is the ORA logic. It is responsible for examining the routing tag bits and generating signals to set the multiplexers and make requests. It must also examine the grant signals and generate control signals for the "increment front pointer" input of each set of queue control logic. The complexity of this logic arises from arbitrating conflicting requests for access to the output ports.

To compare the ORA logic, equations are derived for all its output signals as a function of the tag bits and grant signals [14]. The total (NAND) gate count for four sets of 2x2 control unit logic is 104 gates. This corresponds to 24 chips. The control unit for the 4x4 crossbar node requires 124 gates, a 19% increase in the number of gates. In a discrete logic design, the chip count is 32. This is a 33% increase over the 24 chips required in the composite node.

The excess in ORA logic can be compensated for, since a 4x4 crossbar node requires half the queue control and IRA logic of a 4x4 composite node. From the equations derived, 20 extra gates or eight extra chips are required for the 4x4 crossbar ORA logic. The crossbar node saves four of the eight sets of queue control and IRA logic found in a composite node, so provided that one such set requires more than 5 gates or 2 chips, the 4x4 crossbar node is actually less expensive to build. Despite the higher wiring complexity of


the 4x4 crossbar node, the total design effort is comparable to that required by the 4x4 composite node.

Hardware With Broadcast Capability

Adding a broadcast capability requires the ORA logic to examine the broadcast mask bits in addition to the routing tag bits. The revised equations for the 2x2 control unit require 39 gates, which multiplied by 4 is 156. This is equivalent to 48 chips. A broadcasting capability thus costs 52 gates or 24 chips beyond the requirements for a 4x4 composite node without it. More details can be found in [14].

The circuitry needed to add the same broadcast capability to 4x4 crossbar nodes as was added to the composite nodes requires 233 gates, a 49% increase over the 156 required for the composite node. The chip count is 74, a 54% increase over 48. In this case it is likely that one of the eight sets of queue control and IRA logic will require more than 20 gates or 7 chips. If not, the savings in queue gates will compensate for the difference. Again the crossbar node is less expensive than a composite node where both have the same broadcast capability.

V. CONCLUSIONS

At a local level, the crossbar node is always faster at passing four messages that arrive simultaneously than the composite node. If the connection requests do not conflict in the composite node, the crossbar node is twice as fast. When the connection requests of the messages form a permutation which the composite node cannot pass without conflict, it takes 3 times longer for all messages to exit the composite node. Assuming each message chooses each output with equal probability, on the average it takes approximately 53% more time for all messages to pass through the composite node than through the crossbar node.

The ORA logic is the only logic requiring more hardware in a crossbar node than in a composite node. Otherwise, a crossbar node requires half as much queue control and IRA logic, and half as many queues.
The multiplexer logic is less than or comparable to that needed by the composite node. The net result is that when packet switching is implemented, the 4x4 crossbar node requires less hardware and significantly outperforms a composite node.

If circuit switching is implemented, no queues or their associated control logic are required. In this case, the crossbar node does contain more hardware. However, it offers a significant improvement in connectivity/permuting ability. If the switching nodes are implemented as VLSI chips, since both nodes require the same number of pins, the gate/pin ratio is improved with a crossbar implementation. Only in the case where circuit switching is implemented in discrete logic is further consideration required. Without a broadcast capability (which is less important in a circuit switching environment), there is only a small difference in the chip count.

In summary, implementations of cube-type networks using 2x2 and 4x4 crossbars were compared. It was shown that for packet switching the 4x4 crossbar is a more cost-effective approach.

REFERENCES

[1] K. Batcher, "The flip network in STARAN," 1976 Int. Conf. Parallel Processing, pp , Aug
[2] L. Ciminiera, A. Serra, "Modular interconnection networks with asynchronous control," 14th Hawaii Int. Conf. System Sciences, pp , Jan
[3] D. Dias, J. Jump, "Packet communication in multistage shuffle-exchange networks," 1980 Int. Conf. Parallel Processing, pp , Aug
[4] M. Flynn, "Very high-speed computing systems," Proc. IEEE, Vol. 54, pp , Dec
[5] M. Franklin, "VLSI performance comparison of banyan and crossbar communications networks," Workshop on Interconnection Networks for Parallel and Distributed Processing, pp , Apr
[6] G. Goke, G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," 1st Symp. Comp. Arch., pp , Dec
[7] D. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comp., Vol. C-24, pp , Dec
[8] W. McDonald, J. Williams, "The advanced data processing test bed," Compsac, pp , Mar
[9] R. J. McMillen, H. J. Siegel, "The hybrid cube network," Distributed Data Acquisition, Computing and Control Symp., pp , Dec
[10] J. Patel, "Processor-memory interconnections for multiprocessors," 6th Symp. Comp. Arch., pp , Apr
[11] M. Pease, "The indirect binary n-cube microprocessor array," IEEE Trans. Comp., Vol. C-26, pp , May
[12] U. Premkumar, et al., "Design and implementation of the banyan interconnection network in TRAC," NCC, pp , June
[13] H. J. Siegel, "A model of SIMD machines and a comparison of various interconnection networks," IEEE Trans. Comp., Vol. C-28, pp , Dec
[14] H. J. Siegel, et al., Parallel/Distributed Multimicroprocessor Systems for Ballistic Missile Defense, Purdue, EE School, TR-EE 81-12, June
[15] H. J. Siegel, R. J. McMillen, "The cube network as a distributed processing test bed switch," 2nd Int. Conf. Distributed Computing Systems, pp , Apr
[16] H. J. Siegel, et al., "PASM: A partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Comp., to appear.
[17] H. J. Siegel, S. D. Smith, "Study of multistage SIMD interconnection networks," 5th Symp. Comp. Arch., pp , Apr
[18] S. D. Smith, "LSI design considerations for multistage interconnection networks for parallel processing systems," 14th Hawaii Int. Conf. System Sciences, pp , Jan
[19] A. Tripathi, G. J. Lipovski, "Packet switching banyan networks," 6th Symp. Comp. Arch., pp , Apr
[20] C. Wu, T. Feng, "On a class of multistage interconnection networks," IEEE Trans. Comp., Vol. C-29, pp , Aug

FAULT LOCATION IN DISTRIBUTED CONTROL INTERCONNECTION NETWORKS Nathaniel J. Davis IV William Tsun-Yuk Hsu Howard Jay Siegel PASM Parallel Processing Laboratory School of Electrical Engineering Purdue University