Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity

Similar documents
Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH

Supporting Fully Adaptive Routing in InfiniBand Networks

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks

Message Transport With The User Datagram Protocol

1 Surprises in high dimensions

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

EFFICIENT ON-LINE TESTING METHOD FOR A FLOATING-POINT ADDER

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization

Lecture 1 September 4, 2013

Optimal Oblivious Path Selection on the Mesh

Performance Modelling of Necklace Hypercubes

Loop Scheduling and Partitions for Hiding Memory Latencies

Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

State Indexed Policy Search by Dynamic Programming. Abstract. 1. Introduction. 2. System parameterization. Charles DuHadway

Non-homogeneous Generalization in Privacy Preserving Data Publishing

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Learning convex bodies is hard

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Secure Network Coding for Distributed Secret Sharing with Low Communication Cost

Online Appendix to: Generalizing Database Forensics

Improving Spatial Reuse of IEEE Based Ad Hoc Networks

Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation

Yet Another Parallel Hypothesis Search for Inverse Entailment Hiroyuki Nishiyama and Hayato Ohwada Faculty of Sci. and Tech. Tokyo University of Scien

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks

A Classification of 3R Orthogonal Manipulators by the Topology of their Workspace

Coupling the User Interfaces of a Multiuser Program

Considering bounds for approximation of 2 M to 3 N

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Skyline Community Search in Multi-valued Networks

Kinematic Analysis of a Family of 3R Manipulators

MODULE VII. Emerging Technologies

All-to-all Broadcast for Vehicular Networks Based on Coded Slotted ALOHA

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions.

Learning Subproblem Complexities in Distributed Branch and Bound

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique

AnyTraffic Labeled Routing

Adjacency Matrix Based Full-Text Indexing Models

Computer Organization

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Analysis of half-space range search using the k-d search skip list. Here we analyse the expected time for half-space

On the Placement of Internet Taps in Wireless Neighborhood Networks

BIJECTIONS FOR PLANAR MAPS WITH BOUNDARIES

Switch Fabrics. Switching Technology S P. Raatikainen Switching Technology / 2006.

Characterizing Decoding Robustness under Parametric Channel Uncertainty

Study of Network Optimization Method Based on ACL

Optimal Distributed P2P Streaming under Node Degree Bounds

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks

On the Role of Multiply Sectioned Bayesian Networks to Cooperative Multiagent Systems

Baring it all to Software: The Raw Machine

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 4, APRIL

Dual Arm Robot Research Report

Lab work #8. Congestion control

Chalmers Publication Library

Uninformed search methods

Switch Fabrics. Switching Technology S Recursive factoring of a strict-sense non-blocking network

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem

Chapter 9 Memory Management

Depth Sizing of Surface Breaking Flaw on Its Open Side by Short Path of Diffraction Technique

Divide-and-Conquer Algorithms

Computer Organization

Crossbar - example. Crossbar. Crossbar. Combination: Time-space switching. Simple space-division switch Crosspoints can be turned on or off

PERFECT ONE-ERROR-CORRECTING CODES ON ITERATED COMPLETE GRAPHS: ENCODING AND DECODING FOR THE SF LABELING

2-connected graphs with small 2-connected dominating sets

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

Comparison of Methods for Increasing the Performance of a DUA Computation

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

6.823 Computer System Architecture. Problem Set #3 Spring 2002

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Additional Divide and Conquer Algorithms. Skipping from chapter 4: Quicksort Binary Search Binary Tree Traversal Matrix Multiplication

Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

A quasi-nonblocking self-routing network which routes packets in log 2 N time.

PAPER. 1. Introduction

A Formal Model and Efficient Traversal Algorithm for Generating Testbenches for Verification of IEEE Standard Floating Point Division

Virtual Circuit Blocking Probabilities in an ATM Banyan Network with b b Switching Elements

Ad-Hoc Networks Beyond Unit Disk Graphs

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks

Scalable Deterministic Scheduling for WDM Slot Switching Xhaul with Zero-Jitter

Control of Scalable Wet SMA Actuator Arrays

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Exploring Context with Deep Structured models for Semantic Segmentation

Learning Polynomial Functions. by Feature Construction

CS 106 Winter 2016 Craig S. Kaplan. Module 01 Processing Recap. Topics

Threshold Based Data Aggregation Algorithm To Detect Rainfall Induced Landslides

The Reconstruction of Graphs. Dhananjay P. Mehendale Sir Parashurambhau College, Tilak Road, Pune , India. Abstract

Probabilistic Medium Access Control for. Full-Duplex Networks with Half-Duplex Clients

Improving Performance of Sparse Matrix-Vector Multiplication

A New Search Algorithm for Solving Symmetric Traveling Salesman Problem Based on Gravity

Verifying performance-based design objectives using assemblybased vulnerability

6 Gradient Descent. 6.1 Functions

Bends, Jogs, And Wiggles for Railroad Tracks and Vehicle Guide Ways

Provisioning Virtualized Cloud Services in IP/MPLS-over-EON Networks

Transcription:

IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 1 Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity Te H. Szymanski, Member, IEEE Computer Society Abstract Principles for esigning practical self-routing nonblocking N N circuit-switche connection networks with optimal q(n log N) harware at the bit-level of complexity are escribe. The overall principles behin the architecture can be escribe as Expan-Route-Contract. A self-routing nonblocking network with w-bit wie atapaths can be achieve by expaning the atapaths to w + z inepenent bit-serial connections, routing these connections through self-routing networks with blocking, an by contracting the ata at the output an recovering the w-bit wie atapaths. For an appropriate reunancy z, the blocking probability can be mae arbitrarily small an the fault tolerance arbitrarily high. By using efficient space omain concentrators, the architecture yiels self-routing nonblocking switching networks with an optimal O(N log N) bits of memory or O(N log N log log log N) logic gates. By using a linear-cost time omain concentrator, the architecture yiels self-routing nonblocking switching networks with an optimal q(n log N) bits of memory or logic gates. These esigns meet Shannon s lower boun on memory requirements, establishe in the 1950s. The number of stages of crossbars can match the theoretical minimum, which has not been achieve by previous self-routing networks. The architecture is feasible with existing electrical or optical technologies. The esigns of electrical an optical switch cores with Terabits of bisection banwith for Networks-of-Workstations (NOWs) are escribe. Inex Terms Multistage, networks, self-routing, nonblocking, circuit-switching, scalable, ranomization, electrical, optical. 1 INTRODUCTION A N N N nonblocking Connection Network is a circuit-switche network capable of achieving any of the N! permutations of its N input ports onto its N output ports [35]. Such networks are often use for ATM switching or multiprocessor communication. The harware cost is efine as the number of logic gates or bits of memory require in its construction. The epth is efine as the number of logic gates along the longest path between an input port an an output port. A self-routing network is one in which a circuit-switche connection can be establishe by the harware as it propagates forwar through the network, with reliance only on the local information available at each noe; there is no nee for any off-line path precomputation. The setup time is efine as the propagation elay of all logic gates traverse in the establishment of a connection between some input port an some output port. This paper presents a esign for practical an scalable connection networks, i.e., nonblocking switching networks which easily an efficiently scale to large sizes. This problem is historically significant an it is important in the esign of Gigabit an Terabit optical networks [36]. Using recently evelope free-space optical technologies [2], [12], [18], [21], complex electronic switching noes can be implemente on an Opto-Electronic Integrate Circuit (OEIC) The author is with the Department of Electrical Engineering, McGill University, Montreal, PQ, Canaa H3A 2A7. E-mail: tes@macs.ee.mcgill.ca. Manuscript receive 9 Aug. 1993; revise 10 June 1996 an 17 May 1997. For information on obtaining reprints of this article, please sen e-mail to: tc@computer.org, an reference IEEECS Log Number 105193. with optical I/O. Base upon inustry projections [2], [18], [28], within a ecae, OEICs containing millions of electronic logic gates an thousans of optical I/Os will be feasible. Each OEIC can implement one stage of an optical multistage network. The optical output of one stage can be fe into the next stage through free-space, where the permutation between stages can be implemente optically. The appeal of these networks is their very high bisection banwiths (in the Terabit per secon range) an the simplicity of their construction, since all interstage wires are implemente optically. It is important to minimize the number of stages in an optical network to minimize cost an maximize reliability. It is also important to avoi complicate backtracking control algorithms within the network, which are infeasible to achieve optically. Perhaps surprisingly, the new optical technologies are highlighting the nee for goo solutions to historic networking problems, such as the esign of scalable switching networks with fast self-routing control algorithms. (A complete set of graphs which allows a reaer to esign nonblocking networks is presente in Section 2.) OEICs with hunres of binary circuit-switching noes have been evelope in many ifferent smart pixel technologies, i.e., [8], [9], [12], [21], [34], [39]. Researchers at the former AT&T have emonstrate a circuit-switche optical multistage network with 60K optical channels which use off-line routing [8], [9], [12]. In spite of the consierable inustrial interest in optical multistage networks, to ate, there oes not exist an efficient scalable nonblocking circuit-switching network architecture with fast self-routing algorithms an with a harware complexity which is asymptotically optimal. 0018-9340/97/$10.00 1997 IEEE J:\PRODUCTION\TC\3-FINAL\09- regularpap KSM 27,136 08/11/97 2:11 PM 1 /

2 IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 A recent report sponsore by the U.S. National Science Founation (NSF) entitle Research Priorities in Networking an Communications [36] efine 15 key research priorities for the next ecae. These priorities inclue Dynamic Network Control, i.e., the nee for fast routing algorithms to control the Gigabit an Terabit networks of the future, an Switching Systems, i.e., the nee for optimally scalable connection networks. Accoring to the NSF report, Realization of these goals require new avances in switching system theory an esign. To ate, there are no practical architectures for nonblocking multipoint virtual circuit switches that can meet the theoretical limits on optimal scaling with respect to all the characteristics of practical concern (switching network complexity, routing memory requirements) an most systems now being use have poor scaling properties [36]. The Expan-Route-Contract (ERC) network architecture propose in this paper represents one step towar these goals. The ERC network can meet Shannon s asymptotic lower boun on the harware complexity of selfrouting nonblocking networks (see next paragraph), can scale optimally to Gigabit an Terabit banwiths associate with optical technologies an allows for simple an very fast network control algorithms which are provably immune to congestion. In the 1950s, Claue Shannon establishe a lower boun on the harware complexity of self-routing nonblocking circuit-switching networks which are provably immune to congestion (equivalently, they never exhibit blocking given any permutation traffic pattern) [29]. Accoring to Shannon s complexity argument, an optimal self-routing circuitswitche connection network woul require q(n log N) harware, 1 which inclues all bits of memory [29], all logic gates, an all crosspoints, an woul have a epth of O(log N) binary noes. To ate, no known self-routing nonblocking circuit-switching networks with explicit constructions meet Shannon s lower bouns. The famous AKS sorting network [1] meets Shannon s lower bouns in the limit of infinitely large sizes N, but it relies on linear cost concentrators which lack explicit constructions, i.e., it cannot be built in practice. The MultiBenes network propose by Avora, Leighton, an Maggs [3] also meets Shannon s lower bouns, but it also relies on expaner graphs which lack explicit constructions, relies on complex backtracking control algorithms, an it requires an AKS sorting network to acknowlege calls. These two networks are primarily of theoretical significance, an establishe that Shannon s lower bouns on the cost an epth of self-routing connection networks can be met in theory, but not in practice. We point out that a store-an-forwar packetswitche network, where packets are buffere in each stage, coul not possibly meet Shannon s lower boun on cost. Each packet requires at least O(log N) bits to ientify its estination, an, if packets are buffere in every stage, then the harware complexity of the network is at least q(n log 2 N) bits of memory, which is suboptimal by a factor of O(log N). In aition, packet-switche networks are unesirable since they are slow [27]. A pipeline circuit-switche network with fast connection establishment can eliver a 1. All logarithms are to the base 2 unless otherwise inicate. permutation of packets from its input sie to its output sie in roughly the same amount of time a packet-switche network requires to move packets forwar one stage. For these reasons, pipeline circuit switching an the similar worm-hole routing technique have largely eliminate packet switching in recent multicomputer networks [27]. (Packet-switche networks using ranomize routing are escribe in [23], [37], [38].) In practice, many self-routing permutation networks are base on bit-serial circuit-switche versions of Batcher's sorting network [6], with q(n log 2 N) binary noes, q(log 2 N) epth, an q(log 2 N) setup time. These complexities are expresse at the bit-level, an inclue all crosspoints, all bits of internal memory, an all logic gates, where every logic gate is assume to have boune fan-in an boune fan-out. There have been some innovative switch esigns over the years. Douglass has propose a rearrangeable network with O(N log 2.5 N) harware an O(log 2.5 N) setup time [11]. Chien an Oruc propose permutation networks with O(N log N log log N) bit cost an with O(log 3 N) bit elay [7]. Using a numerical analysis, De Biase et al. propose permutation networks with O(N log 2+e N) bit cost (where e > 0) an O(log N) bit elay [10]. The complexity of various networks is illustrate in Table 1. However, the iscrepancies between the best-known theoretical results an practical results are evient in Table 1. In this paper, we propose an architecture for self-routing nonblocking, circuit-switche connection networks, which we call the Expan-Route-Contract (ERC) architecture. An overview of the architecture is shown in Fig. 1. The nonblocking ERC architecture is base on the concept of expaning the incoming ata, routing the bits through inepenent bit-serial networks which exhibit blocking, an compacting the ata at the output. The architecture is also base on probabilistic schemes an ranomization, rather than on eterministic schemes. The expansion can be accomplishe in at least two ways: 1) A w-bit wie ata wor can be encoe with a Forwar Error Correcting Coe to yiel a w + z bit wie wor or 2) The w-bit wie ata wor is submitte to an expaner creating a w + z bit wie wor. After the expansion, the w + z bits of ata are route through w + z inepenent bit-serial self-routing circuitswitche networks, which we call bit-planes. Each selfrouting bit-plane attempts to establish bit-serial circuitswitche connections, an each bit-plane is allowe to have an arbitrary blocking probability. (The bit-serial connections can also be bit-parallel). The expansion z is a esign parameter which is chosen so that the probability that a w-bit wie connection is establishe is sufficiently large. Given any level of blocking in the bit-planes, it is always possible to pick the expansion z so that the mean-time-betweenblocking of a w-bit wie connection is an arbitrarily large amount of time, for example 10 50 years. It is important to recognize the strength of this probabilistic approach: There are only about 10 15 years left in the life of our universe, an, hence, these networks can be esigne so that the meantime-between-blocking can excee the remaining life of our universe. J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 2 / 13

SZYMANSKI: DESIGN PRINCIPLES FOR PRACTICAL SELF-ROUTING NONBLOCKING SWITCHING NETWORKS 3 TABLE 1 ASYMPTOTIC COMPLEXITIES OF VARIOUS NONBLOCKING CIRCUIT-SWITCHED NETWORKS Fig. 1. Example of the propose switch architecture, base on expansion, routing, an contraction. In orer to keep the overall cost at q(n log N), the selfrouting bit-planes must have q(n log N) bit complexity, an the blocking probability of any bit-plane must not approach one, so that the require expansion (base on z) is boune by a constant factor. It has recently been establishe that self-routing ilate banyans can be esigne to have q(n log N) bit complexity an a blocking probability (enote pb) which approaches zero as N approaches infinity [33]. In this paper, it is shown that the ilate banyans with low blocking probabilities can be use in the propose Expan-Route-Contract architecture to yiel harwareefficient nonblocking switches with q(n log N) bit complexity (where the nonblocking property is base on probabilistic arguments). These nonblocking switches can meet Shannon s lower boun first establishe in the 1950s on the harware complexity of self-routing nonblocking switches. (We note that the ERC architecture can yiel a nonblocking network using any self-routing bit-planes, as long as the blocking probability in the bit-planes is less than one. We also point out that the bit-serial connections can be replace by bit-parallel connections.) In practice, each bit of high-spee memory has an equivalent cost of nearly 10 logic gates. Hence, minimization of memory requirements is often more important than minimization of logic gates [29]. The propose ERC network can be esigne with asymptotically optimal memory requirements. In summary, the propose ERC connection network architecture has straightforwar explicit constructions, uses very simple an fast routing algorithms which are easily implemente in harware, an has very fast setup times when compare to other known networks. This paper is organize as follows. Section 2 presents a brief review of multipath elta networks, an erives some upper bouns on the blocking in multipath elta networks. Section 3 escribes the principles behin the ERC architecture. Section 4 iscusses SDM an TDM constructions of multipath elta networks an erives asymptotic complexities. Section 5 escribes the application of the theory to the esign of electrical an optical networks, an Section 6 contains concluing remarks. 2 BLOCKING IN MULTIPATH CIRCUIT-SWITCHED DELTA NETWORKS Delta networks are banyan networks with the self-routing property [26]. A -ilate elta network [4], [19], [31], [32], [33] can be obtaine from a regular banyan by increasing the capacity of each ege to hanle up to connections simultaneously, an by replacing all the crossbar switches with ilate crossbar switches. Each logical input port to a ilate crossbar can receive up to connections simultaneously, an each logical output port of a ilate crossbar can transmit up to connections simultaneously. A two-ilate elta network is shown in Fig. 2a. The theorems in this paper will apply to ilate elta networks, an the more general multipath elta networks (see next paragraph) have comparable performance. A p-path elta network can been efine as a multipath elta network with the following property: In every stage where a routing ecision must be mae, there exist p alternate paths which lea to a given estination [32]. Therefore, in every stage, a p-path elta network has at least p suitable alternate paths which a connection coul take while moving towar its estination. A two-path elta network is shown in Fig. 2b. J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 3 / 13

4 IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 2.1 Blocking Probability in a Multipath Delta Network THEOREM 2. Given a ranom uniform traffic moel, the conitional blocking probability in a self-routing -ilate b n b n elta network with h connection requests source at each logical input port is upper boune by pb n eh F H G I K J e - h. Fig. 2. (a) Dilate elta network built with two-ilate 2 2 crossbar switches. (b) Multipath elta network built with two-ilate 2 2 crossbar switches. In this section, simple proofs are propose which establish that 1) q(n log log N)connections will survive when route through a p-path N N elta network with a path multiplicity of p = q(log log N), an 2) that the blocking probability pb of an iniviual connection approaches zero as N approaches infinity, given a loaing of q(log log N) connections per I/O port. (Some looser bouns were presente in [33].) Reaers who are not intereste in the mathematical proofs may procee irectly to Section 2.5, where the numeric results are iscusse, without a loss of continuity. Consier a circuit-switche -ilate b n b n elta network, with N b n logical input ports an N logical output ports. (Note: A logical port has a capacity to support connections.) Let there be h connection requests applie at each logical network input port (h ). Connection requests are ranomly istribute over the logical output ports. Connection requests flow from the input sie to the output sie. Whenever + 1 or more requests attempt to exit a logical output port in stage i, requests are selecte at ranom an propagate forwar an the remainer are blocke. Let the ranom variable N in enote the number of requests entering the network, let N out enote the number of requests leaving the network, an let N blocke enote the number of connection requests blocke within the network (each variable assumes a value given a specific state of the network). Define the acceptance probability as pa E[N out ]/e[n in ] an the blocking probability as pb E[N blocke ]/E[N in ], where the expectation is taken over all states of the network. It follows that pb = 1 - pa. (These probabilities are conitional on the fact that a request exists initially). Theorem 1 yiels a concise upper boun on the blocking in a ilate crossbar switch an is state without proof. The proof uses Valiant s version of Chernoff s boun [38] after application of Hoeffing s result [13], to yiel a close form expression on the tail of a binomial istribution. THEOREM 1. Given a ranom uniform traffic moel, the conitional blocking probability pb in a -ilate N N crossbar, with h connection requests source at each logical input port is upper boune by pb eh F H G I K J e - h. PROOF. Suppose that all logical output ports in all stages are assigne unique labels L i,j for 0 i n, 0 j N - 1, where N b n. A path through a network is efine as a sequence of ilate eges (i.e., eges with a capacity of connections). A connection request (for a circuit-switche connection) follows a particular path to its estination as it is route through the network; if it encounters a saturate ege, it blocks, otherwise, it survives. Define B i,j as the number of connection requests that block at L i,j in a given state. By efinition, the en-to-en conitional blocking probability is given by n N-1 Â Â, E Bls ENblocke s= 1 l= 0 pb =. EN Nh in Consier a specific topology, the omega-inverse network, in this section (the result applies to all topologically equivalent elta networks). Due to the ranom uniform traffic moel, the paths must be evenly istribute over the output ports in the last stage of the network. Assuming that there was no blocking in stages 1 to n - 1, then the entire network can be viewe as a -ilate N N crossbar (where N = b n ), where the blocking occurs only at the output ports. By applying the upper boun from Theorem 1, it follows that the expecte number of requests which block at the output ports of the last stage is given by E L NM N-1 Â l= 0 B ln, L NM O eh QP F H G I K J e -h O QP bnhg. The first n - 1 stages of the network can be viewe as two smaller N/2 N/2 ilate banyans. By repeate application of the above argument on the rest of the network, the expecte number of blocke requests in the -ilate b n b n elta network is upper boune as follows; L O n N-1 eh -h E ÂÂBls, n e Nh NM s= l= QP F H G I K J b g. 1 0 Therefore, the conitional blocking probability of the entire network is upper boune as follows; F eh h pb n e H G I K J -. Theorem 2 bouns the blocking probability in a multipath elta network given a ranom uniform traffic moel. 2.2 Worst-Case Traffic Immunity an Ranomize J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 4 / 13

SZYMANSKI: DESIGN PRINCIPLES FOR PRACTICAL SELF-ROUTING NONBLOCKING SWITCHING NETWORKS 5 Routing In the worst case, -ilate N N elta network can establish only O( N ) connections simultaneously. For example, in an eight-ilate 64K 64K banyan, the worst-case traffic pattern only allows about three percent of all connections to pass through. To ensure immunity to worst-case traffic, it is sufficient to ranomize the traffic first, through the aition of another elta network [23], [38]. The concatenation of two ilate elta networks (one for ranomization an one for routing, with the innermost stages merge) yiels a class of general topologies which coul be calle Tanem Dilate Banyan networks, which inclue the Dilate Benes topology as one example. A more general network obtaine from the concatenation of two multipath elta networks yiels a class of topologies which coul be calle Tanem Multipath Delta networks, which inclues the MultiBenes topology [3] as one example. Every connection request picks a ranom estination at the output of the ranomization network an then attempts to establish a connection to that estination. Any given traffic pattern, incluing a worst-case pattern, is transforme into a ranom traffic pattern in the ranomization network [20], [23], [38]. Requests then attempt to establish connections to their original estinations through the routing network. This approach eliminates congestion ue to worst-case traffic patterns, as will be shown. (Note: In practice, the ranomizer can be operate in various manners; see Section 2.4.) To ate, no researchers have manage to rigorously erive a boun on the blocking probability of self-routing circuit-switching networks using ranomize routing. Leighton aresses the problem of eriving a rigorous proof in his textbook. The connection requests which have survive through a ranomizer are not ranomly istribute over its output links: Their positions are correlate an it is not known how to boun the blocking given correlate traffic moels. The ifficulty of hanling correlate traffic moels an the unsolve nature of the problem is escribe in [20]. In this section, we present an alternative approach to boun the blocking in ilate elta networks using ranomize routing. A key point of the proof is to note the istinction between paths an surviving connection requests. At any stage, a surviving connection request is essentially the front-en of a pipeline circuit-switche connection, or, equivalently, the front-en of a worm-hole route connection. The surviving connection requests exiting the ranomizer are not ranomly istribute over its output ports, as observe by Leighton. However, the paths of all connection requests must be ranomly istribute over the output ports, since the path estinations are selecte at ranom. Hence, if we assume all connection requests are surviving connection requests at any given stage, we can exploit the fact of ranom path estinations an thereby upper boun the blocking at the given stage. We may then exploit the symmetry between the ranomizing network an the routing network to boun the blocking in the routing network. Since the loaing at each en is eterministic an ientical (h paths per logical port) an the loaing istribution at the mile is ientical, then the pattern of paths is symmetric about the mile. We may then establish that the upper boun from Theorem 2 is vali in each network; this is formalize in the next theorem. THEOREM 3. Given the concatenation of two -ilate N N banyan networks, with h < connection requests at each input port, which are ranomly an uniformly istribute over the outputs of the first ilate banyan, such that each output port of the secon ilate banyan is the estination of precisely h connection requests, the expecte number of blocke requests in the secon ilate banyan is upper boune as follows: L M N O P Q L N M M 2n N-1 n N-1 eh -h E Bls, E Bls, n e Nh MÂ Â ÂÂ s= n+ l= P = s= l= P F H G I K J 1 0 1 0 pb n eh O P Q Æ F H G I K J e -h. b g PROOF. Follows by symmetry from the proof of Theorem 2, noting that the expectation is upper boune by consiering paths which have ranom estinations, rather than surviving connection requests which have correlate estinations, an observing that the set of paths over which the expectation is taken in the routing network is symmetric with the set of paths in the ranomization network, an observing that the irection of flow of connections is irrelevant. Theorem 3 establishes an upper boun on the blocking in the routing network given any worst-case traffic pattern, an is necessary to establish the existence of self routing nonblocking ERC networks which are immune to worstcase traffic patterns in Section 3. 2.3 Asymptotic Performance We now consier the blocking in a ilate elta network as the network size scales towar infinity. THEOREM 4. Given the concatenation of two ilate banyans, one acting as a ranomization network an one acting as a routing network, then pb Æ 0 as N Æ through the appropriate choice of h an. PROOF. Let the ilation be = K Èlog log N for constant K 1 an the number of traffic sources at each input port h be such that (eh/) 1/2, then K log log N 1 -K log logn 2e lim pb lim n e lim 2 næ F H G I K J F H G næ næ n n K+ c I = KJ (since c > 0 an n = log N). Theorem 4 establishes that pb Æ 0 as N Æ for the concatenation of two self-routing ilate Delta networks. These networks can be mae to have an arbitrarily low pb an immunity to worst-case traffic patterns by the appropriate choice of h an. (We note that even faster convergence to zero can be obtaining by selecting a wier ilation = K Èlog N, in which case, pb lim NÆ n/n K+c = 0, although, in practice, this larger ilation is not necessary. Wier ilations are useful in an asymptotic analysis where all N connections in a permutation o not block simultaneously.) 0 J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 5 / 13

6 IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 Fig. 3. Blocking probability (bit-pb) versus stages for various ilate banyans: (a) fixe ilation = 2, half-loae, (b) fixe ilation = 4, half-loae, (c) fixe ilation = 8, half-loae, () ilation = loa = O(log log N). Blocking probability approaches zero when ilation grows slowly. Bol ot represents esign example iscusse in Section 5. 2.4 Variations In practice, a number of techniques can be use to either reuce the cost or lower the blocking probability. Blocking in the ranomization network can be eliminate by having the crossbar switches always propagate connection requests forwar in pseuoranom irections. Each switch in the ranomizer can select a nonblocking state at ranom; the incoming connections are permute an always propagate forwar. The cost of the ranomizer can also be reuce by using a one-ilate crossbar rather than -ilate crossbars. This approach eliminates blocking in the ranomizer an reuces the cost of the ranomizer. Alternatively, in practice, the blocking probability can be reuce by employing eflection routing in the ranomization network, i.e., see [9]. The ranomization network attempts to route requests to their intene estinations. If a request encounters a congeste link in stage i, it is eflecte an propagate out over the wrong link. After exiting the ranomization network, some requests will arrive at their estinations, while others will arrive at incorrect estinations, an all requests are launche into the routing network. In the routing network, the requests which were eflecte in the ranomization network will have another chance to be route to their estination. The eflection routing algorithm results in a lower en-to-en blocking probability than preicte by Theorems 2 an 4, but is more complex to implement electronically. 2.5 Numeric Results In this section, the blocking probability of a ilate banyan base routing network is plotte against various parameters. Exact analytic moels for the blocking probability of ilate banyans uner ranom uniform traffic have been publishe in [19], [32]. The reaer is referre to those papers for etails. The blocking probability of a ilate banyan can be reuce by operating at lower loas, i.e., by separating an active input port which supports connections by one or more ile input ports. This approach was also use by Avora, Leighton, an Maggs to lower the loa in the J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 6 / 13

SZYMANSKI: DESIGN PRINCIPLES FOR PRACTICAL SELF-ROUTING NONBLOCKING SWITCHING NETWORKS 7 MultiBenes network [3]. To lower the loa in our system, we assume that each input port which has the capacity to source up to connection requests actually sources fewer requests, i.e., for a half-loa, each ilate input port sources h = /2 connection requests. The blocking probabilities of various ilate banyans, with various ilations an, at half loa (h = /2), are plotte against the number of stages in Figs. 3a, 3b, an 3c. (Blocking in the last stage is eliminate in our traffic moel, since at most h connections arrive at any one logical output port). Given a fixe ilation, the pb will approach one as N Æ, as inicate by Theorem 2, although it oes so very slowly. (In all figures, the bol ots represent an eight-ilate banyan which will be use in the optical esign in Section 5.) Accoring to Theorem 4, for pb to approach 0 as N Æ, the ilation an loaing must grow slowly with N, i.e., = Èlog log N an h = O(). The blocking probabilities of various ilate banyans which meet the conitions of Theorem 4, with a ilation of È4 log log N an at four ifferent loaings (h/ = 0.125, 0.25, 0.375, an 0.5), are plotte in Fig. 3. The number of stages is set to a very large value (300 stages), so that the asymptotic limits of the curves can be examine. As establishe in Theorem 4, asymptotically pb Æ 0 as N Æ. Hence, ilate banyans with fast selfrouting algorithms can be esigne to have arbitrarily small pbs as the network size scales to infinity. These results will be necessary in Section 3 to erive nonblocking ERC networks (base on a probabilistic approach) from networks with blocking. 3 THE EXPAND-ROUTE-CONTRACT NONBLOCKING SWITCH ARCHITECTURE In conventional circuit-switching connection networks, the connection atapaths are usually many bits wie, typically eight, 16, or 32 bits. Typically, all the bits in a connection atapath are switche together as an inivisible entity. If the circuit-switche connection blocks, then all bits in the connection block simultaneously. Similarly, if one bit in the atapath fails, then the entire connection fails. The propose ERC architecture relies on a funamentally ifferent approach. In orer to establish a w-bit wie circuit-switche connection in the ERC network, at the input sie, w + z inepenent bit-serial connection requests are inserte into the network (where w 1). Each bit-serial connection is route through a circuit-switche network calle a bit-plane, typically a one-bit wie ilate banyan with a finite blocking probability. At the output sie, all surviving bit-serial circuit-switche connections are contracte together, an a w-bit wie atapath is establishe if w or more bit-serial connections have survive. In principle, the bit-planes can be any self-routing bit-serial circuitswitching networks with blocking, incluing the conventional bit-serial banyan network. (In principle, we coul also use a Forwar-Error Correcting coe which can correct z bit-errors to expan a w-bit ata wor to w + z bits at the input sie, an the ecoer to contract w + z bits to a vali w-bit coe wor at the output sie. A blocke bit-serial connection appears as a consistent bit-error which can be correcte by the error-correcting coe.) Suppose that every N N bit-plane has a blocking probability enote bit-pb. The goal is to esign an N N switch with w-bit wie atapaths with a given blocking probability per connection (calle the connection-pb ), which can be arbitrarily small, given any level of blocking in the bit-planes. Typically, one may select the connectionpb to be 10-8, although other probabilities, such as 10-20, are easily esigne. (Blocking in networks is occasionally efine as the event that a permutation cannot be route in one pass. We assume the more practical measure in this paper, i.e., the event that a connection cannot be route. Uner the permutation moel, our ilation in Theorem 4 must be wier, i.e., O(log N), an the ERC network still yiels optimal memory bit-complexity. ) Given that we insert w + z bit-serial connection requests into w + z bit-planes, the probability that a w-bit wie ata path is establishe at each output port is given by the following. (In practice, we coul inject multiple bit-serial connections into the same bit-plane.) Let pb = bit - pb, an pa = 1 - bit - pb; an w 1. Therefore, the Pr[w-bit wie ata path establishe] = j= w F H I w+ z -pa j  e b g Â! w + z j w+ z-j j pa 1- pa ª K j= w pa. j In this section, efine the expansion to be the number of extra bits require. For example, an expansion of four an a atapath with of w implies that 4 + w bit-serial connection requests must be operate in parallel to achieve the specifie connection-pb. (If the bit-serial connections are replace by bit-parallel connections, the expansion also applies to bit-parallel connections.) The require expansion epens upon the blocking probability in the bit-plane an the atapath with. Wier atapaths require less expansion to achieve a given connection-pb. Fig. 4 applies for a atapath with of eight bits. Fig. 4a plots the expansion require to achieve a given connection-pb when the bit-pb is in the range of 0.001 to 0.01. Fig. 4b plots the expansion require to achieve a given connection-pb when the bit-pb is in the range of 0.01 to 0.1. Fig. 4c plots the expansion require to achieve a given connection-pb when the bit-pb is in the range of 0.1 to 0.5. The ERC architecture yiels nonblocking networks regarless of the blocking probability in the bit-planes. Bit-planes with more blocking simply require a larger expansion to achieve a given connection-pb. Figs. 3 an 4 supply sufficient ata so that a reaer can esign a self-routing nonblocking network of their choice. For example, to achieve a connection-pb of 10-8 given a bitpb of 0.0066, the expansion is four bits (see the ot on Fig. 4a). It is also possible that each connection is one bit wie. For this case, we set w = 1 an the expansion, or number of extra copies of the request, can be foun from the previous equation. A connection request is successful if at least one bit-serial request reaches the estination. J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 7 / 13

8 IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 Fig. 4. Contour map of expansion versus bit-pb to achieve a given connection-pb for a atapath with of eight bits: (a) bit-pb in the range 0.001-0.01, (b) bit-pb in the range 0.01-0.1, (c) bit-pb in the range 0.1-0.5. Bol ot correspons to example use in Section 5. 4 HARDWARE-EFFICIENT TDM AND SDM CONSTRUCTIONS In this section, the harware complexity of the ERC network using multipath elta networks is examine. 4.1 Space Division Constructions A -ilate k*k crossbar switch consists of k k-to- concentrators; each concentrator collects the requests with a istinct routing bit between 0... k - 1 an propagates of them forwar. Concentrator Construction 1: A simple concentrator esign calle a Daisy-Chain Concentrator is shown in Fig. 5. The concentrator controller consists of an array of k-by- control cells, where an iniviual control cell is shown in Fig. 5a. Each control cell requires four logic gates an rives an associate crosspoint cell, shown in Fig. 5b. Each vertical column is essentially a aisy-chain, which controls access to one output port. A busy signal travels own the aisy-chain an is asserte by the first active request. No other requests can claim the same output port once its busy signal is asserte. Other requests which encounter a busy signal in one aisy-chain column are forware to the next aisy-chain column to see if it is busy. An example state of a 6-to-4 concentrator is shown in Fig. 5c. The k-to- aisy-chain concentrator requires five gates per cell an k cells. For fixe k, the concentrator requires O( 2 ) logic gates an has a setup time of O() logic gates. Each of the k input ports requires O(log k) bits of memory to ientify the requeste logical output port. In a synchronous moe of operation, the egree k switch requires O(k log k) bits of memory. For fixe k, the switch requires O() bits of memory. THEOREM 5. The use of the aisy-chain concentrator in the ERC architecture, which satisfies the conitions of Theorem 4, yiels a self-routing nonblocking N N connection network with O(N log N) noes, O(N log N) bits of memory, O(N log N log log N) logic gates, an a epth of O(log N log log N) logic gate elays. (The proof follows irectly by substitution, where N is the number of istinct connections supporte by the network.) When the complexity is measure in terms of crossbar noes, the cost is an optimal O(N log N) noes. In terms of bits of memory, the complexity is an optimal O(N log N) bits, which meets Shannon s lower boun establishe in the 1950s. Hence, the switch scales optimally accoring to these important practical metrics. In terms of logic gates, the complexity is a slightly suboptimal O(N log N log log N) logic gates, an the epth is a slightly suboptimal O(N log N log log N) logic gate elays. As integrate circuits will soon support millions of gates, an as logic gate imensions an elays will continue to shrink, these logic gate metrics have iminishing importance in practice. Furthermore, the grow rate in the term O(log log N) is so slow as to be negligible in practice. The simplicity an regular VLSI J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 8 / 13

SZYMANSKI: DESIGN PRINCIPLES FOR PRACTICAL SELF-ROUTING NONBLOCKING SWITCHING NETWORKS 9 Fig. 5. (a) Daisy-chain concentrator control cell with four logic gates, (b) ata plane crosspoint with tristate river gate, (c) 6-to-4 aisy-chain concentrator with 6 4 array of concentrator cells. Three requests are route to the first three output columns. (Dashe cells are not neee.) layout of this concentrator circuit make it useful in practical esigns (see Section 5). (The epth of this construction can be improve an will be reporte elsewhere.) Concentrator Construction 2: For sufficiently large ilations, the logic gate complexity of the -ilate k k crossbar can be improve. The crossbar can be constructe with O(k log k) harware an O(log k) epth by using self-routing concentrators with logarithmic epth, as shown in Fig. 6. Each concentrator consists of a ranking circuit to assign ranks to the active inputs, followe by an omega-inverse network, which acts as a compact concentrator for k = 2. (It has recently been establishe that a single Omega network can act as a zero an one concentrator simultaneously [16], an, hence, only one Omega network is necessary in Fig. 6b.) The ranking of k inputs is achieve by using a bit-serial pipeline binary tree, as shown in Fig. 6a. Each box or circle is a bit-serial aer (the first stage of boxes is not necessary an rawn for symmetry). The ranks an partial sums are compute bit-serially, least significant bit first. This ranker is a pipeline multistage circuit base upon a binary tree ranker escribe in [20]. In the upwar phase, each noe propagates the sum up towar the root an propagates the count from its uppermost chil horizontally. The root noe (in the mile) propagates a zero count to its uppermost chil, an propagates the incoming count from its upper chil to the lower chil. In the ownwar phase, each intermeiate noe propagates the count arriving from above irectly to its uppermost chil, an as the count from above an the count arriving horizontally, an propagates the sum to its lower chil. The leaves a the incoming count with a one if they have a connection, yieling the rank (from one to k) of the request. THEOREM 6. The use of the log-epth concentrator in the ERC architecture, which satisfies the conitions of Theorem 4, yiels a self-routing nonblocking N N connection network with O(N log N) noes, O(N log N log log log N) bits of memory, O(N log N log log log N) logic gates, an a epth of O(log N log log log N) logic gate elays. (Proof follows by substitution.) When the complexity is measure in terms of crossbar noes, the cost is an optimal O(N log N) noes. In terms of bits of memory, the harware complexity is O(N log N log log log N) bits, which is slightly suboptimal by a small factor of O(log log log N) when compare to Shannon s lower boun. In terms of logic gates, the complexity is O(N log N log log log N) an the epth is O(log N log log log N), which is an improvement over the aisy-chain concentrator esign. In situations where gate elays are more important than bits of memory, the log-epth concentrator may be preferable over the aisy-chain concentrator. 4.2 TDM Constructions A time-ivision -ilate k k crossbar can be implemente with a harware cost of O(k 2 ) logic gates an bits of memory an a latency of O(k) by using a circuit calle a time-bit concentrator [33]. A TDM-ilate crossbar switch with four incoming an four outgoing spaceivision links is shown in Fig. 7. Each link carries up to time-multiplexe bit-serial connections. On each incoming link, the bits from the -multiplexe connections keep arriving in the same orer (i.e., for a ilation of eight, bits belonging to connections arrive in orer 1, 2, 3,..., 8, an the cycle repeats). At each input port, a circular buffer J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 9 / 13

10 IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 Fig. 6. (a) An O(log N)-epth ranking circuit, (b) a -ilate 2 2 crossbar with O(log ) epth an O( log ) bit-complexity. Fig. 7. A 4 4 ilate crossbar using time-bit concentrators. stores the routing bits for the connections. The circular buffer is loae initially as the connection-request heaers pass by. Once the circular buffers are loae with routing bits, the multiplexe ata bits arrive, an each bit is route to a push-pop stack at the esire logical output port. In Fig. 7, we have two eicate push-pop stacks at each output port, so that there is no contention for pushing the stack. One stack is use to store incoming bits, while the other is emptying outgoing bits. (Each stack must allow four simultaneous pushes in constant time, an can be esigne as four separate push-pop stacks for simplicity). Once all incoming bits have been route to the output ports, of them are then transmitte forwar over each space-ivision link by having a rea-out circuit pop the stack(s) (in constant time) in a repeatable orer for time units. Any bits left in the stack(s) after this represent blocke connections, an they are roppe. During the time one stack is emptying, another stack is storing a new set of incoming bits, which supplies the outgoing bits when the cycle repeats itself. It can be verifie that this circuit acts as a -ilate crossbar, has O(k 2 ) harware an O(k) latency in gates. We point out that the time-bit concentrator is fully pipelineable, i.e., new bits can enter an exit the concentrator on every link uring every clock cycle. THEOREM 7. The use the time-bit concentrator in the ERC architecture, which satisfies the conitions of Theorem 4, yiels a self-routing nonblocking N N connection network with O(N log N) noes, O(N log N) bits of memory, O(N log N) logic gates, an a epth of O(log N log log N) logic gate elays. (Proof follows by substitution.) When the complexity is measure in terms of crossbar noes, the cost is an optimal O(N log N) noes. In terms of bits of memory an logic gates, the complexity is an optimal O(N log N), which meets Shannon s lower boun. Hence, the switch scales optimally accoring to these important practical metrics. The epth is O(log N log log N) logic gate elays, which is slightly suboptimal by the small factor of O(log log N). This esign is particularly useful in optical networks, since TDM is a natural mechanism to exploit the banwith avantage that optics offers over electronics. (It is interesting to observe that the complexity of the ERC network is an optimal O(N log N) harware for wier ilations of = O(log N), provie that h = O() an k is constant.) Variations: In practice, the epth of all the ERC self-routing networks can be reuce to an optimal O(log N) logic gate elays by keeping the ilation fixe an letting the expansion increase slowly with N. One may select a multistage network with a fixe ilation an loaing h, which has a complexity of O(N log N) harware an a epth of O(log N) logic gates. From Fig. 3, for fixe ilations, it is observe that pb will rise very slowly as N increases. To keep the connection_pb below a prescribe value, the expansion can be rea from Fig. 4. The analysis an numerical results ini- J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 10 / 13

SZYMANSKI: DESIGN PRINCIPLES FOR PRACTICAL SELF-ROUTING NONBLOCKING SWITCHING NETWORKS 11 TABLE 2 SEMICONDUCTOR INDUSTRY ASSOCIATION PROJECTIONS FOR CMOS TECHNOLOGY [28] Year Feature Size (µ) Gates Area (Sq. mm) Electrical I/O pins On-Chip Clock Off-Chip Clock 1995 0.35 0.8 M 400 900 200 Mhz 100 Mhz 1998 0.25 2 M 600 1,350 350 Mhz 175 Mhz 2001 0.18 5 M 800 2,000 500 Mhz 250 Mhz 2004 0.12 10 M 1,000 2,600 700 Mhz 350 Mhz 2007 0.10 20 M 1,250 3,600 1 Ghz 500 Mhz cate that the expansion is ª q(log log N). Hence, in practice, ERC networks can also be esigne to have O(N log N log log N) harware complexity an O(log N) logic gate elays. 5 A TERABIT SELF-ROUTING NONBLOCKING ATM SWITCH CORE FOR NETWORKS-OF- WORKSTATIONS One application of fast self-routing switching networks is the Networks of Workstations (NOWs) istribute computer architecture. NOWs interconnecte with a centralize ATM switch core base on a multistage elta network are escribe in [27]. The performance of microprocessors an communication networks have been growing exponentially over the last ecae, an these trens are expecte to continue well into the future [22], [27], [28]. By the year 2017, the single chip micros are expecte to have performances of a few Teraflops per secon, an networks are expecte to have capacities of several Terabits per secon [22]. In this section, we consier the esign of a scalable ATM switch core, which can interconnect a large number of networke workstations. Each workstation typically has a CMOS Message- Processor (MP) to hanle the communication protocols. Messages are supplie to the Message-Processors, where they may be fragmente into fixe size packets, assigne sequence numbers for error control, replicate for broacasting, queue, an then transmitte over electrical or fiber links to the centralize switch core. The MPs also perform the receiving protocols. The switch core coul support istribute share memory over a NOW. Consier a pipeline circuit-switche switch core, where the connections are establishe an torn own on a perpacket basis. (The network coul be synchronous or asynchronous.) As a connection moves forwar, a packet is transferre in a pipeline manner byte-by-byte. Pipeline circuit switching is similar to worm-hole routing [27], in that every intermeiate noe buffers a few bits or bytes of a packet, an the packet can be sprea out over many intermeiate noes as it moves through the network. Consier the esign of a 1K 1K switch core with bytewie ata paths, with a blocking probability per connection of 10-8. In practice, protocols will ensure that no estination is overloae, i.e., that a estination receives at most h packets at a time (for some number h). Attempts to transmit to an overloae estination can be flagge with a busy signal an eferre until a short time later. Assume the switch is to be esigne using CMOS technology available in the year 1998. Table 2 illustrates some of the Semiconuctor Inustry Association (SIA) estimates for CMOS technology over the next ecae [28]. The figures in Table 2 will influence the esign example, an are traitionally conservative. Multistage networks can be esigne with many stages of simple binary switches, or fewer stages of larger switches. To minimize the IC count, we will consier multistage networks with fewer stages of moerate size switches. A one-stage switch has minimal cost in terms of ICs. However, it requires very large crossbar switches. Due to electronic pin limitations, it is not possible to implement a 1K 1K switch with eight bit-wie ata paths on a single IC (yieling a one-stage network). However, using the esign principles propose in this paper, one can esign a three, five, seven stage, or arbitrary 2n - 1 stage networks with moerate size crossbars an with arbitrarily low blocking probabilities, which overcome the electrical pin-limitation problem. Consier a three stage CLOS network with moerate size 16 16 crossbar switches in each stage, where each crossbar switch is bit-serial an eight-ilate. Each bitserial Clos network represents an inepenent bit-plane in our ERC switch, an requires 16 crossbars per stage for three stages. Assume that the effective input loaing is one half, i.e., each eight-ilate input port supports four processors, rather than eight. The blocking probability of ilate networks can be rea irectly from Fig. 3c. Accoring to Fig. 3c, the bit-pb of this bit-plane is 0.0066. In other wors, less than one percent of the bit-serial connections will block on average, given a permutation traffic moel. (For a uniform ranom traffic moel, the blocking is higher an the require expansion can be recompute following the metho in Section 3.) To achieve a connection-pb of 10-8, the expansion can be rea from Fig. 4a. The require expansion is four bits. Hence, to establish a byte-wie connection, we launch 12 inepenent bit-serial connection requests into the ERC network. At each output port, a byte-wie connection is establishe if eight or more bit-serial connections survive. An 8-to-12 expaner is neee at each switch input port, to create 12 inepenent bit-serial connections from the original eight. Also, a 12-to-8 concentrator is neee at each switch output port, to compact up to 12 bit-serial requests own to an eight-bit atapath. The aitional gates neee to implement these components are negligible. (A 12-to-8 concentrator requires about 500 gates, which is negligible compare to the number of gates in the MP. Acknowlegment signals can be use to ientify vali bit-serial connections.) The self-routing ilate crossbar switches can be esigne by using the aisy-chain concentrators of Section 4.1. It can be verifie that each eight-ilate bit-serial 16 16 crossbar requires about 1 KBit of memory, which is very small when compare to the memory requirements of a J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 11 / 13

12 IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 buffere crossbar. (A buffere crossbar supporting ATM cells woul require at least one cell buffer per link, or at least 54 KBits of memory.) Due to electrical I/O pin limitations, each IC has enough I/O pins to support only five of these crossbar switches. It follows that the electrical ERC network requires 116 ICs, an has an aggregate banwith of 1.4 Terabit/sec. The architecture scales to five, seven, or more stages. The five-stage switch has a banwith of ª 22.9 Terabit/sec, an the seven-stage switch has a banwith of ª 367 Terabit/sec. The propose architecture aresses two key networking problems, as ientifie in the NSF report [36]. The problem of fast control of Gigabit/Terabit networks is partially aresse by using the ERC architecture, since it uses very fast self routing algorithms an is provably robust an immune to congestion (as emonstrate by Theorems 1-4). The problem that existing switch architectures have suboptimal harware an memory scaling properties is aresse, as the harware an memory complexity of the propose architecture scales optimally or nearly optimally. Perhaps the only major remaining problem with all electrical architectures is the large number of wires between stages. This problem can be solve by using optics as the next esign example illustrates. 5.1 Design of an Opto-Electronic Switch Core The esign of an opto-electronic switch core is summarize. To minimize cost, a one-stage switch woul be preferable. However, a one-stage switch will require a 1K 1K crossbar, which is very large. Table 3 illustrates projections for the electrical an optical I/O properties of OEICs (hereafter calle ICs). Column 2 is from [18]. Using the aisy-chain concentrator, a 1K 1K crossbar with byte-wie atapaths will require in excess of 10M gates, exceeing the gate capacity of the ICs. Hence, a three-stage ERC construction, using moerate size crossbars, can be use. TABLE 3 PROJECTED CAPACITIES FOR SINGLE CHIP OEICS (BASED ON DATA IN [2], [18]) Year Max # Optical I/O Optical Clock Max. Optical I/O BW 1995 6,000 200 Mhz 0.6 Tb/s 1998 12,000 350 Mhz 2.1 Tb/s 2001 24,000 500 Mhz 6 Tb/s 2004 40,000 700 Mhz 14 Tb/s 2007 50,000 1 Ghz 25 Tb/s Note: BW is prouct of optical I/O time optical clock ivie by two. It can be verifie that each IC has sufficient optical I/O to support a large number of eight-ilate bit-serial 16 16 crossbar switches. However, each IC is limite by logic gates to implement only about 25 of these crossbar switches, which we assume. It follows that the ERC switch core requires 24 OEICs, an has an aggregate banwith of 2.8 Terabit/sec (ouble the banwith of the electrical version, since the optical atapaths operate at the faster optical clock rate). A five-stage network has a banwith of ª 46 Terabit/sec, an a seven-stage network has a banwith of ª 734 Terabit/sec. These esigns are technologically feasible with existing OEIC technology. The atapaths to an from the switch core can be realize with commercially available parallel fiber ribbons, such as the Motorola OPTOBUS [24]. A fiel programmable logic evice with optical I/O, which can be ynamically programme to implement ilate crossbars, has been evelope [30]. Using these technologies an the propose esign principles, one may esign arbitrarily large optical switching networks using multiple stages of moerate size crossbars. 5.2 Comparison with the ALM Network Avora, Leighton, an Maggs escribe a self-routing MultiBenes network which is nonblocking an which has an asymptotically optimal cost of O(N log N) gates an bits of memory [3]. Like the ERC network, the MultiBenes network can be viewe as the concatenation of two multipath elta networks. However, the ALM routing algorithms an noes are consierably more complex than the propose ERC schemes. The ALM network requires complicate backtracking routing algorithms an partial packet buffering in the noes, which rener it unattractive for optical implementations. In aition, to achieve the optimal harware complexity, the ALM network requires linear cost expaners for which no explicit construction is known. Nevertheless, in orer to raw a comparison, we assume that their expaners can be built. The esign in [3] escribes a network with a path multiplicity of 10, a spacing between active logical input ports of 300, an the use of binary switching noes. It follows that each stage in an ALM network with 1K active input ports requires at least 300K connection atapaths. Hence, an optical version of the ALM network has a cost which is several orers of magnitue more than the ERC network. In practice, one may be able to improve the performance ALM network. However, it is more efficient to apply the ERC esign principles on the MultiBenes or Multipath Delta topology, which will yiel an optimal or near-optimal network. 6 CONCLUSIONS Principles for esigning practical self-routing nonblocking switching networks, such as those use in ATM switch cores, were propose. These principles lea to a large class of selfrouting nonblocking switching networks, which are base on the concepts of expansion, routing, an contraction. These networks aress two research priorities ientifie in a recent NSF sponsore report, the nee for fast algorithms to control the Gigabit an Terabit networks of the future, an the nee for optimally scalable switching networks [36]. The propose space omain constructions yiel self-routing nonblocking switching networks with an optimal O(N log N) bits of memory or O(N log N log log log N) logic gates. The propose time omain construction yiels self-routing nonblocking switching networks with an optimal q(n log N) bits of memory or q(n log N) logic gates. These esigns meet Shannon s lower boun on memory requirements establishe in the 1950s, an they reaily scale to large sizes. The propose architecture briges the iscrepancies between the best-known theoretical an practical results, an are attractive for both electrical an optical implementations. Fast self-routing switching networks with Terabits of bisection banwiths can be esigne using these principles. J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 12 / 13

SZYMANSKI: DESIGN PRINCIPLES FOR PRACTICAL SELF-ROUTING NONBLOCKING SWITCHING NETWORKS 13 ACKNOWLEDGMENTS We gratefully acknowlege the comments of the referees which have improve the presentation of the paper. This research was fune by NSERC Canaa Grant OPG0-0121601. REFERENCES [1] M. Ajtai, J. Komlos, an E. Szemerei, An O(NlogN) Sorting Network, Proc. 15th ACM Symp. Theory of Computation, pp. 1-9, 1983. [2] ARPA/COOP/AT&T Hybri-SEED Workshop Notes, George Mason Univ., July 1995. [3] S. Avora, F. Leighton, an B. Maggs, On Line Algorithms for Path Selection in a Nonblocking Network, Proc. 1990 ACM Symp. Theory of Computing, pp. 149-158, 1990. [4] B.D. Alleyne an I. Scherson, Expane Delta Networks for Very Large Parallel Computers, Proc. Int l Conf. Parallel Processing, pp. 127-131, 1992. [5] A. Bassalygo an M.S. Pinsker, Complexity of Optimum Nonblocking Switching Network without Reconnections, Problems of Information Transmission, vol. 9, pp. 64-66, 1974. [6] K.E. Batcher, Sorting Networks an Their Applications, Proc. 1968 Spring Joint Computer Conf., 1968. [7] M.V. Chien an A.Y. Oruc, Aaptive Binary Sorting Schemes an Associate Interconnection Networks, Proc. Int l Conf. Parallel Processing, pp. 289-293, 1992. [8] T.J. Cloonan, G.W. Richars, A.L. Lentine, F.B. McCormick, an J.R. Erickson, Free-Space Photonic Switching Architectures Base on Extene Generalize Shuffles, Applie Optics, vol. 31, no. 35, pp. 7,471-7,492, Dec. 1992. [9] T.J. Cloonan, Comparative Stuy of Optical an Electronic Interconnection Technologies for Large Asynchronous Transfer Moe Packet Switching Applications, Optical Eng., vol. 33, no. 5, pp. 1,512-1,523, May 1994. [10] G.A. De Biase, C. Ferrone, an A. Massini, An O(logN) Depth Asymptotically Nonblocking Self Routing Permutation Network, IEEE Trans. Computers, vol. 44, no. 8, pp. 1,047-1,051, Aug. 1995. [11] B. Douglass, Rearrangeable Three-Stage Interconnection Networks an Their Routing Properties, IEEE Trans. Computers, vol. 42, no. 5, pp. 559-567, May 1993. [12] H.S. Hinton, T.J. Cloonan, F.B. McCormick, A.L. Lentine, an F.A.P.Tooley, Free-Space Digital Optical Systems, Proc. IEEE, vol. 82, no. 11, pp. 1,632-1,649, Nov. 1994. [13] W. Hoeffing, On the Distribution of the Number of Successes in Inepenent Trials, Annals of Math. Statistics, vol. 27, pp. 713-721, 1956. [14] A. Huang an S. Knauer, Starlite: A Wieban Digital Switch, Proc. Globecom, Dec. 1988. [15] C.Y. Jan an A.Y. Oruc, Fast Self-Routing Permutation Switching on an Asymptotically Minimal Cost Network, IEEE Trans. Comm., vol. 42, no. 12, Dec. 1993. [16] R. Kannan, H.F. Joran, K.Y. Lee, an C. Ree, A Bit-Controlle MultiChannel Time Slot Permutation Network, Proc. Secon Int l Conf. Massively Parallel Processing Using Optical Interconnects, pp. 271-278, 1995. [17] D.M. Koppelman an A.Y. Oruc, A Self-Routing Permutation Network, J. Parallel an Distribute Computing, vol. 10, pp. 140-151, 1990. [18] A.V. Krishnamoorthy an D.A.B. Miller, Scaling Optoelectronic- VLSI Circuits into the 21st Century: A Technology Roamap, IEEE J. Selecte Topics in Quantum Electronics, vol. 2, no. 1, pp. 55-76, Apr. 1996. [19] C.P. Kruskal an M. Snir, The Performance of Multistage Interconnection Networks for Multiprocessors, IEEE Trans. Computers, vol. 32, no. 12, pp. 1,091-1,098, Dec. 1983. [20] F.T. Leighton, Parallel Algorithms an Architectures: Arrays, Trees an Hypercubes. Morgan-Kaufmann, 1992. [21] A.L. Lentine et al., 700 Mb/s Operation of Optoelectronic Switching Noes Comprise of Flip-Chip-Bone GaAs/AlGaAS MQW Moulators an Detectors on Silicon CMOS Circuitry, Proc. Conf. Lasers an Electrooptics, 1995. [22] T. Lewis, The Next 10,000 2 Years: Part 1, Computer, Apr. 1996, pp. 64-70, Part 2, pp. 78-86, May 1996. [23] D. Mitra an R.A. Cieslak, Ranomize Parallel Communications on an Extension of the Omega Network, J. ACM, vol. 34, pp. 802-824, 1987. [24] Motorola, OPTOBUS Data Sheet, Logic Integrate Circuits Division, 1995. [25] D. Nassimi an S. Sahni, Parallel Permutation an Sorting Algorithms an a New Generalize Connection Network, J. ACM, pp. 642-667, 1982 [26] J.H. Patel, Performance of Processor-Memory Interconnections for Multiprocessors, IEEE Trans. Computers, vol. 30, no. 10, pp. 771-780, Oct. 1981. [27] J.L. Hennessey an D.A. Patterson, Computer Architecture, A Quantatative Approach, secon eition. San Francisco: Morgan-Kauffman, 1995. [28] Semiconuctor Inustry Association, The National Technology Roamap for Semiconuctors, San Jose, Calif.: SIA, 1994. [29] C.E. Shannon, Memory Requirements in a Telephone Exchange, Bell. Systems Technical J., 1953. [30] S. Sherif, T.H. Szymanski, an H.S. Hinton, Design an Implementation of a Fiel Programmable Smart Pixel Array, Proc. LEOS 96 Conf. Smart Pixels, Keystone, Colo., Aug. 1996. [31] B. Supmonchai an T.H. Szymanski, Fast Self-Routing Concentrators for Optoelectronic Systems, submitte. [32] T.H. Szymanski an V.C. Hamacher, On the Universality of Multipath Multistage Interconnection Networks, Interconnection Networks, I. Scherson an Youseff, es., IEEE CS Press, 1994. [33] T.H. Szymanski an C. Fang, Ranomize Routing of Virtual Connections in Essentially Nonblocking log N-Depth Networks, IEEE Trans. Comm., pp. 2,521-2,531, Sept. 1995. [34] T.H. Szymanski an H.S. Hinton, Reconfigurable Intelligent Optical Backplane for Parallel Computing an Communications, Applie Optics, pp. 1,253-1,268, Mar. 1996. [35] C.D. Thompson, Generalize Connection Networks for Parallel Processor Intercommunication, IEEE Trans. Computers, vol. 27, no. 12, pp. 1,119-1,125, Dec. 1978. [36] U.S. National Science Founation, Research Priorities in Networking an Communications, Report to the NSF Division of Networking an Communications Research an Infrastructure, May 12-14, 1994, Arlington, Va. [37] E. Upfal, S. Felperin, an M Snir, Ranomize Routing with Shorter Paths, IEEE Trans. Parallel an Distribute Systems, vol. 7, no. 4, pp. 356-362, Apr. 1996. [38] L.G. Valiant an G.J. Brebner, Universal Schemes for Parallel Communications, Proc. 13th Ann. ACM Symp. Theory of Computing, pp. 263-277, 1981. [39] M. Yamaguchi, an K-I Yukimatsu, Recent Free-Space Photonic Switches, IEICE Trans. Comm., vol. E77B, no. 2, Feb. 1994. Te H. Szymanski receive his BSc egree in engineering science, an the MASc an PhD egrees in electrical engineering from the University of Toronto. From 1987 to 1991, he was an assistant professor at Columbia University an a principle investigator at the U.S. National Science Founation Center for Telecommunications Research working on ATM switching networks an WDM optical architectures. He is currently an associate professor an the irector of the Microelectronics an Computer Systems Laboratory at McGill University, an a project leaer in the Canaian Institute for Telecommunications Research, leaing a project on Optical Architectures an Applications. An intelligent optical backplane architecture evelope by this project is being constructe in Canaa. He is active professionally, an has serve on the program committees for the 1998 an 1997 Workshops on Optics in Computer Science, the 1998 an 1997 International Conferences on Massively Parallel Processing using Optical Interconnects, the 1998 International Conference on Optical Computing, the 1997 Innovative Systems on Silicon Conference, the 1995 Workshop on High-Spee Network Computing, an the 1998, 1995, an 1994 Canaian Conferences on Programmable Logic Devices. He has also serve as Session Organizer for IEEE Infocom, the International Computing an Communication Conference, an the International Conference on Parallel Processing. His personal interests incluing snowboaring an cycling, an his research interests inclue telecommunication an computing architectures, performance moeling, an optical networks. He is a member of the IEEE Computer an Communications societies. J:\PRODUCTION\TC\3-FINAL\09-97\105193_2.DOC regularpaper97.ot KSM 19,968 08/11/97 2:11 PM 13 / 13