
Matrix Unit Cell Scheduler (MUCS) for Input-Buffered ATM Switches

Haoran Duan, John W. Lockwood, and Sung Mo Kang
University of Illinois at Urbana-Champaign
Department of Electrical and Computer Engineering
Center for Compound Semiconductor Microelectronics and Beckman Institute for Advanced Science and Technology
405 N. Mathews Ave., Urbana, IL 61801
(217) 244-1565 (Voice), (217) 244-8371 (FAX)
email: hd@ipoint.vlsi.uiuc.edu
WWW: http://ipoint.vlsi.uiuc.edu

September 17, 1997

This research has been supported by National Science Foundation Engineering Research Center grant ECD 89-43166, a Defense Advanced Research Projects Agency (DARPA) grant to the Center for Optoelectronic Science and Technology (COST), MDA 972-94-1-0004, and the AT&T Foundation.

ABSTRACT

This paper presents a novel matrix unit cell scheduler (MUCS) for input-buffered ATM switches. The MUCS concept originates from a heuristic strategy that leads to an optimal solution for cell scheduling. Numerical analysis indicates that input-buffered ATM switches scheduled by MUCS can utilize nearly 100% of the available link bandwidth. A transistor-level MUCS circuit has been designed and verified using HSPICE. The circuit features a regular structure, minimal interconnects, and a low transistor count. HSPICE simulation indicates that, using 2 µm CMOS technology, the MUCS circuit can operate at a clock frequency of 100 MHz.

1 Introduction

The input-queuing (IQ) architecture is attractive for building ultra-high-speed ATM switches [1] [2]. By queuing incoming traffic according to its output destination at each input port (N queues per port) and scheduling the transmission of queued cells effectively, an IQ-ATM switch can avoid HOL blocking and achieve 100% throughput [3]. The lack of an efficient cell scheduler, however, has been one of the major barriers to building a high-performance, scalable IQ-ATM switch. Such a scheduler must resolve output contention swiftly, provide high throughput, satisfy QoS requirements, and, most importantly, be implementable in a low-cost integrated circuit. Existing scheduling algorithms are complicated and costly to implement in hardware. PIM-based (parallel iterative matching) algorithms [4], such as the IRRM (iterative round-robin matching) and SLIP-IRRM (IRRM with SLIP) schedulers [5], require 2N round-robin arbiters with O(N^2) interconnect complexity [6]. Neural-network-based solutions require O(N^2) neurons and O(N^3) interconnects [7]. In this paper, we present a novel matrix unit cell scheduler (MUCS). MUCS enables ATM switches to utilize nearly 100% of the available link bandwidth and requires significantly less hardware to implement. Its transistor-level circuit implementation features a uniform structure, has a very low interconnect and transistor count, and achieves near-optimal cell scheduling within linear processing time [8].
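To make the queuing model concrete, the per-port organization can be pictured in software as follows. This is a minimal illustrative sketch of virtual output queuing and of how a row of the traffic matrix used in Section 2 could be derived from it; the names (InputPort, traffic_matrix) and the exact QoS-index convention are our own, not the iPOINT implementation.

from collections import deque

class InputPort:
    """Input port with one FIFO per output destination (virtual output queuing).

    Illustration only: each queued cell carries a QoS index, and the traffic
    vector reports the highest index among the cells waiting for each output
    (0 when that queue is empty), as used by the traffic matrix A in Section 2.
    """
    def __init__(self, n_outputs):
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, output, cell, qos=1):
        self.queues[output].append((qos, cell))

    def traffic_vector(self):
        return [max((qos for qos, _ in q), default=0) for q in self.queues]

def traffic_matrix(ports):
    """Row i of A is the traffic vector of input port i."""
    return [p.traffic_vector() for p in ports]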

2 The MUCS Algorithm

MUCS resolves output contention by computing a traffic matrix A. Initially, each row of A summarizes the status of the queued cells at one input port of the switch through an N-element traffic vector. Each element a_ij of A is a nonnegative integer: a_ij = 0 indicates that no cell is ready to be switched from input port i to output port j, while a positive value indicates the highest QoS index among all buffered cells to be switched from input port i to output port j [9] [10].

The MUCS algorithm selects an optimal set of entries of A as winning cells according to the weight w_ij of a_ij. It selects the heaviest entries under the constraint that no more than one cell is transmitted from an input port and no more than one cell is delivered to an output port simultaneously. When implemented in a MUCS module, the selection process is performed fully in parallel by the hardware. Letting N be the size of the switch and M the size of the reduced traffic matrix A' at each iteration, the algorithm is as follows:

1. Initially, M = N and A' = A.

2. At each iteration, an entry weight w_ij is assigned to every entry of A', where

   w_{ij} = \begin{cases} a_{ij}/wr_i + a_{ij}/wc_j, & \text{if } a_{ij} \neq 0 \\ 0, & \text{otherwise} \end{cases}   (1)

   wr_i = \sum_{j=1}^{M} [a_{ij} > 0]   (2)

   wc_j = \sum_{i=1}^{M} [a_{ij} > 0]   (3)

   that is, wr_i and wc_j count the nonzero entries of row i and column j of A'.

3. The entry with the heaviest weight is selected as the winner. Multiple entries can be selected as winners simultaneously if they all have the same weight and do not reside in the same row or column. Once the winners have been selected, a new reduced matrix A' is formed from the elements of the current A', excluding the rows and columns that contain winning cells. The algorithm terminates when all entries have been treated.

4. If a tie occurs, winners are selected randomly.
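The selection loop can be expressed in software as the following sketch. It is a sequential model for illustration only (function and variable names are ours); the hardware described in Section 4 performs the weight computation and winner selection fully in parallel. Exact fractions are used so that ties are detected exactly.

import random
from fractions import Fraction

def mucs_schedule(A):
    """Sequential sketch of MUCS winner selection on an N x N traffic matrix A.

    A[i][j] > 0 is the QoS index of the best cell queued at input i for
    output j; A[i][j] == 0 means no such cell.  Returns a list of
    (input, output) pairs with at most one winner per row and per column.
    """
    N = len(A)
    active_rows, active_cols = set(range(N)), set(range(N))
    winners = []
    while active_rows and active_cols:
        # wr_i / wc_j: nonzero entries per active row / column (Eqs. (2), (3))
        wr = {i: sum(1 for j in active_cols if A[i][j] > 0) for i in active_rows}
        wc = {j: sum(1 for i in active_rows if A[i][j] > 0) for j in active_cols}
        # entry weights, Eq. (1); exact fractions make tie detection exact
        weights = {(i, j): Fraction(A[i][j], wr[i]) + Fraction(A[i][j], wc[j])
                   for i in active_rows for j in active_cols if A[i][j] > 0}
        if not weights:
            break                     # no schedulable cells remain
        heaviest = max(weights.values())
        tied = [e for e, w in weights.items() if w == heaviest]
        random.shuffle(tied)          # random tie-breaking (step 4)
        for i, j in tied:             # accept tied winners not sharing a row/column
            if i in active_rows and j in active_cols:
                winners.append((i, j))
                active_rows.remove(i)
                active_cols.remove(j)
    return winners

For a binary matrix this reduces to repeatedly picking the entries whose row and column contain the fewest candidates, as discussed in Section 3.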

3 Heuristic Strategy in MUCS

The MUCS algorithm originates from a heuristic strategy that yields a most satisfactory solution when conflicts of interest exist. In the ATM scheduling problem, the best interest of an input port is defined as the transmission of a cell, and the best interest of an output port is defined as the reception of a cell. An optimal scheduling decision should enable the transmission of the maximum number of queued cells. The heuristic strategy behind the MUCS algorithm is most easily described when a_ij is discretized to a binary value. In this case, Eq. (1) reduces to

   w_{ij} = \begin{cases} (wr_i)^{-1} + (wc_j)^{-1}, & \text{if } a_{ij} \neq 0 \\ 0, & \text{otherwise} \end{cases}   (4)

The heaviest element in the matrix corresponds to a buffered cell that has the fewest contending candidates in the same row and the same column. By choosing the heaviest element(s) first, each iteration of MUCS leaves the remaining elements with the maximum number of scheduling opportunities. This leads to a transmission schedule that maximizes the aggregate throughput. When a_ij is not discretized and Eq. (1) is applied, the solution MUCS finds will sacrifice some overall throughput to satisfy certain QoS requirements. For example, if cell priority is the major factor that determines a_ij, overall throughput is traded for lower delay of higher-priority cells. Many variations and extensions of the MUCS algorithm can be made by modifying the function that generates a_ij.
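As a small worked example of our own (not taken from the paper), consider a binary 3x3 traffic matrix:

   A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix}, \qquad (wr_1, wr_2, wr_3) = (2, 1, 3), \qquad (wc_1, wc_2, wc_3) = (3, 2, 1).

By Eq. (4) the heaviest entries are w_{21} = 1 + 1/3 = 4/3 and w_{33} = 1/3 + 1 = 4/3. They share no row or column, so cells (2,1) and (3,3) win the first iteration; the reduced matrix leaves only a_12, which wins the second. All three inputs transmit, which is a maximum matching, and the first winners are precisely the entries with the fewest contenders (input 2 has a single eligible output, output 3 a single eligible input).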

4 The MUCS Implementation

A novel mixed digital-analog core has been designed to implement the computation of the MUCS algorithm. A block diagram of the MUCS core is given in Figure 1. The circuit has a regular structure that resembles a symmetric matrix. Its major building blocks are a constant DC current source, a process unit, and a line control unit. For an NxN switch, N^2 process units are needed, each corresponding to an element of the traffic matrix. Each process unit consists of circuit switches, two current dividers (one for the column and one for the row), and capacitors. Signals from the line control units open and close these switches to charge and discharge the capacitors. The value of a_ij is represented by the values of the capacitors.

[Figure 1: Building blocks of the MUCS core. (a) The MUCS core: an array of process units fed by row and column DC current sources (I = i_N) and controlled by line control units. (b) The process unit: switches sw1-sw4, current dividers producing i_N/wr_i and i_N/wc_j, and a capacitor C, with cell_exist (a_ij), charging_off, win_lost, and reset/discharge control signals.]

When fixed capacitors are used, the traffic matrix A becomes a binary matrix after it is loaded into the MUCS circuit, and the solution is optimized for maximum throughput. When variable capacitors are employed, different values of a_ij (QoS factors) can be included in the scheduling.

The MUCS core computes the solution in N steps. Each step consists of a charging phase followed by a discharging phase. In the charging phase, sw3 is closed and sw4 is open for each participating process unit; the setting is reversed for the discharging phase. Let i_N be the value of the constant DC current supplied, and let wr_i and wc_j (refer to Eqs. (2) and (3)) be the numbers of active entries in each row and column of the reduced traffic matrix. Then, in the charging phase, if a_ij > 0, the two current-divider blocks of a process unit drain currents of i_N/wr_i and i_N/wc_j from the column and row current sources, respectively. The combined current charges the capacitor C. The heavier the weight of an entry, the larger the total current drained. The winner is indicated by the voltage level on the corresponding capacitor in the array. When the voltage of the winning capacitor reaches a pre-defined threshold, the line control logic is triggered to open sw3, terminating the charging process and eliminating the other elements of the same row and column. The discharging phase erases the residual charge on the capacitors of the process units that belong to the unresolved rows and columns. This, in effect, reduces the dimensions of the matrix for the next charging phase. The MUCS core therefore absorbs the computational complexity of the traffic-matrix entry weight assignment and heaviest-weight element selection by mapping them onto the capacitor charging/discharging procedure of the circuit and executing it fully in parallel in hardware.
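The charging-phase competition can be mimicked numerically: each active process unit charges at a rate proportional to i_N/wr_i + i_N/wc_j, so its capacitor crosses the threshold after a time inversely proportional to its entry weight, and the heaviest entries win. The following sketch models one charging phase for the fixed-capacitor (binary) case; the function name and component values are ours, and no attempt is made to model the transistor-level behaviour.

def first_to_threshold(A, i_n=1.0, c=1.0, v_th=1.0):
    """Toy model of one MUCS charging phase (fixed capacitors, binary A).

    Each active process unit drains i_n/wr_i and i_n/wc_j from its two
    current sources; the combined current charges its capacitor, so the
    time to reach the threshold is t = c * v_th / I.  The entries returned
    are those whose capacitors cross the threshold first, i.e. the
    heaviest-weighted entries.  Component values are arbitrary.
    """
    n = len(A)
    wr = [sum(1 for j in range(n) if A[i][j] > 0) for i in range(n)]
    wc = [sum(1 for i in range(n) if A[i][j] > 0) for j in range(n)]
    times = {}
    for i in range(n):
        for j in range(n):
            if A[i][j] > 0:
                current = i_n / wr[i] + i_n / wc[j]   # combined charging current
                times[(i, j)] = c * v_th / current    # time to reach v_th
    if not times:
        return []
    t_min = min(times.values())
    return [e for e, t in times.items() if abs(t - t_min) < 1e-12]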

A transistor-level circuit has been designed that implements the MUCS core. The circuit has a regular structure, minimal interconnects, and a low transistor count. Each process unit requires only 29 transistors and 2 capacitors. Besides the charging-current inputs, each process unit has only three common wired signals for interconnection between process units [11]. HSPICE was used to verify the design. Simulation results indicate that with 2 µm CMOS technology, the MUCS circuit can operate at a clock frequency of 100 MHz. Using larger current sources, reducing the capacitor sizes, or reducing the transistor feature size can make the MUCS circuit operate even faster.

5 Numerical Analysis of MUCS Performance

MUCS achieves near-optimal throughput when it is optimized for selecting the maximum number of cells. In this case, all entries of the matrix A are binary.

5.1 Selecting cells from random traffic matrices

Simulation was performed to find the normalized average number of cells selected by MUCS from random binary traffic matrices. The results were compared to the maximum number of cells that could be switched. The parameter p is defined as the probability that a_ij = 1. Figure 2 compares the normalized throughput of the optimal solution and of the solution found by MUCS for a switch of size 8, with p varying from 0 to 1. As indicated, the two plots nearly overlap: the heuristic strategy provides a near-optimal solution.

[Figure 2: 8x8 switch throughput vs. p. Normalized throughput of MUCS compared with the achievable (optimal) throughput as a function of the probability p.]
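The experiment of Section 5.1 can be reproduced in outline with the sketch below (our own code, not the authors' simulator). It reuses mucs_schedule from the sketch in Section 2 and computes the optimum with a standard augmenting-path maximum-matching search; compare(n=8, p=0.5) returns the average number of cells selected by MUCS and the average optimum over random binary matrices.

import random

def max_matching(A):
    """Maximum number of cells an optimal scheduler could switch
    (standard augmenting-path bipartite matching; adequate for small N)."""
    n = len(A)
    match_col = [-1] * n              # match_col[j] = input matched to output j

    def try_assign(i, seen):
        for j in range(n):
            if A[i][j] > 0 and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_assign(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    return sum(try_assign(i, [False] * n) for i in range(n))

def compare(n=8, p=0.5, trials=1000):
    """Average cells selected by MUCS vs. the optimum on random binary
    matrices in which each entry is 1 with probability p."""
    mucs_total = opt_total = 0
    for _ in range(trials):
        A = [[1 if random.random() < p else 0 for _ in range(n)] for _ in range(n)]
        mucs_total += len(mucs_schedule(A))   # mucs_schedule: sketch in Section 2
        opt_total += max_matching(A)
    return mucs_total / trials, opt_total / trials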

5.2 Scheduling IQ-ATM switches

Switches of various sizes (N = 5, 8, 16, 32) were simulated to determine the throughput of IQ-ATM switches scheduled by MUCS. Increasing loads of traffic with uniform random arrivals, Poisson arrivals, and bursty arrival patterns were imposed on the switch. The simulation results are summarized in Figure 3.

[Figure 3: Simulation results for unlimited buffers. Average delay (cell slots) versus load for MUCS-scheduled switches with 5, 8, 16, and 32 ports: (a) uniform random arrivals, (b) Poisson arrivals, (c) bursty arrivals with average burst length 5, (d) bursty arrivals with average burst length 10.]

For uniform random arrivals and Poisson arrivals (Figure 3(a)(b)), the normalized throughput can approach 100%. For bursty traffic arrivals, the switches can be loaded to near the maximum cell arrival rate for different burst lengths. As indicated in Figure 3(c)(d), for traffic with an average burst length of L = 5 or L = 10, MUCS can sustain 83.3% or 90.9% link utilization with a mean queue length of less than 4 or 14 cells, respectively. Note that the delay-throughput performance does not degrade as the number of switch ports increases, which demonstrates the scalability of MUCS.
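The bursty arrival pattern of Figure 3(c)(d) is not specified here in detail; one common approximation, assumed purely for illustration, is an on/off source with geometrically distributed burst lengths of the stated mean, with every cell of a burst addressed to the same output. A sketch of such a source (name and model are our own, not necessarily the traffic model used in the paper):

import random

def bursty_arrivals(n_outputs, load, mean_burst, n_slots, seed=0):
    """On/off bursty cell source for one input port (illustrative model only).

    During a burst one cell is generated per slot, all cells of the burst
    destined to the same randomly chosen output; burst lengths are geometric
    with mean `mean_burst`, and idle gaps are sized so that the long-run cell
    rate equals `load` (0 < load <= 1).  Yields the destination output for
    each slot, or None when the slot is idle.
    """
    rng = random.Random(seed)
    p_end = 1.0 / mean_burst                       # geometric burst termination
    mean_idle = mean_burst * (1.0 - load) / load   # mean idle gap (slots)
    p_start = 1.0 / (mean_idle + 1.0)              # geometric idle termination
    in_burst, dest = False, None
    for _ in range(n_slots):
        if not in_burst and rng.random() < p_start:
            in_burst, dest = True, rng.randrange(n_outputs)
        if in_burst:
            yield dest
            if rng.random() < p_end:
                in_burst = False
        else:
            yield None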

6 Conclusion

MUCS provides a highly effective mechanism for scheduling input-buffered ATM switches. It provides near-optimal throughput and can be implemented with simple hardware. Its performance has been verified through simulation, and its hardware design has been verified with HSPICE.

References

[1] J. W. Lockwood, H. Duan, J. J. Morikuni, S. M. Kang, S. Akkineni, and R. H. Campbell, "Scalable optoelectronic ATM networks: The iPOINT fully functional testbed," IEEE Journal of Lightwave Technology, vol. 13, pp. 1093-1103, June 1995.

[2] B. Prabhakar, N. McKeown, and R. Ahuja, "Multicast scheduling for input-queued switches," to appear in IEEE Journal on Selected Areas in Communications, 1997.

[3] N. McKeown, V. Anantharam, and J. Walrand, "Achieving 100% throughput in an input-queued switch," in INFOCOM, Mar. 1996, pp. 296-302.

[4] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High-speed switch scheduling for local-area networks," ACM Transactions on Computer Systems, vol. 11, pp. 319-352, Nov. 1993.

[5] N. McKeown, P. Varaiya, and J. Walrand, "Scheduling cells in an input-queued switch," Electronics Letters, vol. 29, pp. 2174-2175, Dec. 1993.

[6] N. McKeown, M. Izzard, A. Mekkittikul, B. Ellersick, and M. Horowitz, "The Tiny Tera: A packet switch core," IEEE Micro Magazine, vol. 17, pp. 27-40, Feb. 1997.

[7] M. K. Mehmet-Ali, M. Youssefi, and H. T. Nguyen, "The performance analysis and implementation of an input access scheme in a high-speed packet switch," IEEE Transactions on Communications, vol. 42, pp. 3189-3199, Dec. 1994.

[8] H. Duan, "Design and development of cell queuing, processing, and scheduling modules for the iPOINT input-buffered ATM testbed," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 1997.

[9] H. Duan, J. W. Lockwood, S. M. Kang, and J. D. Will, "A high-performance OC-12/OC-48 queue design prototype for input-buffered ATM switches," in INFOCOM, Apr. 1997, pp. 20-28.

[10] H. Duan, J. W. Lockwood, and S. M. Kang, "A 3-dimensional queue (3DQ) for practical ultra-broadband input-buffered ATM switches," submitted to IEEE Transactions on VLSI Systems, 1997.

[11] H. Duan, J. W. Lockwood, and S. M. Kang, "Scalable broadband input-queued ATM switch including weight driven cell scheduler," pending US patent application 121.61488, Oct. 1996.