A FAST ARBITRATION SCHEME FOR TERABIT PACKET SWITCHES


H. Jonathan Chao, Cheuk H. Lam, and Xiaolei Guo
Department of Electrical Engineering, Polytechnic University, NY, USA

Abstract

Input-output queued switches have been widely considered the most feasible solution for large-capacity packet switches and IP routers. The challenge is to develop a high-speed and cost-effective arbitration scheme that maximizes switch throughput and delay performance in support of multimedia services with various quality-of-service (QoS) requirements. In this paper, we propose a ping-pong arbitration (PPA) scheme for output contention resolution in input-output queued switches. The basic idea is to divide the inputs into groups and apply arbitration recursively. Our recursive arbiter is hierarchically structured, consisting of multiple small-size arbiters at each layer. The arbitration time of an n-input switch is proportional to $\lceil \log_4 (n/2) \rceil$ when we group every two inputs or every two input groups at each layer. We present a 256 x 256 terabit crossbar multicast packet switch using the PPA. The design shows that our scheme can reduce the arbitration time of the 256 x 256 switch to 11 gate delays, demonstrating that arbitration is no longer the bottleneck limiting the switch capacity.

1 Introduction

Packet switching has been recognized as the key multiplexing and data transfer technique for future Broadband Integrated Services Digital Networks (B-ISDN) [1] and as a basic means of high-speed implementation of gigabit/terabit IP routers [2, 3] for the Next Generation Internet. To provide multimedia services with various quality-of-service (QoS) requirements, one of the challenging issues in designing a terabit switch is to develop a high-speed and cost-effective output contention resolution scheme that maximizes switch throughput and delay performance.

A packet switch consists of input and output ports interconnected by a switch fabric.
The switch fabric can use a shared-medium (e.g., bus), shared-memory, or space-division (e.g., crossbar) architecture [1]. The function of a packet switch is to transfer packets (actually cells, see Footnote 1) from the input ports to the appropriate output ports based on the addresses contained within the packet headers. Multiple packets from different input ports could be destined for the same output port at the same time, a situation we call output contention or output conflict. A switch arbitration or scheduling algorithm is therefore needed to choose among them the one that the output most prefers in that time slot, grant the corresponding input, and configure the switch fabric to transfer the packet. Reducing the arbitration time can significantly reduce the packet delay across a switch, thus enabling high-speed implementation. This paper addresses this issue.

[Footnote 1: Asynchronous Transfer Mode (ATM) switching is a special case of packet switching with equally sized (53-byte) packets called cells.]

Consider a packet switch with n input/output ports. The switch can be classified as an output queued, input queued, or input-output queued switch. Define the speedup (c) of the switch fabric as the ratio of the switch fabric bandwidth to the bandwidth of the input links. (Unless otherwise stated, we assume henceforth that every input/output link has the same capacity.) An output queued switch is one with $c \ge n$. Since each output port can receive n incoming packets in a time slot, there is no output contention. Desirably, the switch has zero input queuing delay, if store-and-forward implementation is not considered. However, the well-known problem of an output queued switch is that the output port memory speed limits its ability to buffer all possible incoming packets. An input queued switch has no speedup (i.e., c = 1) and thus is much easier to implement.
However, it suffers from the well-known problem of head-of-line (HOL) blocking [4], which can limit its maximum throughput to about 58% when it uses a first-in-first-out (FIFO) queue at each input port and operates under uniform traffic (i.e., the output address of each packet is independently and equally distributed among all outputs). Many techniques have been suggested to reduce HOL blocking, for example, by considering more than one cell at the head of each FIFO [5]. HOL blocking can be eliminated entirely by using virtual output queuing (VOQ) [6], where each input maintains a separate queue for each output.

[Footnote 2: In practice, variable-length packets are usually broken into fixed-size cells (not necessarily 53 bytes) before being transmitted across the switch fabric; the cells are reassembled at the output of the switch [7].]

[Page footer: IEEE Global Telecommunications Conference - Globecom'99]

To achieve 100% throughput in an input-queued switch with VOQs, sophisticated arbitration is required to schedule packets between the various inputs and outputs. It is simply an application of bipartite graph matching [7]: each output must be paired with at most one input that has a cell destined for that output, a complex procedure to implement in hardware. It has been shown that an input buffered switch with VOQs can provide asymptotic 100% throughput using a maximum matching algorithm [9]. However, the complexity of the best known maximum matching algorithm is $O(n^{2.5})$ [11], which is too high for high-speed implementation. In practice, a number of maximal matching algorithms have been proposed, such as parallel iterative matching (PIM) [7], iterative round-robin matching (iSLIP) [8], and dual round-robin matching (DRRM) [10]. Their complexities are still too high.

An input-output queued switch uses a speedup of c > 1. A recent study [12] shows that it is possible to achieve 100% switch throughput with a moderate speedup of 2. Since each output port can receive up to c cells in a time slot (and each input port can send up to c cells during the same time), the requirement on the number of input-output matches found in each arbitration cycle (there are c cycles in a time slot) may be relaxed, enabling simpler arbitration schemes. On the other hand, the time available for each arbitration is reduced by a factor of c, making the time constraint more stringent.

This motivates us to develop a ping-pong arbitration (PPA) scheme for output contention resolution in terabit packet switching. The basic idea is to divide the inputs into groups and apply arbitration recursively. Traditional arbiters handle all inputs together, and the arbitration time is proportional to the number of inputs. As a result, the switch size or capacity is limited given a fixed amount of arbitration time.
Our recursive arbiter is hierarchically structured, consisting of multiple small-size arbiters at each layer. The arbitration time of an n-input switch is proportional to $\lceil \log_4 (n/2) \rceil$ when we group every two inputs or every two input groups at each layer. We present a 256 x 256 terabit crossbar multicast packet switch using the PPA. The design shows that our scheme can reduce the arbitration time of the 256 x 256 switch to 11 gate delays, demonstrating that arbitration is no longer the bottleneck limiting the switch capacity.

[Footnote 3: A maximum match is one that pairs the maximum number of inputs and outputs together; there is no other pairing that matches more inputs and outputs [7].]

[Footnote 4: A maximal match is one for which pairings cannot be trivially added; each node (i.e., input or output) is either matched or has no edge (i.e., connection path) to an unmatched node [7].]

The rest of this paper is organized as follows. Section 2 introduces the PPA and its performance study. Section 3 describes the implementation of the PPA. Section 4 shows a 256 x 256 terabit crossbar multicast packet switch using the PPA. Section 5 presents the conclusion.

2 Ping-Pong Arbitration (PPA)

2.1 Principles of Ping-Pong Arbitration

Consider an n-input packet switch. To resolve its output contention, one solution is to use an arbiter for each output that fairly selects one among the incoming packets and sends back a grant signal to the corresponding input. The arbitration procedure is as follows:

1. During every arbitration cycle, each input submits a one-bit request signal to each output (arbiter), indicating whether its packet, if any, is destined for that output.
2. Each output arbiter collects n request signals, among which one input with an active request is granted according to some priority order.
3. A grant signal is sent back to acknowledge the input.

The paper focuses on the second step, which arbitrates one input among n possible ones.
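The three steps above can be sketched in software as follows. This is a minimal illustration, not the paper's hardware design: the function names (`arbitration_cycle`, `lowest_index_first`) are hypothetical, each input is assumed to hold at most one head-of-line packet, and the priority order of step 2 is left as a pluggable function, with a fixed lowest-index-first rule standing in as a placeholder.

```python
from typing import Callable, List, Optional, Sequence


def lowest_index_first(requests: List[bool]) -> Optional[int]:
    """Placeholder priority order: grant the lowest-numbered requesting input."""
    for i, active in enumerate(requests):
        if active:
            return i
    return None


def arbitration_cycle(
    hol_dest: Sequence[Optional[int]],
    n_outputs: int,
    pick: Callable[[List[bool]], Optional[int]],
) -> List[Optional[int]]:
    """Run one arbitration cycle.

    hol_dest[i] is the output requested by input i (None if it has no packet).
    Returns, for each input, the output that granted it, or None.
    """
    grants: List[Optional[int]] = [None] * len(hol_dest)
    for out in range(n_outputs):
        # Step 1: every input submits a one-bit request to this output's arbiter.
        requests = [dest == out for dest in hol_dest]
        # Step 2: the arbiter grants one input among the active requests.
        winner = pick(requests)
        # Step 3: the grant signal is sent back to the winning input.
        if winner is not None:
            grants[winner] = out
    return grants


# Inputs 0 and 2 both want output 2; input 1 wants output 0; input 3 is idle.
grants = arbitration_cycle([2, 0, 2, None], n_outputs=4, pick=lowest_index_first)
print(grants)
```

In this example inputs 0 and 2 contend for output 2, and only one of them is granted; the loser must retry in a later cycle. Swapping `pick` for a fairer rule changes only step 2, which is the part the paper concentrates on.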
A simple round-robin scheme is generally adopted in an arbiter to ensure fair arbitration among the inputs, as in iSLIP [8] and DRRM [10]. Imagine there is a token circulating among the inputs in a certain order. The input that is granted by the arbiter is said to grasp the token, which represents the grant signal. The arbiter is responsible for moving the token among the inputs that have request signals. Traditional arbiters handle all inputs together, and the arbitration time is proportional to the number of inputs. As a result, the switch size or capacity is limited given a fixed amount of arbitration time.

Here we suggest dividing the inputs into groups. Each group has its own arbiter. The request information of each group is summarized as a group request signal. Further grouping can be applied recursively to all the group request signals at the current layer, forming a tree structure, as illustrated in Figure 1. Thus, an arbiter with n inputs can be constructed using multiple small-size arbiters (AR) at each layer. Different group sizes can be used.
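As a point of reference for the grouped design, the token-style round-robin arbiter described at the start of this passage can be sketched as below. The class name `RoundRobinArbiter` and its `pointer` field are illustrative assumptions; the sequential scan is only a software stand-in for the hardware, but it makes visible why the work grows with the number of inputs n.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Flat n-input round-robin arbiter (software sketch, hypothetical names)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.pointer = 0  # position of the token: the input with highest priority

    def arbitrate(self, requests: List[bool]) -> Optional[int]:
        # Scan at most n inputs, starting from the token position.
        for offset in range(self.n):
            i = (self.pointer + offset) % self.n
            if requests[i]:
                # The granted input grasps the token; the next input is favored next.
                self.pointer = (i + 1) % self.n
                return i
        return None  # no active request in this cycle


arb = RoundRobinArbiter(4)
# Inputs 0, 2 and 3 request in every cycle; input 1 never does.
order = [arb.arbitrate([True, False, True, True]) for _ in range(4)]
print(order)
```

The token skips the idle input 1 and visits the requesting inputs in turn, which is the fairness property the grouped arbiter also has to preserve.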

[Figure 1: a tree of 2-input arbiters (AR2) with k = 4 layers; image not recoverable]

Assume $n = 2^k$. Figure 1 depicts a k-layer complete binary tree with a group size of two when k = 4. AR2 represents a 2-input AR. An AR2 contains an internal feedback signal that indicates which input is favored. Once an input is granted in an arbitration cycle, the other input will be favored in the next cycle. In other words, the granted request is always chosen between the left (input) and the right alternately. That is why we call it ping-pong arbitration (PPA). This mechanism is maintained by producing an output flag signal that is fed back to the input; a register is required to forward this signal at the beginning of each arbitration cycle.

The first layer consists of $2^{k-1}$ arbiters, which we call leaf AR2s. The next k - 2 layers consist of arbiters called intermediate AR2s, $2^{k-i}$ of which are at layer i. Finally, the last layer consists of only one arbiter, called the root AR2.

Every AR2 has two request signals. An input request signal at layer i is the group request signal of $2^{i-1}$ inputs and can be produced by OR gates either directly or recursively. The grant signal from an AR2 has to be fed back to all the lower-layer AR2s related to the corresponding input. Therefore, in addition to the feedback flag signal, an AR2 has an external grant signal that ANDs all grant signals at upper layers, indicating the arbitration results of the upper layers. One important use of the external grant signal is to govern the local flag signal update.
If the external grant signal is invalid, which indicates that these two input requests as a whole are not granted at some upper layer(s), then the flag should be kept unchanged in order to preserve the original preference. The root AR2 needs no external grant signal. At each intermediate AR2, the local grant signals are sent out without any interference from its external grant signal, so that one gate delay is saved. The external grant signal is used only for governing the flag signal update. At each leaf AR2, the local grant signals have to be combined with the grant signals from the upper layers; this is done at the final stage, allowing local logical operations to be finished while waiting for the grant signals from upper layers, which minimizes the total arbitration time.

The PPA serves the inputs in an order [...] when n = 4, for instance, which is still round-robin if each input always has a packet to send and there is no conflict among the input request signals. Below we show its performance by simulations.

2.2 Performance Issues

We simulate a 32 x 32 switch under uniform traffic (the output address of each cell is equally distributed among all outputs) and under bursty traffic with a burst length of 10 cells. The bursty traffic can be used as a packet traffic model, with each burst representing a packet of multiple cells destined for the same output. The output address of each packet (burst) is also equally distributed among all outputs. The PPA is also used for request selection among the VOQs at each input port.

We compare the PPA with FIFO+RR (FIFO for input queuing and RR for round-robin arbitration), Output Queuing, iSLIP, and DRRM. FIFO+RR and Output Queuing serve as benchmarks (lower and upper bounds in terms of performance).

In iSLIP [8], each VOQ in the input buffer can send a request to an output arbiter. In other words, each input can send up to n requests to n arbiters, one for each. After the grant arbitration, an input may receive multiple grants, and another round of arbitration is needed to guarantee that at most one cell is selected at each input port.
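The request, grant, and accept rounds just described can be sketched as a single matching iteration. This is a simplified illustration in the spirit of iSLIP rather than a faithful reproduction of [8]: the function names are hypothetical, only one iteration is performed, and the pointer update rule used here (a pointer advances only when its grant is accepted) is an assumption that the text above does not spell out.

```python
from typing import List, Optional, Tuple


def rr_pick(candidates: List[bool], pointer: int) -> Optional[int]:
    """Round-robin choice: first active candidate at or after `pointer`."""
    n = len(candidates)
    for offset in range(n):
        i = (pointer + offset) % n
        if candidates[i]:
            return i
    return None


def request_grant_accept(
    voq: List[List[bool]], grant_ptr: List[int], accept_ptr: List[int]
) -> List[Tuple[int, int]]:
    """One request-grant-accept iteration on an n x n switch.

    voq[i][j] is True when input i holds a cell for output j.
    Returns the matched (input, output) pairs and updates both pointer lists.
    """
    n = len(voq)
    # Grant arbitration: each output picks one of the inputs requesting it.
    granted_input = [
        rr_pick([voq[i][j] for i in range(n)], grant_ptr[j]) for j in range(n)
    ]
    matches: List[Tuple[int, int]] = []
    for i in range(n):
        # Accept arbitration: an input may hold several grants and keeps one.
        offers = [granted_input[j] == i for j in range(n)]
        j = rr_pick(offers, accept_ptr[i])
        if j is not None:
            matches.append((i, j))
            # Assumed rule: pointers move one past the accepted partner.
            grant_ptr[j] = (i + 1) % n
            accept_ptr[i] = (j + 1) % n
    return matches


# Input 0 has cells for outputs 0 and 1; input 1 has a cell for output 0.
voq = [[True, True, False], [True, False, False], [False, False, False]]
grant_ptr, accept_ptr = [0, 0, 0], [0, 0, 0]
first = request_grant_accept(voq, grant_ptr, accept_ptr)
# The queues are left untouched, so the second call sees the same requests.
second = request_grant_accept(voq, grant_ptr, accept_ptr)
print(first, second)
```

In the first call input 0 receives two grants and accepts only one, which is the situation that makes the extra accept round necessary. In the second call the moved pointers let both inputs be matched.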
A cycle of iSLIP arbitration consists of five steps: (1) the input ports send multiple requests to the output arbiters; (2) the output arbiters perform the grant arbitration; (3) the output arbiters send grants to the input arbiters; (4) the input arbiters perform another arbitration to resolve possible multiple grants; and (5) the input arbiters send accept signals to the output arbiters.

In DRRM [10], an input arbiter at each input selects a non-empty VOQ according to the round-robin service discipline. After the selection, each input port sends one request, if any, to an output arbiter. An output arbiter at each output receives up to n requests, chooses one of them based on the round-robin service discipline, and sends a grant to the winning input port. The DRRM has four steps in a cycle: (1) each input arbiter performs request selection; (2) the input arbiters send requests to the output arbiters; (3) each output arbiter performs grant arbitration; and (4) the output arbiters send grant signals to the input arbiters.

Figure 2 shows the throughput and total average delay of the switch under various arbitration schemes, where a speedup of 1 or 2 is used. The PPA performs better than FIFO+RR but worse than iSLIP and DRRM when the speedup is 1; however, they all perform comparably when a speedup of 2 is used.

[Fig. 2. Comparison of the PPA with FIFO+RR, Output Queuing, iSLIP and DRRM: switch throughput and total average delay. Surviving panel titles: (c) c = 2 under uniform traffic; (d) c = 2 under bursty traffic.]

Recall that PIM [7] needs a random number generator for its decision process, which is difficult and expensive to implement at high speed. Both iSLIP and DRRM need to maintain a round-robin service list, which is also expensive to implement. The PPA, however, is simpler for high-speed implementation. Since all arbitrations are done in parallel, the overall arbitration time of an n-input switch is proportional to $\lceil \log_4 (n/2) \rceil$ when we group every two inputs or every two input groups at each layer, as will be described in the following section.

3 Implementation of the PPA

Multiple small arbiters can be recursively grouped together to form a large and multi-layer arbiter, as illustrated in Figure 1.
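Before turning to the circuits, the recursive grouping can be modelled in software. The sketch below is a behavioral model under stated assumptions, not the gate-level design: the class name `PingPongArbiter` and the heap-style node numbering are illustrative choices, the group size is fixed at two, and the hardware's parallel evaluation is replaced by a walk from the root down the granted path. Flags of AR2s off that path are left unchanged, which mirrors the rule that a flag is updated only when the external grant is valid.

```python
from typing import List, Optional


class PingPongArbiter:
    """Behavioral model of a binary tree of 2-input arbiters (AR2)."""

    def __init__(self, n: int) -> None:
        if n < 2 or n & (n - 1):
            raise ValueError("n must be a power of two, at least 2")
        self.n = n
        # One flag per AR2. Node 1 is the root; node v has children 2v and 2v+1.
        # False (LOW) favors the left input, True (HIGH) favors the right input.
        self.flag = [False] * n

    def arbitrate(self, requests: List[bool]) -> Optional[int]:
        n = self.n
        if len(requests) != n:
            raise ValueError("expected one request bit per input")
        # Group request signals: each node ORs the requests of its two children.
        req = [False] * n + list(requests)
        for v in range(n - 1, 0, -1):
            req[v] = req[2 * v] or req[2 * v + 1]
        if not req[1]:
            return None  # nobody is requesting
        v = 1
        while v < n:
            left, right = req[2 * v], req[2 * v + 1]
            # With two requests the flag decides; otherwise take the only one.
            go_right = self.flag[v] if (left and right) else right
            # Ping-pong: after a grant, the other side is favored next time.
            self.flag[v] = not go_right
            v = 2 * v + (1 if go_right else 0)
        return v - n  # leaf position converted to an input number


arb = PingPongArbiter(4)
order = [arb.arbitrate([True] * 4) for _ in range(8)]
print(order)
```

With all four inputs requesting in every cycle, this model grants inputs 0, 2, 1, 3 and then repeats, so every input is served once in each window of four cycles.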
Figure 3 depicts an n-input arbiter constructed by using p q-input arbiters (AR-q), from which the group request/grant signals are incorporated into a p-input arbiter (AR-p). Below we demonstrate constructing a 256-input arbiter starting from the basic units: 2-input arbiters.

[Fig. 3. Hierarchy of recursive arbitration with n = pq inputs]

3.1 2-input Arbiter (AR2)

Figure 4 shows a basic 2-input arbiter (AR2) and its logical circuits. The AR2 contains an internally fed-back flag signal, denoted by $F_i$, that indicates which input is favored (see Footnote 5). When all $G_g$ inputs are 1, indicating that these two input requests ($R_0$ and $R_1$) as a whole are granted by all the upper layers, once an input is granted in an arbitration cycle, the other input will be favored in the next cycle, as shown by the truth table in Figure 4(a). This mechanism is maintained by producing an output flag signal, denoted by $F_o$, that is fed back to the input. Between $F_o$ and $F_i$ there is a D flip-flop, which functions as a register forwarding $F_o$ to $F_i$ at the beginning of each cell time slot. When at least one of the $G_g$ inputs is 0, indicating that the group request of $R_0$ and $R_1$ is not granted at some upper layer(s), $G_0 = G_1 = 0$ and $F_o = F_i$, i.e., the flag is kept unchanged in order to preserve the original preference. As shown in Figure 4(b), the local grant signals have to be ANDed with the grant signals from the upper layers to provide full information on whether the corresponding input is granted or not. The $G_g$ inputs are added at the final stage to allow other local logical operations to be finished, in order to minimize the total arbitration time.

[Footnote 5: When the flag is LOW, $R_0$ is favored; when the flag is HIGH, $R_1$ is favored.]

[Fig. 4. (a) A 2-input arbiter (AR2) and its truth table; (b) its logical circuits]

3.2 4-input Arbiter (AR4)

A 4-input arbiter module (AR4) has four request signals, four output grant signals, one outgoing group request signal, and one incoming group grant signal. Figure 5(a) depicts our design of an AR4 constructed from three AR2s (two leaf AR2s and one intermediate AR2; all have the same circuitry), two 2-input OR gates, and one 4-input OR gate. Each leaf AR2 handles a pair of inputs and generates the local grant signals while allowing two external grant signals coming from upper layers: one from the intermediate AR2 inside the AR4 and the other from outside the AR4. These two signals directly join the AND gates at the final stage inside each leaf AR2 to minimize the delay. Denote $R_{ij}$ and $G_{ij}$ as the group request signal and the group grant signal between input i and input j. The intermediate AR2 handles the group requests ($R_{01}$ and $R_{23}$) and generates the grant signals ($G_{01}$ and $G_{23}$) to each leaf AR2 respectively. It contains only one grant signal, which comes from the upper layer, for controlling the flag signal.

3.3 16-input Arbiter (AR16)

As shown in Figure 5(b), an AR16 contains five AR4s in two layers: four at the lower layer handling the local input request signals and one at the higher layer handling the group request signals.

[Fig. 5. (a) A 4-input arbiter (AR4) and (b) a 16-input arbiter (AR16) constructed by five AR4s]

3.4 256-input Arbiter (AR256)

Figure 6 illustrates a 256-input arbiter (AR256) constructed from AR4s and its arbitration delay components. The path numbered from 1 to 11 shows the delay from when an input sends its request signal until it receives the grant signal. The first four gate delays (1-4) count the time for the input's request signal to pass through the four layers of AR4s and reach the root AR2, where one OR-gate delay is needed at each layer to generate the request signal [see Figure 5(a)]. The next three gate delays (5-7) count the time for the root AR2 to perform its arbitration [see Figure 4(b)]. The last four gate delays (8-11) count the time for the grant signals at upper layers to pass down to the corresponding input. The total arbitration time of an AR256 is then 11 gate delays. It thus follows that the arbitration time ($T_n$) of an n-input arbiter using such an implementation is

$$T_n = 2 \lceil \log_4 (n/2) \rceil + 3. \qquad (1)$$

[Fig. 6. Decomposition of arbitration delay in an AR256. Legend, as far as recoverable: OR-gate delay generating the request signal at each layer; gate delay of the root AR2; the last AND-gate delay in each AR2.]

4 A Terabit Crossbar Packet Switch Using PPA

In this section, we present a terabit crossbar packet switch using the PPA. Our design adopts the pipelining technique, separating the arbitration circuits from the data routing circuits, to enable the next round of arbitration to be performed in parallel with the current round of data transmission.

4.1 Crosspoint Unit

A crosspoint, the basic unit in a crossbar switch, corresponds to an input and output pair. As shown in Figure 7, it conceptually consists of two parts: a data crosspoint (DXP) and a multicast request crosspoint (MXP). The output of a DXP is controlled by the grant signal. It is LOW by default, and the crosspoint is then in the CROSS state, in which the vertical data will get through. If the grant signal turns HIGH, then the crosspoint is toggled and the horizontal data will get through.

[Fig. 7. The conceptual depiction of a crosspoint unit (Xunit)]

Note that the horizontal data ($D_h$) is broadcast to all DXPs, while the MP signal is cascaded between MXPs to facilitate shifting in the MP. When a new MP is shifted into the switch chip, the MP bit is stored in each corresponding MXP. In addition, we have a bit at the head of the MP signal, denoted by NMP (New MP), to indicate whether the MP signal is a new MP and thus to decide whether the MP should be accepted at the switch chip. After each arbitration, we update the request signal (i.e., the MP bit) for the next round. Depending on the NMP bit arriving at the beginning of the next arbitration cycle, we decide whether to use the new MP or the old updated one.

4.2 SW16 - a 16 x 16 crossbar switching chip with AR16s

Figure 8 shows a chip layout for a 16 x 16 switch. The communications between an input port controller (IPC) and an SW16 chip are through the following 6 lines:

- 4-line data broadcasting from the input port to all crosspoints on the same row;
- 1-line Multicast Pattern (MP), with the NMP bit at the head of the MP indicating whether it is a new MP;
- 1-line acknowledgement (ACK) signal with 2 bits from the chip to the input.

[Fig. 8. Layout of the SW16 chip (data bus in bold)]

The number of incoming and outgoing signal pins in the chip is 6 x 2 x 16 = 192. The two-bit ACK signal, (ACK1, ACK0), is generated by the handshaking circuits (HSC) in the SW16 chip. The signal specifications are as follows:

(ACK1, ACK0)   IPC Action    Description
00             do nothing    don't send cell nor MP
01             load cell     winning the contention
10             load MP       all MP bits are zero
11             load both

The first bit (ACK1) is used for transmitting the MP and the second (ACK0) for transmitting the cell. When building a large-scale switch, multiple SW16 chips are interconnected in a two-dimensional array.
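The ACK signal specification above amounts to a small lookup, which can be written out as a sketch. The names `ACK_ACTIONS` and `decode_ack` are hypothetical, and the code only restates the four table rows; it does not model the handshaking circuits themselves.

```python
from typing import Dict, Tuple

# (ACK1, ACK0) -> action taken by the input port controller, as in the table.
ACK_ACTIONS: Dict[Tuple[int, int], str] = {
    (0, 0): "do nothing",
    (0, 1): "load cell",
    (1, 0): "load MP",
    (1, 1): "load both",
}


def decode_ack(ack1: int, ack0: int) -> Tuple[bool, bool]:
    """Split a two-bit ACK into (load_mp, load_cell).

    ACK1 governs transmission of the multicast pattern (MP),
    ACK0 governs transmission of the cell.
    """
    if ack1 not in (0, 1) or ack0 not in (0, 1):
        raise ValueError("ACK bits must be 0 or 1")
    return bool(ack1), bool(ack0)


for bits, action in ACK_ACTIONS.items():
    print(bits, action, decode_ack(*bits))
```

Keeping the two bits independent is what lets one bit steer the MP and the other steer the cell without a combined state machine.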
Each IPC will receive multiple ACK signals, one from each SW16. The final decision on whether the HOL cell, or the MP of the cell next to the HOL, should be transmitted to the switch can easily be made by ORing the ACK0 bits or by ANDing the ACK1 bits from the SW16 chips on the same row.

4.3 A 256 x 256 switch with 1 Tb/s capacity

Consider building a 256 x 256 terabit multicast switch using a speedup of two and the bit-slicing technique. The chosen cell size is 64 bytes when calculating the time budget for arbitration. Each 64-byte cell is sliced into four 16-byte parts, which are handled in parallel by four switching planes. In each plane, 256 SW16 chips are arranged in a two-dimensional array, as shown in Figure 9. The input capacity per port in each plane is reduced to 5 Gb/s / 4 = 1.25 Gb/s. With a 4-bit wide bus for data signals, the switch operation rate is 1.25 Gb/s x 2/4, approximately 622 Mb/s. The layout of the 256 x 256 switch plane is shown in Figure 9. The switch consists of 16 x 16 = 256 SW16 chips. On top of these chips, we have 256 AR16s for higher-layer arbitrations. They can be grouped into chips and built separately, as shown in Figure 9. Alternatively, they can be distributed over all SW16 chips in the same column in order to minimize the number of chips.
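The figures quoted in this subsection can be checked with a few lines of arithmetic. The constant and function names below are illustrative, and the gate-delay helper assumes that Eq. (1) reads $T_n = 2 \lceil \log_4 (n/2) \rceil + 3$, a reading that reproduces the 11 gate delays quoted for n = 256.

```python
PORTS = 256            # switch size (256 x 256)
PORT_RATE_GBPS = 5.0   # per-port input rate before bit slicing
PLANES = 4             # each 64-byte cell is sliced over four planes
SPEEDUP = 2            # fabric speedup c
BUS_WIDTH_BITS = 4     # data lines per port in each plane
CHIP_SIZE = 16         # SW16 is a 16 x 16 chip

aggregate_tbps = PORTS * PORT_RATE_GBPS / 1000
plane_port_rate_gbps = PORT_RATE_GBPS / PLANES
operation_rate_mbps = plane_port_rate_gbps * SPEEDUP / BUS_WIDTH_BITS * 1000
chips_per_plane = (PORTS // CHIP_SIZE) ** 2


def gate_delays(n: int) -> int:
    """Arbitration time in gate delays under the assumed form of Eq. (1)."""
    layers = 0   # integer value of ceil(log4(n / 2))
    span = 2     # inputs covered by the root AR2 alone
    while span < n:
        span *= 4    # each additional AR4 layer multiplies the span by four
        layers += 1
    return 2 * layers + 3


print(aggregate_tbps, plane_port_rate_gbps, operation_rate_mbps)
print(chips_per_plane, gate_delays(256))
```

The per-line rate comes out at 625 Mb/s, which the text quotes as approximately 622 Mb/s, and the 256 ports at 5 Gb/s each give an aggregate slightly above 1 Tb/s.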

The total number of signal pins in each SW16 will then be increased by 16 x 2 = 32 to 224.

[Fig. 9. A plane of the 1 Tb/s crossbar structured multicast switch]

Data is identically broadcast from an input to all SW16 chips in the same row, while the multicast patterns sent to those SW16 chips are different. We introduce a SIC (switch interface circuit) between each IPC and a row of SW16 chips to handle the data broadcast while collecting and processing the ACK signals from the SW16 chips. The SICs can be either placed inside the switch plane or incorporated into the IPCs.

With four data lines, the transmission time for each cell is equal to 16 bytes / (4 bits/clock) = 32 clocks, which is the time budget for the arbitration and its pre- and post-processing. An arbitration cycle includes (1) shifting the multicast pattern; (2) arbitrating; and (3) feeding back acknowledgements. In our design, chips are assigned the MP directly. It takes just 17 bit clocks (including the NMP) for the MP to be shifted into a chip.
The arbitration time using the PPA is only 11 gate delays (see Figure 6) for the 256 x 256 switch, less than 5 ns using current CMOS technology. The circuitry for generating acknowledgements is very simple. The total arbitration and feedback delay is about a few clocks. Therefore, it takes about 22 clocks for one arbitration cycle, less than the 32 clocks required for transmitting a cell.

5 Conclusions

In this paper, we propose a fast ping-pong arbitration (PPA) scheme for output contention resolution in large-capacity input-output queued switches, which aims at maximizing the switch throughput and delay performance in support of multimedia services with various QoS requirements. The basic idea is to divide the inputs into groups and apply arbitration recursively. Our recursive arbiter is hierarchically structured, consisting of multiple small-size arbiters at each layer. The arbitration time of an n-input switch is proportional to $\lceil \log_4 (n/2) \rceil$ when we group every two inputs or every two input groups at each layer. We present a 256 x 256 terabit crossbar multicast packet switch using the PPA. The design shows that our scheme can reduce the arbitration time of the 256 x 256 switch to 11 gate delays, less than 5 ns using current CMOS technology, demonstrating that arbitration is no longer the bottleneck limiting the switch capacity.

Acknowledgement

We would like to thank Dr. Jin-Soo Park for providing the simulation results.

References

[1] F. A. Tobagi, "Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks," Proceedings of the IEEE, 78(1), January.
[2] S. Keshav and R. Sharma, "Issues and Trends in Router Design," IEEE Communications Magazine, May.
[3] V. P. Kumar, T. V. Lakshman, and D. Stiliadis, "Beyond Best Effort: Router Architectures for the Differentiated Services of Tomorrow's Internet," IEEE Communications Magazine, May.
[4] M. Karol, M. Hluchyj, and S. Morgan, "Input versus output queueing on a space division switch," IEEE Trans. Comm., 35(12).
[5] M. Karol and M. Hluchyj, "Queueing in high-performance packet-switching," IEEE J. Select. Area in Comm., Vol. 6, December.
[6] Y. Tamir and H-C. Chi, "High performance multi-queue buffers for VLSI communication switches," Proc. of 15th Ann. Symp. on Comp. Arch., June.
[7] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High speed switch scheduling for local area networks," ACM Trans. Computer Systems, November.
[8] N. McKeown, P. Varaiya, and J. Walrand, "Scheduling cells in an input-queued switch," IEE Electronics Letters, 29(25), December.
[9] N. McKeown, V. Anantharam, and J. Walrand, "Achieving 100% Throughput in an Input-queued Switch," Proc. IEEE INFOCOM.
[10] H. Jonathan Chao and J. S. Park, "Centralized contention resolution schemes for a large-capacity optical ATM switch," in Proc. IEEE ATM Workshop, Fairfax, VA, May.
[11] R. E. Tarjan, Data Structures and Network Algorithms, Bell Labs.
[12] R. Guerin and K. N. Sivarajan, "Delay and Throughput Performance of Speed-up Input-queuing Packet Switches," IBM Research Report RC 20892, June.
