A Novel Feedback-based Two-stage Switch Architecture


Kwan L. Yeung and N. H. Liu
Dept. of Electrical and Electronic Engineering
The University of Hong Kong, Pokfulam, Hong Kong

Abstract

A load-balanced two-stage switch eliminates the scheduler, is scalable, and provides close to 100% throughput. Its major problem is that packets can be mis-sequenced. Aiming to prevent packets from being received out of sequence at the outputs, we provide an elegant solution to the mis-sequencing problem based on a novel two-stage switch architecture with feedback. The feedback path is constructed by judiciously selecting and coordinating the two sequences of N deterministic configurations used in the two switch stages, such that if middle-stage port j is connected to output k in slot t, then input k is connected to middle-stage port j in slot t+1. With a single packet buffer at each middle-stage VOQ, we show that the packet mis-sequencing problem is naturally solved. With this feedback architecture, middle-stage port j piggybacks an N-bit occupancy vector (1 bit for each VOQ) on the packet sent to output k in each time slot. As output k and input k reside on the same line-card, input k selects a packet for sending in the next time slot based on the occupancy vector received, a procedure known as port-based scheduling. Four simple port-based scheduling algorithms are proposed for load balancing in the first-stage switch. We show that they provide an unbeatable overall delay-throughput performance under various traffic conditions.

I. INTRODUCTION

With the continuous growth of bandwidth in fiber links, high-speed switches/routers are urgently needed to keep pace with the increased transmission rate. Recently, a novel two-stage switch architecture was proposed [7] based on the concept of load balancing. It consists of two switch fabrics in tandem (Fig. 1), and each fabric is configured following a deterministic and periodic sequence of N configurations.
The only requirement is that each input is connected to each output exactly once in the sequence. There are many ways to generate such a sequence. As an example, a sequence can be constructed by cyclically shifting the set of input/output connections used in each time slot, such that at time slot t, input i (for i = 0, 1, 2, ..., N-1) is connected to output j, where j is given by

$$ j = (i + t) \bmod N. \qquad (1) $$

Each value of t corresponds to a configuration. Varying t from 0 to N-1 gives the required sequence of N configurations. Since the sequence is pre-determined, a scheduler [1][2] for finding the best configuration on a slot-by-slot basis is not required. This eliminates both the computation overheads and the communication overheads [3][4][5][6].

In a two-stage switch, the first switch fabric is responsible for load balancing and the second fabric delivers packets based on their destination ports. The goal of load balancing is to make the traffic seen by the second switch fabric as evenly distributed as possible. In the original two-stage switch design [7], a packet is sent to the currently connected output of the first switch fabric (called a middle-stage output/port) as soon as it arrives. No input port buffers/queues are therefore required, and load balancing relies purely on the connection patterns obtained from the deterministic sequence of configurations. In [7], it is proved that such a basic load balancing scheme can already guarantee 100% throughput for a broad class of incoming traffic. The major drawback of this two-stage approach is that packets can be mis-sequenced when they arrive at output ports (of the second switch fabric). This is because packets of the same flow (i.e. packets from the same input port to the same output port) are distributed to different middle-stage ports and thus experience different amounts of delay.
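The cyclic-shift construction of (1) can be illustrated with a short Python sketch. The function names (`cyclic_config`, `cyclic_sequence`, `covers_every_pair_once`) are illustrative choices, not from the paper; the check simply confirms the stated requirement that each input meets each output exactly once over the N configurations.

```python
def cyclic_config(t, n):
    """Configuration for slot t per Eq. (1): input i is connected to (i + t) mod n."""
    return [(i + t) % n for i in range(n)]


def cyclic_sequence(n):
    """The periodic sequence of n configurations obtained by varying t from 0 to n-1."""
    return [cyclic_config(t, n) for t in range(n)]


def covers_every_pair_once(seq):
    """True if every configuration is a permutation and every (input, output)
    pair appears exactly once across the whole sequence."""
    n = len(seq)
    seen = set()
    for config in seq:
        if sorted(config) != list(range(n)):
            return False  # some output is used twice (or not at all) in this slot
        seen.update((i, j) for i, j in enumerate(config))
    # n configurations contribute n*n pairs in total, so n*n distinct pairs
    # means no pair was repeated.
    return len(seen) == n * n
```

For N = 4, `cyclic_config(1, 4)` gives `[1, 2, 3, 0]`, i.e. every input is shifted by one port relative to slot 0.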
A simple approach to mis-sequencing is to re-order the packets at the output ports using re-sequencing buffers. (Re-sequencing buffers are not shown in Fig. 1.) With the original two-stage switch architecture [7], packets can be mis-sequenced by an arbitrary amount, so a finite re-sequencing buffer is not possible. Efforts are made in [8][9] to bound the delay at additional cost: N writes to memory in one time slot in [8], and a very complicated re-sequencing buffer design (such as 3-dimensional queues) in [9].

A proactive approach is to prevent packets from becoming mis-sequenced in the first place. A major advantage is that no re-sequencing buffers are needed and no re-sequencing delay is incurred. The Full Frames First (FFF) algorithm [10] is the first one thus designed, but it requires heavy state information exchange among the switch line-cards. In [11], Full Ordered Frames First (FOFF) is proposed to replace FFF. However, FOFF compromises the original goal of preventing packet mis-sequencing: re-sequencing buffers, though finite, are added to each output port for reordering packets. Another good attempt to prevent mis-sequencing is the Mailbox switch [12] in its basic form. By using a set of symmetric configurations in both stages of switch fabrics, a feedback path for reporting middle-stage packet departure times is created. Based on it, the next packet in the flow is dispatched and inserted in a middle-stage VOQ (virtual output queue) such that it departs no earlier than the previous packet of the same flow. Compared with the FFF algorithm, the Mailbox switch trades switch throughput (75%) for simplicity. However, like FFF, the Mailbox switch is extended to allow mis-sequencing for the sake of higher throughput (95%). Some further extension is made in [13] to study the amount of buffering to be placed at inputs, outputs, and middle-stage ports.

Despite the compromises made by earlier attempts [10][12], we believe preventing packet mis-sequencing is the way to go. In this paper, we propose an elegant solution based on a novel feedback mechanism, enabled by properly constructing and coordinating the two sequences of N configurations used in the two stages of switch fabrics. Specifically, in each time slot the two configurations used at the two switch fabrics must satisfy the following condition: if middle-stage port j is connected to output k in slot t, then input k is connected to middle-stage port j in slot t+1. Together with a single packet buffer at each middle-stage VOQ, we show that the packet mis-sequencing problem is eliminated. With the proposed feedback architecture, middle-stage port j piggybacks an N-bit occupancy vector (1 bit for each VOQ) on the packet sent to output k in each time slot. Because each pair of input k and output k resides on the same switch line-card, the occupancy vector that arrives at output k is available to input k at negligible cost. Therefore, input k can select a packet for sending in the next time slot based on the occupancy vector; this is called port-based scheduling. Four simple port-based scheduling algorithms are proposed for load balancing in the first-stage switch fabric. The idea is to forward just enough packets to the middle-stage ports, such that neither overflow nor underflow will ever occur.

In the next section, the importance of having an efficient feedback mechanism in load balancing is discussed. In Section III, our feedback-based two-stage switch architecture is presented. Four simple port-based scheduling algorithms are designed in Section IV. Then the delay-throughput performance based on the proposed two-stage switch architecture is studied in Section V.
We conclude the paper in Section VI.

Fig. 1: A load-balanced two-stage switch.

II. LOAD BALANCING & FEEDBACK

Load balancing is of paramount importance in two-stage switch design. The lack of an efficient feedback mechanism leads to poor load balancing performance, which in turn hurts the overall switch performance. Switch performance is measured by both average packet delay and throughput. Existing work focuses more on improving throughput, and sometimes delay is (unconsciously) traded for higher throughput. We elaborate on these issues in this section based on the two-stage switch architecture shown in Fig. 1. It is equipped with VOQs at both input and middle-stage ports, denoted by VOQ1 and VOQ2, respectively. We use VOQ1(i,k) to represent the VOQ at input port i with packets destined for output k. Similarly, VOQ2(j,k) denotes the VOQ at middle-stage port j with packets destined for output k. We define flow f_ik as the packets arriving at input i and destined for output k. Packets from f_ik are stored in VOQ1(i,k). Packets destined for output k (which may come from different input ports/flows) are placed in VOQ2(j,k) for j = 0, 1, ..., N-1.

The well-received goal of load balancing is to make the traffic seen by the second switch fabric as evenly distributed as possible. This is usually interpreted as making the queue sizes of all VOQ2(j,k)s as equal as possible [9][10]. Due to the lack of feedback, each input port keeps track of its own sending history and, based on that, the load balancing schemes in [9][10] try to keep the cumulative numbers of packets sent to each middle-stage port for a given flow within one packet of each other. The problem with such per-flow-based load balancing is that each VOQ2(j,k) is shared by all flows destined to output k.
Since there is no coordination among different flows/inputs, the queue sizes of the N VOQ2(j,k)s, for j = 0, 1, ..., N-1, can differ by as many as N packets, let alone the possible discrepancy among all the N^2 VOQ2(j,k)s, for j, k = 0, 1, ..., N-1. (To see the N-packet difference, assume each input sends a single packet to output k via the same middle-stage port j. Then VOQ2(j,k) contains N packets while the others, VOQ2(j',k) for j' ≠ j, have 0-occupancy.) If we have a feedback mechanism that allows us to know the size of each VOQ2(j,k) before sending a packet to it, we can surely do a much better job.

Here we branch out to take a different view on the goal of load balancing. We believe that load balancing for a two-stage switch is better interpreted as avoiding both overflow and underflow at every VOQ2(j,k). Overflow is definitely a bad thing as it causes packet dropping/loss and retransmission. Avoiding underflow means that if there are packets waiting in some input ports for a particular output k, then VOQ2(j,k) should not be empty at the time that middle-stage port j (thus VOQ2(j,k)) is connected to output k. (Note that we know in advance when a middle-stage port will be connected to a particular output because the sequence of N configurations is pre-determined.) Preventing underflow ensures the two-stage switch is work-conserving. This in turn helps to enable 100% throughput in the second-stage switch fabric.

Because it cannot balance queue sizes accurately, per-flow-based load balancing [9][10] suffers from the underflow problem, which intensifies if the incoming traffic is skewed. To ease the situation, inputs are encouraged to move more packets to the middle-stage ports (accordingly the buffer size of each VOQ2(j,k) must be increased), in the hope of boosting the overall queue occupancy so as to avoid underflow at a few queues. This can enhance the throughput performance, as can be seen from [12][13]. Unfortunately, the delay performance is sacrificed. Delay performance can be adversely affected in two ways.

First, with the deterministic sequence of N configurations, each additional packet in a VOQ2(j,k) means this packet will experience an additional delay of N slots. Second, the longer the packets stay in the middle-stage ports, the more severe the mis-sequencing problem is. Consequently, a larger re-sequencing buffer/delay is required at output ports.

Taking the delay performance into account, we would like to stress that the purpose of the buffers at each VOQ2(j,k) is to meet the load balancing goal (no overflow and no underflow), not to increase the throughput (as 100% throughput is already guaranteed if the load is balanced). In other words, if the underflow problem is solved, (extra) packets should be stored at input ports rather than at the middle stage. Note that a similar concept is adopted in designing Internet active queue management schemes, where buffers at a router are for absorbing bursts, not for increasing throughput.

This brings in another issue: how much buffer is enough? The answer depends on how the goal of no overflow and no underflow can be accomplished. This, in turn, depends on whether we have a feedback mechanism in place for letting each input port know the current queue status/sizes at its connected middle-stage port. Last but not least, this feedback mechanism must be simple enough, or it easily defeats the original purpose of designing two-stage switches.

In the next section, an efficient feedback mechanism is designed. With it, we show that a single packet buffer at each VOQ2(j,k) is sufficient to prevent both underflow and overflow. A side advantage is that an N-bit occupancy vector is enough to report the status of the N VOQs at each middle-stage port. Some other benefits/insights can be seen from the following scenario. Assume we have perfect knowledge of middle-stage queue occupancy, and the sequence of configurations used in the first-stage switch fabric is obtained from (1).
In time slot t, if packet A from flow f_ik is delivered to join VOQ2(j,k) at the second position in the queue, packet A has to wait for N additional slots after the head-of-line (HOL) packet is delivered to output k. (On average, packets queued at the next-to-the-HOL position experience a middle-stage delay of 3N/2 slots for uniform traffic.) Instead of dispatching packet A to VOQ2(j,k), we can keep it in VOQ1(i,k) for possible delivery to VOQ2(j+1,k) in slot t+1. If packet A joins VOQ2(j+1,k) at the HOL position, it can be delivered to output k in less than N slots (on average N/2 slots), which is much shorter than joining VOQ2(j,k). Similarly, if joining VOQ2(j+1,k) at the HOL position fails, we can try VOQ2(j+2,k) in slot t+2. This process repeats until packet A can be delivered.

Following this approach of only sending a packet to join at the HOL position of each VOQ2(j,k), each VOQ2(j,k) only needs a single packet buffer. At the same time, there are N chances for a packet (A) to experience less than 3N/2 slots in middle-stage ports. In order not to underutilize the first-stage switch fabric, when packet A is not sent in slot t (because either the HOL position is taken, or packet A is not selected for sending), another packet (say, destined for k') can be sent from input i to join at the HOL position of VOQ2(j,k'). Since each input has up to N packets/queues to choose from, the chance of being able to send a packet is very high. (Otherwise, this may hurt the throughput.) We call the algorithm for selecting a packet to send (based on the middle-stage port occupancy) a port-based scheduling algorithm. Port-based scheduling is much simpler than a general scheduling algorithm [1][2], because it is decentralized and no determination of switch configurations is required.

III. FEEDBACK PATH DESIGN

Our proposed feedback mechanism is enabled by properly constructing and coordinating the two deterministic sequences of N configurations used in the two stages of switch fabrics, such that if middle-stage port j is connected to output port k at time t, then at time (t+1) input port k is connected to the same middle-stage port j. We call it the staggered symmetry property (Property 1). Then, by exploiting the (implementation) fact that each pair of input k and output k resides on the same switch line-card, the VOQ status at middle-stage port j can be piggybacked onto the packet sent to output k, which is then made available to input k at negligible cost.

Fig. 2: A joint sequence for a 4 x 4 two-stage switch.

A. Two Sequences of N Configurations

For an N x N switch, there are N! configurations. They can be divided into (N-1)! sets of N configurations, such that each set meets the requirement that each input connects to each output exactly once. Eqn. (1) provides a way to find such a set. Let π_t denote the configuration found from (1) in slot t. Then [π_0, π_1, π_2, ..., π_{N-1}] denotes the resulting sequence of N configurations. Without loss of generality, we adopt this sequence in our first-stage switch shown in Fig. 1.

The same set of N configurations is used in the second-stage switch, but in a different order/sequence for providing the necessary feedback path (i.e. Property 1). In this case, a sequence in the reverse order of that in the first stage, or [π_{N-1}, π_{N-2}, π_{N-3}, ..., π_0], is used. Specifically, at time t (for 0 ≤ t < N), middle-stage port j is connected to output k, where k is given by

$$ k = (j + N - 1 - t) \bmod N. \qquad (2) $$

If t is inside [xN, (x+1)N), set t = t - xN before applying (2).
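As a quick sanity check of how (1) and (2) fit together, the following Python sketch enumerates both stages and tests the staggered symmetry condition stated above by brute force. The helper names (`first_stage`, `second_stage`, `has_staggered_symmetry`) are hypothetical, and the sketch is a numerical check under the paper's equations, not a substitute for a proof.

```python
def first_stage(i, t, n):
    """Eq. (1): middle-stage port reached by input i in slot t."""
    return (i + t) % n


def second_stage(j, t, n):
    """Eq. (2): output reached by middle-stage port j in slot t.
    t is first reduced into [0, n), as the text prescribes."""
    return (j + n - 1 - (t % n)) % n


def has_staggered_symmetry(n):
    """Check: if middle port j -> output k in slot t,
    then input k -> middle port j in slot t + 1."""
    for t in range(2 * n):          # two full periods, to cover the wrap-around
        for j in range(n):
            k = second_stage(j, t, n)
            if first_stage(k, t + 1, n) != j:
                return False
    return True
```

With N = 4, `second_stage(0, t, 4)` for t = 0, 1, 2, 3 yields outputs 3, 2, 1, 0, so a given middle-stage port visits the outputs in descending order of port number.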

Combining the configurations/sequences in both stages, our proposed two-stage switch is configured according to the joint sequence [π_0 π_{N-1}, π_1 π_{N-2}, π_2 π_{N-3}, ..., π_{N-1} π_0]. Before we formally prove that this joint sequence has the staggered symmetry property (Property 1), let us first consider the example shown in Fig. 2 for N = 4 and a single packet buffer per VOQ2. (For every pair of staggered configurations in the joint sequence, namely π_{N-1} & π_1, π_{N-2} & π_2, ..., π_0 & π_0, the two configurations in the pair are always symmetric to each other with respect to the column of middle-stage ports, as can be observed from Fig. 2.) At t = 0, middle-stage port 0 is connected to output 3. A packet is sent from VOQ2(0,3) to output 3, together with a piggybacked 4-bit queue occupancy vector. (Also refer to Fig. 3.) This feedback arrives at output 3 at the end of slot t = 0. Since output 3 and input 3 are collocated on the same line-card, this feedback is available to input 3 at the beginning of slot t = 1. Based on it, input 3 selects a packet for sending to middle-stage port 0 using a port-based scheduling algorithm.

Fig. 3: Timing diagram showing the operation of the two-stage switch architecture with feedback.

B. Properties of the Joint Sequence

We prove that a two-stage switch with the proposed joint sequence [π_0 π_{N-1}, π_1 π_{N-2}, π_2 π_{N-3}, ..., π_{N-1} π_0] and a single packet buffer at every VOQ2(j,k) has the following important properties.

Property 1 (Staggered Symmetry). If middle-stage port j is connected to output port k at time t, then at time (t+1) input port k is connected to the same middle-stage port j.

Proof: At time t, middle-stage port j is connected to output k, where k is given by (2). We need to show that at time t+1, if input i is connected to the same middle-stage port j, then we must have k = i. At time t+1, we have j = (i+t+1) mod N from (1). Substituting j into (2), we get

$$ k = \big[ (i + t + 1) \bmod N + N - 1 - t \big] \bmod N. $$

If (i+t+1) < N, then

$$ k = (i + t + 1 + N - 1 - t) \bmod N = (i + N) \bmod N = i. $$

If (i+t+1) ≥ N, we have

$$ k = (i + t + 1 - N + N - 1 - t) \bmod N = i. $$

Combining the above two cases, k = i is true. #

Property 2 (Anchor Output). Input i is always connected to output K, where K = (i+N-1) mod N, via one of the middle-stage ports. We call K the anchor output of input i.

Proof: At time t, input i is connected to output k via middle-stage port j. Substituting j = (i+t) mod N from (1) into (2), we can express k in terms of i:

$$ k = \big[ (i + t) \bmod N + N - 1 - t \big] \bmod N = (i + N - 1) \bmod N = K. $$

We can see that K depends only on i. So a given input i is always connected to the same anchor output K. #

In Fig. 2, input 0/1/2/3 has anchor output 3/0/1/2, via one of the middle-stage ports in each time slot.

Property 3 (Deterministic Delay in Middle-stage Ports). Let K be the anchor output of input i. Every packet of flow f_ik experiences the same delay of d slots in one of the middle-stage ports, where d is given by

$$ d = \begin{cases} N, & \text{if } K = k \\ K - k, & \text{if } K > k \\ K + N - k, & \text{if } K < k \end{cases} $$

Proof: Suppose at slot t, input i is connected to its anchor output K via middle-stage port j. If a packet (A) is successfully sent from input i to join VOQ2(j,k), then VOQ2(j,k) must either be empty at slot t-1, or the packet originally in it at t-1 (B) is sent to output k in slot t. In the latter case, both packets A and B belong to the same flow f_ik (that is why they join the same VOQ2(j,k)), and with the same anchor output k = K. It takes N slots/configurations for port j to be re-connected to output K for delivering packet A. Because of the single packet buffer per VOQ2, packet A is the only packet in VOQ2(j,k) and it will be delivered for sure after N slots.
This gives the maximum delay of N slots. Note that a given middle-stage port j is connected to each output in descending order of the output port numbers. This can be seen from Fig. 2, where middle-stage port 0 is connected to outputs 3, 2, 1, and 0 at slots t = 0, 1, 2, and 3 respectively. If packet A's target output k is K-1, middle port j will be connected to output K-1 after one slot, which is the minimum delay that a packet (A) can experience in VOQ2(j,k). In general, if packet A's target output k is d ports away from the anchor output K (counted in descending order of port numbers), i.e. d = K-k if K > k, and d = K+N-k if K < k, the delay packet A experiences in VOQ2(j,k) is d slots. Obviously, the value of d lies in [1, N]. #

Assume traffic is uniformly distributed such that each input has the same input load and each packet is uniformly distributed to all outputs. Since d then takes each value in 1, 2, ..., N with equal probability, the average packet delay at middle-stage ports is (1+N)/2 slots.

Property 4 (In-order Packet Delivery). If a two-stage switch is configured according to the joint sequence [π_0 π_{N-1}, π_1 π_{N-2}, π_2 π_{N-3}, ..., π_{N-1} π_0], and every VOQ2(j,k) has a single packet buffer, in-order packet delivery is guaranteed.

Proof: Assume that at times t_A and t_B (where t_B > t_A), packets A and B of flow f_ik are delivered from VOQ1(i,k) and join VOQ2(j_1,k) and VOQ2(j_2,k) respectively. Let d_A and d_B be the delays experienced in VOQ2 by the two packets. Mis-sequencing occurs only if packet B arrives at output k earlier than packet A, i.e. t_A + d_A > t_B + d_B. However, this will never happen because t_B > t_A and d_A = d_B from Property 3. #

Note that if more than one packet buffer is allocated to each middle-stage VOQ2(j,k), in-order packet delivery may not be guaranteed.

IV. PORT-BASED SCHEDULING ALGORITHMS

Assume input i is connected to middle-stage port j at slot t, and the anchor output of i is K. Based on the N-bit occupancy vector piggybacked from middle-stage port j in slot t-1, we find S_j, the set of VOQ2(j,k) (for k = 0, 1, ..., N-1) with 0-occupancy. Input i chooses the HOL packet at VOQ1(i,h) for sending only if h ∈ S_j and VOQ1(i,h) is not empty. This prevents overflow at VOQ2(j,h). Since middle port j is connected to each output in descending order of the output port numbers, we know that in the next slot t+1 port j will be connected to output K-1 (wrapped around by N). If VOQ2(j,K-1) is empty and VOQ1(i,K-1) is not, we face a possible underflow in VOQ2(j,K-1) at slot t+1. As such, a scheduling algorithm should give priority to VOQ1(i,K-1) at slot t. Four simple scheduling algorithms are designed with the above considerations in mind.

(1) Round-Robin (RR): If VOQ1(i,h') was selected in slot t-1, then VOQ1(i,h'+1) is selected if h'+1 ∈ S_j. Otherwise, the first output port h with h > h' and h ∈ S_j is selected. RR gives fair access to each VOQ1, and is suitable for hardware implementation [5].
(2) Optimized Round-Robin (Opt-RR): If VOQ2(j,K-1) is empty and VOQ1(i,K-1) is not, VOQ1(i,K-1) is selected. Otherwise, use RR. Opt-RR is enhanced to avoid underflow.
(3) Longest Queue First (LQF): Among all VOQ1(i,h)s with h ∈ S_j, the one with the longest queue is selected. LQF is good for non-uniform traffic. An efficient implementation of (quasi) LQF is available [14].
(4) Optimized Longest Queue First (Opt-LQF): If VOQ2(j,K-1) is empty and VOQ1(i,K-1) is not, VOQ1(i,K-1) is selected. Otherwise, use LQF.

V. PERFORMANCE EVALUATIONS

In this section, the delay-throughput performance of our proposed scheduling algorithms is studied by simulations. For comparison, we also simulate a) the LQF algorithm for the Byte-Focal switch (LQF_Byte-Focal) [9], because it outperforms FOFF [11]; b) the iSLIP algorithm [5] (with a single iteration), which serves as a benchmark for single-stage input-queued switches; and c) an output-queued switch, which serves as a lower bound. Due to space limitations, we only present the results for switch size N = 32.

A. Uniform Traffic

At each time slot and for each input, a packet arrives with probability p and is destined to each output with equal probability. p is called the input load. Fig. 4 shows the delay-throughput performance of all 7 algorithms. Among them, the four port-based scheduling algorithms we proposed, RR, Opt-RR, LQF, and Opt-LQF, give comparable delay performance of less than 20 slots up to p = 0.8. In fact, their average packet delay at middle-stage ports, the minimum cost for two-stage switches, can be easily calculated as (1+N)/2 = 16.5 slots. If we deduct this portion from the overall delay experienced, we can see that the (input port) delay of our algorithms matches output-queued performance very well. We also notice that the performance gain of using Opt-RR and Opt-LQF is marginal because the possible underflow problem mentioned in Section IV is negligible. Compared with LQF_Byte-Focal, our algorithms give significantly smaller delay. Compared with iSLIP, our algorithms produce a smaller delay when p is large (>0.6).
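To make the uniform-traffic setup concrete, here is a minimal slot-level Python sketch of the feedback-based switch with a round-robin port-based scheduler. It is an idealized model under several assumptions that are not taken from the paper: departures in a slot are processed before that slot's first-stage transfers, each input sees the exact post-departure occupancy of its connected middle-stage port, a packet may be forwarded in the slot it arrives, and the round-robin pointer simply advances past the last queue served. All names (`simulate`, `property3_delay`, `rr_ptr`) are illustrative.

```python
import random
from collections import deque


def property3_delay(i, k, n):
    """Middle-stage delay of flow (i, k) according to Property 3."""
    anchor = (i + n - 1) % n                # anchor output K of input i
    d = (anchor - k) % n
    return n if d == 0 else d


def simulate(n=8, load=0.6, slots=3000, seed=1):
    """Return (in_order, delays) where delays holds (input, output, slots_in_middle)."""
    rng = random.Random(seed)
    voq1 = [[deque() for _ in range(n)] for _ in range(n)]  # voq1[i][k]: FIFO of seq numbers
    voq2 = [[None] * n for _ in range(n)]                   # voq2[j][k]: single-packet buffer
    next_seq = [[0] * n for _ in range(n)]                  # next sequence number per flow
    expected = [[0] * n for _ in range(n)]                  # next seq expected at the output
    rr_ptr = [0] * n                                        # round-robin pointer per input
    delays, in_order = [], True

    for t in range(slots):
        # Bernoulli arrivals with uniformly chosen destinations.
        for i in range(n):
            if rng.random() < load:
                k = rng.randrange(n)
                voq1[i][k].append(next_seq[i][k])
                next_seq[i][k] += 1

        # Second stage, Eq. (2): middle port j delivers to output k.
        for j in range(n):
            k = (j + n - 1 - t) % n
            if voq2[j][k] is not None:
                src, seq, t_in = voq2[j][k]
                voq2[j][k] = None
                delays.append((src, k, t - t_in))
                if seq != expected[src][k]:
                    in_order = False
                expected[src][k] = seq + 1

        # First stage, Eq. (1): input i may send one packet to middle port j,
        # but only into an empty VOQ2(j, h) (no overflow).
        for i in range(n):
            j = (i + t) % n
            for step in range(n):
                h = (rr_ptr[i] + step) % n
                if voq1[i][h] and voq2[j][h] is None:
                    voq2[j][h] = (i, voq1[i][h].popleft(), t)
                    rr_ptr[i] = (h + 1) % n
                    break
    return in_order, delays
```

In this model every recorded middle-stage delay can be compared against `property3_delay`, and the `in_order` flag reports whether any flow was ever delivered out of sequence. Averaging the third field of `delays` gives a sample estimate of the middle-stage delay, which under uniform traffic would be expected to sit near (1+N)/2.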
At p = 0.7, LQF_Byte-Focal requires 120 slots (not shown) and iSLIP requires 88 slots, whereas our algorithms require only 17 slots.

Fig. 4: Delay (time slots) vs. input load p, with uniform traffic. Curves: LQF, Opt-LQF, RR, Opt-RR, output-queued, iSLIP, LQF_Byte-Focal.

B. Uniform Bursty Traffic

Bursty arrivals are modeled by the ON/OFF traffic model. In the ON state, a packet arrival is generated in every time slot. In the OFF state, no packet arrivals are generated. Packets of the same burst have the same output, and the output for each burst is uniformly distributed. Given an average input load of p and an average burst size of s, the state transition probability from OFF to ON is p/[s(1-p)] and that from ON to OFF is 1/s. Fig. 5 shows the delay-throughput performance with mean burst size s = 30 packets. Due to the bursty nature of the traffic, delay builds up quickly with input load. We can still observe that our four proposed algorithms generally perform better than LQF_Byte-Focal and iSLIP. At p = 0.8, LQF_Byte-Focal requires 224 slots, compared with 180 slots for RR and Opt-RR, 168 for LQF, 162 for Opt-LQF, and 114 for output-queued.
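The ON/OFF source described above can be sketched in a few lines of Python. This is only an illustration of the two stated transition probabilities; the function names (`on_off_probs`, `on_off_arrivals`) and the convention that the state transition is evaluated at the start of each slot are assumptions made here, not details given in the paper.

```python
import random


def on_off_probs(p, s):
    """Transition probabilities (OFF->ON, ON->OFF) for load p and mean burst size s."""
    return p / (s * (1 - p)), 1 / s


def on_off_arrivals(p, s, n_outputs, slots, seed=0):
    """Per-slot arrivals for one input: the destination output while ON, None while OFF."""
    rng = random.Random(seed)
    p_on, p_off = on_off_probs(p, s)
    on, dest = False, None
    arrivals = []
    for _ in range(slots):
        if on:
            if rng.random() < p_off:
                on = False                       # the burst ends
        elif rng.random() < p_on:
            on = True                            # a new burst starts
            dest = rng.randrange(n_outputs)      # one uniformly chosen output per burst
        arrivals.append(dest if on else None)
    return arrivals
```

With these two probabilities the stationary fraction of ON slots of the two-state chain is p_on / (p_on + p_off) = p, and the ON period is geometric with mean s, which matches the stated average load and burst size. For p = 0.8 and s = 30, the OFF-to-ON probability works out to 0.8 / (30 x 0.2), roughly 0.133.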

Fig. 5: Delay (time slots) vs. input load p, with uniform bursty traffic and s = 30. Curves: LQF, Opt-LQF, RR, Opt-RR, output-queued, iSLIP, LQF_Byte-Focal.

C. Hot-spot Traffic

Packets arriving at each input port in each time slot follow the same independent Bernoulli process with probability p. Packet destinations are generated as follows. For input port i, a packet goes to output i with probability 1/2, and goes to each of the other outputs with the same probability 1/[2(N-1)]. From Fig. 6, we can again see that our four proposed algorithms give similar superior delay performance, with a consistent delay gap of about 24 slots (which is the delay experienced at middle-stage ports) above the output-queued performance over all input loads.

Fig. 6: Delay (time slots) vs. input load p, with hot-spot traffic. Curves: LQF, Opt-LQF, RR, Opt-RR, output-queued, iSLIP, LQF_Byte-Focal.

VI. CONCLUSIONS

A load-balanced two-stage switch can eliminate the scheduler, is scalable, and guarantees 100% throughput for a broad class of traffic. The major problem is that packets can be mis-sequenced. Aiming at preventing packets from becoming mis-sequenced, a novel feedback mechanism was designed in this paper for collecting the VOQ status in the middle-stage ports. Based on the collected information, input ports forward just enough packets to the middle-stage ports, so as to avoid both buffer overflow and underflow. Combined with a single packet buffer per middle-stage VOQ, we not only provided an elegant solution to the packet mis-sequencing problem, but also achieved an unbeatable delay-throughput switch performance.

References

[1] Y. Tamir and G. L. Frazier, "High-performance multi-queue buffers for VLSI communications switches," Proceedings of the 15th Annual Symposium on Computer Architecture, pp , June
[2] N. McKeown, V. Anantharam, and J. Walrand, "Achieving 100% throughput in an input queued switch," Proceedings of INFOCOM, April 1996, San Francisco, USA.
[3] T. Anderson, S. Owicki, J. Saxe and C. Thacker, "High speed switch scheduling for local area networks," ACM Transactions on Computer Systems, Vol. 11, pp ,
[4] N. McKeown, "Scheduling algorithms for input-queued cell switches," PhD Thesis, University of California at Berkeley,
[5] N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE/ACM Transactions on Networking, Vol. 7, No. 2, pp , April
[6] Y. Li, S. Panwar and H. J. Chao, "On the performance of a dual round-robin switch," Proceedings of INFOCOM,
[7] C. S. Chang, W. J. Chen and H. Y. Huang, "Load balanced Birkhoff-von Neumann switches, part I: one-stage buffering," Computer Communications, Vol. 25, pp ,
[8] C. S. Chang, W. J. Chen and H. Y. Huang, "Load balanced Birkhoff-von Neumann switches, part II: multi-stage buffering," Computer Communications, Vol. 25, pp ,
[9] Y. Shen, S. Jiang, S. S. Panwar and H. J. Chao, "Byte-Focal: a practical load-balanced switch," IEEE Workshop on High Performance Switching and Routing, May 2005, Hong Kong.
[10] I. Keslassy and N. McKeown, "Maintaining packet order in two-stage switches," Proceedings of INFOCOM, June 2002, New York, USA.
[11] I. Keslassy, S. T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, "Scaling Internet routers using optics," ACM SIGCOMM '03, Karlsruhe, Germany, Aug
[12] C. S. Chang, D. S. Lee and Y. J. Shih, "Mailbox switch: a scalable two-stage switch architecture for conflict resolution of ordered packets," Proceedings of INFOCOM, March 2004, Hong Kong.
[13] C. Y. Tu, C. S. Chang, D. S. Lee and C. T. Chiu, "Design a simple and high performance switch using a two-stage architecture," IEEE GLOBECOM 2005, St. Louis, USA, Nov
[14] Y. S. Lin and C. B. Shung, "Quasi-pushout cell discarding," IEEE Communications Letters, Vol. 1, pp , Sept

Globecom. IEEE Conference and Exhibition. Copyright IEEE.
