A Low Complexity Scheduling Algorithm for a Crosspoint Buffered Switch with 100% Throughput

Size: px

Start display at page:

Download "A Low Complexity Scheduling Algorithm for a Crosspoint Buffered Switch with 100% Throughput"

Darcy Crawford
6 years ago
Views:

1 A Low Complexity Scheduling Algorithm for a Crosspoint Buffered Switch with 100% Throughput Yanming Shen, Shivendra S. Panwar, H. Jonathan Chao Department of Electrical and Computer Engineering Polytechnic University Abstract-Crosspoint buttered switches are emerging as the focus of research in high-speed routers. They have simpler scheduling algorithms, and achieve better performance than a bufferless crossbar switch. Crosspoint buttered switches have a buffer at each crosspoint. A cell is first delivered to a crosspoint butter, and then transferred to the output port. With a speedup of two, a crosspoint buttered switch has previously been proved to provide 100% throughput. In this paper, we propose a 100% throughput scheduling algorithm without speedup, called SQUID. With this design, each input/output keeps track of the previously served virtual output queues (VOQs)/crosspoint butters. We prove that SQUID, with a time complexity of G(log ), can achieve 100% throughput without any speedup. Our simulation results also show a delay performance comparable to outputqueued switches. We also present a novel queuing model that models crosspoint buttered switches under uniform traffic. I. ITRODUCTIO Internettraffic is growing rapidly, and this puts an increasing demand for high capacity routers. Some real-time applications, e.g., IPTV, Video-on-demand, and Voice-over-IP service have stringent delay requirements. This requires that the delays incurred in the routers should be kept small. A crosspoint buffered switch is a good solution to high performance router design, since it can provide good delay performance, and reduce the complexity of scheduling algorithms at the same time. Indeed, the delay performance is approximately the same as an output-queued switch, the minimum theoretically possible. Output queuing like performance also simplifies traffic engineering for such switches. With today's ASIC technology, crosspoint buffered switches can be implemented in a single chip. This makes crosspoint buffered switches an attractive solution compared to the traditional input-queued switch because the crosspoint buffers allow for simpler scheduling algorithms and better delay performance. Another important application for crosspoint buffered switches is that they can be used in high performance computing to interconnect thousands of processors [1]. A high throughput and low latency scalable switching fabric can improve the performance of a super computer significantly. It is therefore important to design a high throughput and low delay scheduling algorithm for crosspoint buffered switch. With a speedup of 2, the authors in [2] showed that a crosspoint buffered can provide 100% throughput. In [3], the results were extended to variable size packets. The author in [4] proved that the speedup requirement can be reduced to 2 - li. However, without speedup, previous throughput results are only limited to uniform traffic loads. Under uniform traffic, it has been shown that longest-queue-first at the input port and round-robin at the output port (LQF-RR) guaranteed 100% throughput [5]. In [6], the authors proved that a simple round-robin scheduler at both input and output ports can provide 100% throughput under uniform traffic. In [7], the authors proposed a distributed scheduling algorithm and derived a relationship between throughput and the size of crosspoint buffers. However, to achieve 100% throughput, it needed a buffer of infinite size. With state-ofthe-art technology, the total amount of buffers that can be built on chip is limited. For a switch with large number of ports, the buffer size associated with an input-output pair should be kept small. In [8], Tassiulas studied randomized algorithms that achieve 100% throughput for an input-queued switch. The approach works as follows. In each time slot, a random feasible solution to the maximum weighted matching problem is obtained. If the value of the new solution is higher than the value of the current solution, the latter is replaced. Using this approach guarantees achieving 100% throughput under the condition that the probability that the new solution is equal to the maximum weight matching is strictly greater than zero. A de-randomized version of this algorithm was proposed by Giaccone et al. [9], where a Hamiltonian walk is applied instead of randomly generating a new schedule. However, such approaches introduce a large delay. Several approaches [9], [10] have been proposed for input-queued switch to reduce the delay. This technique can be applied to a crosspoint buffered switch to achieve 100% throughput as well. With buffers at crosspoints, we can reduce the complexity significantly, and at the same time achieve a much better delay performance than input-queued switches. In [11], we presented a 100% throughput scheduling algorithm, SQUISH (Stable QUeue Input-output Scheduler with Hamiltonian walk). With SQUISH, a crosspoint buffered switch can achieve 100% throughput without speedup and a finite crosspoint buffer size. In this paper, we propose another 100% throughput scheduler for crosspoint buffered switches, called SQUID (Stable Queuing/Implementable Design). We again assume a crosspoint buffered switch with finite crosspoint buffers and no speedup. We consider a stochastic packet arrival process in which the rate corresponds to an admissible, but unknown traffic matrix. As discussed in Section III-C, compared with /08/$ IEEE 189

VOQs Inpu_ti--+--_~ Input i Outputj Input-queued switch Outputj Crosspoint buffered switch Fig. 3: Emulating an input-queued switch. Fig. 1: Crosspoint buffered switch. Fig. 2: The scheduling phases for crosspoint buffered switch.

2 VOQs Inpu_ti--+--_~ Input i Outputj Input-queued switch Outputj Crosspoint buffered switch Fig. 3: Emulating an input-queued switch. Fig. 1: Crosspoint buffered switch. Fig. 2: The scheduling phases for crosspoint buffered switch. SQUISH, SQUID can reduce message exchange significantly. This makes SQUID an implementable and scalable design. The rest of the paper is organized as follows. In section II, we briefly describe the crosspoint buffered switch architecture. In section III, we propose the scheduling algorithm and prove its stability. In section IV, we present delay results derived from an analytical model of the switch under uniform traffic. In Section V, we present simulation results. Finally, section VI concludes the paper. II. THE CROSSPOIT BUFFERED SWITCH MODEL Figure 1 shows an x crosspoint buffered switch. We shall assume fixed size packet (cell) switching. Variable size packet switching can be implemented by introducing packet segmentation and reassembly. Each input maintains VOQs, one for each output. Let Qij (n) denote the queue length of VOQij at time n, n == 0,1,2,... Each crosspoint contains a finite buffer of size B. The buffer between input i and output j is denoted as CBij; Bij (n) is the buffer occupancy of CB ij at time n, Bij (n) :s B. The key role of crosspoint buffers is to separate input contention from output contention. This results in a twostage scheduling scheme. As shown in Figure 2, in the input scheduling phase, each input determines from which VOQ a cell should be transferred to the corresponding crosspoint buffer with available space. In the output scheduling phase, each output determines from which non-empty crosspoint buffer to serve a cell. When a crosspoint buffer is full, no more cells can be transferred to it. ote that if the crosspoint buffer size is unlimited, the crosspoint buffered switch is equivalent an to output queuing switch and input schedulers are not necessary, since packets can directly go to crosspoint buffers without buffering at inputs. For a practical singlechip implementation using current technology, however, the crosspoint buffers are constrained to a small number. Let Aij (n) be the number of packets that have arrived at input i destined for output j up to time slot n. Assume the arrival process {Aij (n), i, j == 1,..., } satisfies the strong law oflarge numbers (SLL), Le., with probability one,, Aij(n) n 1'\" \';J 'l,] == 1,...,. 11 m == o n~oo Definition 1: An arrival process is said to be admissible if L Aij :S 1, L Aij :S 1. (2) j Let Dij (n) be the number of departures from crosspoint buffer CBij up to time slot n. Definition 2: A switch operating under a matching algorithm is rate stable if, with probability one, 11 m Dij(n) == \';Jo, 1 n n~oo 1'\" 'l,] ==,..., for any arrival process satisfying condition (1). Let Xij (n) be the total number ofcells in VOQij and CBij at time n, Xij(n) == Qij(n) + Bij(n), X(n) == [Xij(n)]. Let S(n) == [SI(n); SO(n)] be the schedule at time n. SI(n) == [SL (n)] is the input schedule and is subject to the following constraints: L Sb(n) :S 1, S{j(n) == 0 if Bij (n) == B; (4) j S O (n) == [S8 (n)] is the output schedule and is subject to the following constraints: L Sg(n) :S 1, Sg(n) == 0 if Bij(n) == O. (5) The set of all possible schedules is denoted by II. For each schedule S E II, we define the weight for output j as Wj == Xij if j serves crosspoint buffer CBij. The total weight Ws(n) of a schedule is Ws(n) == L Wj == (So(n), X(n)), j=l where for two matrices A and B of the same size, (A, B) == Lij AijBij. ote that although the weight is calculated with the output schedule only, in fact, the input schedule comes into the weight calculation implicitly. Indeed, from Figure 2, we can see that the output schedule is performed after the input schedule. A valid output schedule is determined by the state of crosspoint buffers. This takes place after the input scheduling phase, when the state of crosspoint buffers is updated. For crosspoint buffered switches, we have the following property as compared to input-queued switches: (1) (3) 190

3 Property 1: A crosspoint buffered switch can emulate any scheduling scheme applied to a bufferless input-queued switch. As shown in Fig. 3, for an input-queued switch, at each time slot, a matching is generated, where an input i is matched to an output j, and a cell from input i is delivered to output j. For a crosspoint buffered switch, a cell from input i can also be delivered to output j. If the crosspoint buffer is empty, then input i needs to send a cell to crosspoint buffer CB ij, and then output j removes that cell from the crosspoint buffer. If the crosspoint buffer is full, then output j can remove a cell from the crosspoint buffer while input i stays idle. Based on this observation, for a crosspoint buffered switch, we define the Input-Queued Switch Maximum-weightmatching Emulation (IQSME) schedule S* as the solution of the following optimization problem: max (S(n), X(n)) S s.t. L Sij :s; 1,L Sij :s; 1. j=l S;j == 1 means (i) if B ij < B, then input i transfers a cell to crosspoint buffer CB ij if V OQij is not empty and output j serves a cell from crosspoint buffer CB ij ; (ii) if B ij == B, then input i stays idle and output j serves a cell from crosspoint buffer CB ij. S* is essentially an emulation of the maximum weight matching schedule in an input-queued switch with no crosspoint buffers, where if input i sends a cell to crosspoint buffer CB ij, then output j must serve a cell from the crosspoint buffer CB ij. Define the weight W* of S* as W*(n) == (S*(n),X(n)). We will now introduce the following lemma from [11], which will be used to prove the stability of our algorithm: Lemma 1: For any admissible traffic satisfying (1), if the weight of a scheduling algorithm at each time slot is within a bounded constant value C from the maximum weight, Le., denote the queue length of such a VOQ for input i. If there is no such VOQ, then Pi == O. Similarly, each output keeps track of the crosspoint buffer served in the previous time slot. At each time slot, a Hamiltonian walk schedule [9] is generated. Let hi be the queue length of the VOQ corresponding to the Hamiltonian walk at input i. Then the scheduling algorithm is as follows: 1) Each input calculates the difference of d i == Pi - hi, and sends this information to the scheduler. 2) The scheduler calculates the sum of all the differences. It generates a '1' if the sum is positive (or zero) and '0' if the sum is negative. Then it sends back the one-bit information to all inputs/outputs. 3) Each input/output makes decisions based on this onebit information, Le., if a '1' is received, then it stays with the previous VOQ/crosspoint buffered served, and switches to Hamiltonian walk otherwise. 4) For each input, if the selected VOQ cannot be served either because it is empty or the corresponding crosspoint buffer is full, then it serves an eligible VOQ in a round-robin order. 5) For each output, if the selected crosspoint buffer is empty, then it serves a non-empty crosspoint buffer in a round-robin order. We now proceed to prove the stability of SQUID algorithm. To this end, we prove the following Lemma first. Lemma 2: Let S(n) be the schedule used at time nand H(n + 1) be the schedule generated by the Hamiltonian walk at time n + 1. If the schedule S(n + 1) at time slot n + 1 satisfies WS(n+l)(n+l) 2:: max(ws(n) (n+l), WH(n+l)(n+l))-C1, where C 1 is a constant, then this scheduling algorithm is rate stable. proof: From Lemma 1, to prove that the algorithm is rate stable, it suffices to show that Ws(n) 2:: W*(n) - C, (6) Ws(n) 2:: W*(n) - C, then this algorithm is rate stable. The proof of this Lemma can be found in [11]. III. THE STABLE QUEUIG/IMPLEMETABLE DESIG SCHEDULER FOR CROSSBAR BUFFERED SWITCHES In this section, we present the Stable QUeuing Implementable Design (SQUID) scheduler for crossbar buffered switches. A. Crosspoint buffer size ofone We first present the SQUID algorithm for crosspoint buffered switches with a crosspoint buffer size of one. The results for a crosspoint buffer sizes greater than one will be discussed in Sec. III-B. With SQUID, each input keeps track of the VOQ which satisfies the following two conditions: (i) this VOQ was served in the last time slot; and (ii) the corresponding crosspoint buffer is not full, Le., it was served in the last time slot. Let Pi where C is a bounded constant. Since there is at most one arrival or departure from each X ij in each time slot, we obtain that for any schedule P Wp(n) 2:: Wp(n + k) - k. (7) Consider any given time T, let S* denote the IQSME at time T. By the property of Hamiltonian walk, there exists a n' E [T -!, T], such that the Hamiltonian walk at n' is the IQSME at T, Le., H(n') == S*. Thus we have WS(nl) (n') > WH(n ' )(n') - C 1 Ws* (n') - C 1 > Ws* (T) - (T - n') - C 1, (8) where the last inequality is from (7). Again, from the definition of the algorithm and (7), WS(T)(T) 2:: WS(T-l)(T)-C1 2:: W S (T-l)(T-l)--C1. 191

4 Using this repeatedly, we obtain WS(T)(T) 2: WS(n l ) (n') - (T - n') - (T - n')c1. (9) Combining (8) and (9), we have WS(T) (T) 2: Ws* (T) - 2(T - n') - (T - n')c1 - C1. Since T - n' ~!, this implies and the above is true for any T. From Lemma 1, the scheduling algorithm is therefore rate stable. Lemma 3: Let S(n) and S(n+1) be the schedules used by SQUID at time nand n + 1 respectively, then WS(n+1)(n + 1) 2: WS(n) (n + 1). proof' Case 1: L:~1 d i 2: 0: From the SQUID algorithm, we know that if L~l d i 2: 0, then an output will switch to a different crosspoint buffer only if the previously served crosspoint buffer is empty. When an output switches to a different crosspoint buffer if the selected crosspoint buffer is empty, the total weight will increase, or remain the same if all crosspoint buffers are empty. Therefore, we have WS(n+1)(n + 1) 2:: WS(n)(n + 1). Case 2: L:~1 di < 0: If L:~1 di < 0, we have, Lhi > LPi, and inputs/outputs will switch to the Hamiltonian walk schedule. The weight of the schedule S(n +1) at time n +1 satisfies the following inequality, WS(n+1) (n + 1) 2: L hi, since both inputs and outputs will select non-empty VOQs/crosspoint buffers in a round-robin order if the selected VOQs/crosspoint buffers are empty. From the schedule S(n) at time n, we can divide the outputs into two categories: J1: j E J1 if j serves crosspoint buffer CB ij and input i sends a cell to crosspoint buffer CB ij ; J 2 : j E J 2 if j serves crosspoint buffer CB ij, but input i does not send a cell to crosspoint buffer CB ij. When S(n) is applied at time n + 1, if j E J1, then Wj == Qij == Pi If j E J 2, then Wj == 0, since at the beginning of time slot n +1, the crosspoint buffer CB ij is empty, and input i does not send a cell to crosspoint buffer CB ij in time slot n + 1. Therefore, we have. WS(n) (n + 1) == LPi. From the definition of SQUID, WS(n+1)(n + 1) 2: Lhi > LPi == WS(n)(n + 1). B. Crosspoint buffer size more than one When the crosspoint buffer size is more than one, as compared to the case of size one, the only difference is how each input obtains Pi. Let Pi denote the set of VOQs for input i for which (i) both the VOQ and the corresponding crosspoint buffer was served in the last time slot, or (ii) the corresponding crosspoint buffer was served in the last time slot and it is not empty in the current time slot. There may be multiple such VOQs for a particular input port. Let Pi denote the sum of queue lengths for such VOQs for input i, Pi == L:jEPi Qij. If there is no such VOQ, then Pi == O. ow we proceed to prove the stability of SQUID with a crosspoint buffer size greater than one. First, we prove the following Lemma. Lemma 4: Let S(n) and S(n + 1) be the schedules used at time nand n + 1 respectively. Then WS(n+1)(n + 1) 2: WS(n)(n + 1) - B. proof' If L:~1 d i 2: 0, as in the proof of Lemma 3, the total weight will increase. Therefore, we have WS(n+1)(n + 1) 2: WS(n)(n + 1). If L:~1 d i < 0, then inputs/outputs will switch to the Hamiltonian walk schedule, and Let ji denote the corresponding input when output j serves crosspoint buffer CB jij (B jij > 0). If we apply the schedule S(n) at time slot n + 1, then The total weight is Wj WS(n)(n + 1) Qjij + Bjij Pji + Bjij < Pji + B. (10) (11) where the second inequality is from (11). Theorem 1: The SQUID algorithm is rate stable for any admissible arrival process satisfying the strong law of large numbers. The proof of this theorem is a straightforward extension from Lemmas 3 and 4 corresponding the cases of B == 1 and B > 1, respectively. Lhi > LPi. From the definition of SQUID, L Wj j=l < LPji +B. j=l WS(n+1)(n+ 1) > Lhi > LPi > WS(n)(n + 1) - B. 192

c. Implementation In our design, each slot has two message exchanges: (1) Each input sends the queue length difference information to the scheduler; and (2) the scheduler sends one bit information to

5 c. Implementation In our design, each slot has two message exchanges: (1) Each input sends the queue length difference information to the scheduler; and (2) the scheduler sends one bit information to all inputs/outputs. Let d max denote the maximum queue length difference among all input ports, then totally up to logd max + 1 (sign) bits need to be sent to the scheduler from each input port. The number of bits sent by the scheduler to inputs/outputs is 1. Therefore, the total bits exchanged with each line card in each time slot is at most 2 + logd max. With SQUISH [11], each input needs to send VOQ lengths to output ports in order to pick the longest during the output scheduling phase. This implies up to logd max bits needs to be sent by each input. We can see that SQUID reduces the communication overhead significantly. The scheduler only needs to calculate the sum of variables, which can be implemented with a time complexity of 10g. The Hamiltonian walk is deterministic and has a complexity of0(1). The time complexity for each input/output storing the identity of the last served queue and implementing a round-robin schedule is also 0(1). Therefore, the total time complexity of the algorithm is O(log). When the delay between linecards and the scheduler is not negligible, as is typical for a large switch [12], we can apply pipelining to address this issue. With pipelining, the scheduler can take more than one one time slot to collect the information and send out the scheduling decision. The pipelined SQUID will also provide 100% throughput as in the case of SQUISH [11]. IV. DELAY AALYSIS In this section, we analyze the average delay in the crosspoint buffered switch. The other analytical model for a crosspoint buffered switch was presented in [13]. Their model is used to obtain a saturation throughput under FIFO service without input buffering. Throughout the analysis, we consider queue states to be at the beginning of each time slot (before packet arrivals). We assume the VOQs are evenly loaded with a rate of ~. We make the simplifying assumption that, at each input port, among all the eligible VOQs, the input scheduler randomly picks a VOQ to serve. This models scheduling algorithms that are stable under uniform traffic. We first look at all the crosspoint buffers for a particular output. We consider the case when B == 1. Let a denote the number of packets in these crosspoint buffers, and let P be the arrival probability from the corresponding VOQ when a crosspoint buffer is empty. Then we have the state transition diagram for the crosspoint buffers as shown in Figure 4. For clarity, we only show transitions for the zeroth state to higher states (i.e., right transitions). Let Si denote the probability of being in state i. From flow conservation, the average arrival rate to these crosspoint buffers should be A, therefore -I A == E ( - i)psi. i=o Fig. 4: Markov chain for crosspoint buffers. )j(l-q) ~(1-)j)q Then we have )j(l-q) (1-)j)q 1-p Fig. 5: Markov chain for a VOQ. P )j(l-q) (1-)j)q )j(l-q)... ~ ~-I ~-I" L..,;i=O Si - L..,;i=1 ~Si A -E(a)" (1-)j)q Then we can solve P iteratively by picking an initial value for P, solving the Markov chain to get E(a) to generate the next value for p. ext, let us look at a particular non-empty VOQ destined to this output. Let q denote the probability that this non-empty VOQ is served in the input scheduling phase, which means that (i) its corresponding crosspoint buffer has space, and (ii) it is selected for service by the input scheduler. Then we have a Markov chain for this VOQ as shown in Figure 5. The probability PI that this VOQ is non-empty is simply A(1 - q) ( - A)q" (12) We define Po to be the probability that a crosspoint buffer is empty, then - E(a) Po == " (13) We also define the conditional probability Ps that a VOQ is selected for service when it is not empty and its corresponding crosspoint buffer has space ps == P(VOQ selectedlvoq non-empty, crosspoint buffer empty)" From these definitions, we can see that - E(a) q == POPs == PS" (14) Among the remaining VOQs, the expected number that are not empty and have the corresponding crosspoint buffers that are empty (which means those VOQs are eligible for service) can be estimated by ( -1)PIPO. Then, using our assumption that eligible VOQs are randomly picked for service, 1 1 Ps = 1 + ( - l)pipo 1 + >'(l-q)(-l)(-e(a))' (15) q(->") By solving equations (14) and (15), we can get _ 2 (1 - A)( - E(a)) q - 3(1 -,x) +,x( - l)e(a)' (16) 193

-to-=32, simulation ----&-=32, model 102 - =64, simulation -&- =64, model >.

6 -to-=32, simulation ----&-=32, model =64, simulation -&- =64, model >. - =128, model ct3 lo' "'0 0) ct3 0>10" > «-+-SQUID 50 0 SQUISH ~-output-queu rl 40 "0 ~30 H Blue: SQUID Green: SQUISH ~20 Red: output-queued d ol-- ' """'====-===:ti~~===l------' j o O. ~ O. B 0.9 Fig. 6: Average delay by simulation and Markov model. Fig. 7: Average delay under uniform Bernoulli traffic, == 32, crosspoint buffer size B == 1. Since E(a) is known from the solution of the Markov chain for the crosspoint buffers, we can then get the average delay by solving the VOQ Markov chain. Figure 6 shows the average delay performance obtained by simulations using the SQUID scheduler and the model respectively. We can see that when the loading is not extremely high, then the model matches the simulation results very well. We also plot the average delay obtained from the model for == 128. Under extreme loading (A == 0.99), as the switch size increases, the model becomes more accurate. The Markov model can therefore be used to evaluate delays and buffer occupancies for large switches where simulations take too long w=o. 75, SQUID o w=o. 75, SQUISH 100 -w=q. 75, output-qu ued ~ -e-w=o.5, SQUID rl w=o.5, SQUISH ~ 80 ~ w=o.5, output-que ed ---w=o.25, SQUID 01 &0 w=o. 25, SQUISH ~ --?-w=o. 25, output-qu ued ~ 40 Blue: SQUID Green: SQUISH 20 Red: output-queued Fig. 8: Average delay under hotspot traffic with different values of w, == 32, B == 1. V. SIMULATIO In this section, we study the delay performance of the SQUID scheduling algorithm via simulations. We show a comparison on the delay performance between SQUID and an output-queued switch, which has the best possible performance. The traffic patterns studied in our simulations are uniform and non-uniform traffic with Bernoulli and bursty arrivals. In our simulations, all inputs are equally loaded on a normalized scale A E (0, 1), and we measure the fixed length packet (cell) delay. In our switch model, we do not consider segmentation and re-assembly delays. A. Uniform Bernoulli traffic For uniform traffic, the destination of a new cell is uniformly distributed among all the output ports, i.e., Aii == AI. Figure 7 shows the delay performance under uniform Bernoulli traffic. We can see that SQUID has delay performance very close to an output-queued switch. As compared to SQUISH, if the load is not close to 1, the average delays are almost the same. However, unlike SQUISH, SQUID does not need to choose the longest queue to guarantee stability. Indeed, even a simple round-robin scheduler has a delay performance close to an output queued switch [6]. However, if the traffic is nonuniform, the round-robin scheduler fails to achieve 100% throughput. We therefore study the delay performance of SQUID under nonuniform traffic next. B. onuniform traffic For the nonuniform traffic, we first try hot-spot [6] traffic. With hot-spot traffic, for any input port, Aii == 11)A, Aii == (1 - w)ai( - 1), for i i- j. By changing the factor w, we can get different nonuniform traffic patterns. Figure 8 shows the delay performance for Bernoulli traffic with W == 0.25,0.5 and We can see that for different w, when the load is less than 0.9, the average delay of SQUID is very close to an output-queued switch. When W becomes larger, which means the traffic is more unbalanced, the average delay of SQUID will increase when the load is close to 1. The reason is that SQUID selects queues using a round-robin schedule rather than queue length, thus it is not as efficient as SQUISH [11] under unbalanced traffic. Though the average delay is higher under extremely high load, SQUID can achieve 100% throughput. ote that under hotspot traffic, the roundrobin scheduler has a throughput around 85% [6] and the SBF LBF [14] has a throughput around 87%. ext, we test the delay performance of SQUID under Logdiagonal and Lin-diagonal [15] traffic patterns. With Logdiagonal, arrival rates at the same input differ exponentially, Le., Ai(i+i) == 2Ai(i+i+l)' where 0 :s; j :s; - 2. With Lindiagonal, arrival rates at the same input differ linearly, Le., Ai(i+i) - Ai(i+i+l) == 2AI( + 1), where 0 :s; j :s; - 2. Figure 9 shows the delay performance for log-diagonal 194

7 -+-Log-diagonal, SQUID Blue: SQUID o Log-diagonal, SQUISH Green: SQUISH >t':'() -Log-diagonal, output-q leued Red: output-que d rtl - e - Lin-diagonal, SQUID " ~ Lin-d~agonal, SQUISH " "d.:. o -+-Lln-dlagonal, output-q eued I 0) tj1 rtl H 0)':'(' ~ Fig. 9: Average delay under lin-diagonal and log-diagonal traffic, == 32, B == 1. Fig. 11: Average delay under hotspot traffic with different crosspoint buffer sizes, == 32, w == 0.5. and lin-diagonal traffic. We can see that the average delay performance for both traffic patterns is close to an outputqueued switch unless it is under an extremely high load. Log-diagonal traffic is more unbalanced as compared to lindiagonal traffic, thus the average delay under log-diagonal traffic is higher than the average delay under lin-diagonal traffic. c. Switch size effect Generally, for input-queued switches, the average delay increases linearly with the switch size. For output-queued switches, the delay is independent of the switch size. In this simulation, we test the delay performance with different switch sizes. Figure 10 shows the average delay with different switch sizes. For a given switch size, we simulated uniform Bernoulli, and nonuniform Bernoulli (w == 0.25 and 0.75) traffic. We can see that the average delays are almost the same for different switch sizes. ote that for a crosspoint buffered switch, even if the crosspoint buffer size is one, when the switch size increases, the total crosspoint buffer size increases. For a particular output, the amount of crosspoint buffers associated with it is (if the crosspoint buffer size is one). From simulations, we found that when the switch size increases, the average delay at the input ports will decrease. On the other hand, the average delay at the output will increase since there are more crosspoint buffers. These appear to balance out and therefore, the total delay a packet experiences at both input port and crosspoint buffer does not increase with the switch size. D. Buffer size effect When the crosspoint buffer size is infinite, a crosspoint buffered switch would become equivalent to an output-queued switch. Therefore, if we increase the crosspoint buffer size, the average delay will decrease, and eventually converge to the average delay of an output-queued switch. In this simulation, we test the average delay performance for hotspot traffic (w == 0.5) with different crosspoint buffer sizes. Figure 11 shows the average delay under hotspot traffic with different crosspoint buffer sizes. We can see as the crosspoint -+-a=l. 7, SQUID ---e--a=l. 7, ou:put-quea ':'0' -a=1.9, SQUID ~ -a=1.9, ou:put-quea,...-i -+-Bernoulli, SQUID ~ Bernoulli, output- Fig. 12: Average delay under uniform bursty traffic with different burst parameters, == 32, B == 1. buffer size increases, the average delay of SQUID gradually decreases, closer to that of an output-queued switch. E. Bursty traffic We have proved that SQUID can achieve 100% throughput for any admissible input traffic satisfying the strong law of large numbers, and the stability results are not limited to Bernoulli arrivals. In this section, we test the delay performance of SQUID under bursty traffic. For bursty traffic, all the cells in the same burst go to the same destination. The burst lengths are chosen independently from the following truncated Pareto distribution [16]: c P(A burst has a length l) == la' 1 == 1,..., 10000, where a is the Pareto parameter and c is the normalization constant; we vary a to get different average burst lengths. In our simulations, we try two different values for a, a == 1.9 and a == 1.7. When a == 1.9, the average burst length is about 9. When n == 1.7, the average burst length is about 24. Figure 12 shows the delay performance of SQUID under uniform bursty traffic as compared to output-queued switch. Generally, the average delay will increase as the average burst length increases (smaller 0). But the delays in either case are close to that of an output-queued switch. 195

~" ~'---~'-----'-"='="':'=---'--J,I ~"rj-~---:'==~ ro t'--.::.:...::...:..=..=c::.-:...::..::~ '0 r~ '0 r~ '0'0 tj1 Cll H IO- ~ o 31 02 OJ 34 D.':> 0," 3' 0.8 :J 01 02 :Jl 0.

8 ~" ~'---~'-----'-"='="':'=---'--J,I ~"rj-~---:'==~ ro t'--.::.:...::...:..=..=c::.-:...::..::~ '0 r~ '0 r~ '0'0 tj1 Cll H IO- ~ o OJ 34 D.':> 0," 3' 0.8 :J :Jl 0.4 (a) Uniform (b) Hotspot, W = 0.25 (c) Hotspot, W = 0.75 Fig. 10: Average delay with different switch sizes under Bernoulli traffic, B == 1. VI. COCLUSIO In this paper, we have proposed a Stable QUeuing Implementable Design (SQUID) scheduler for crosspoint buffered switches with a crosspoint buffer size as small as one and without speedup. The complexity of SQUID is O(log ). With SQUID, a crosspoint buffered switch can achieve 100% throughput under any admissible input traffic satisfying the strong law of large numbers. Our simulation studies indicate that for most realistic traffic scenarios the delay performance of SQUID is close to that of an ideal output-queued switch. Only under highly loaded non-uniform traffic is the delay significantly greater than an output-queued switch. The average delay does not increase with the switch size, as has been observed for input-queued switches. This means that SQUID has scalable delay performance and is suitable for a large-scale switch. This paper also presents a novel queuing model for crosspoint buffered switches under uniform load. This model can be used to estimate the performance of switches too large to simulate in a reasonable time. With SQUID, we need a low complexity but centralized scheduler. Most previous research on crosspoint buffered switches focused on designing distributed scheduling algorithms, where inputs and outputs can make scheduling decisions independently. However, such algorithms cannot achieve 100% throughput. And an interesting open problem is to design a distributed scheduling algorithm that provides 100% throughput. [6] R. Rojas-Cessa, E. Oki, Z. Jing, and H. 1. Chao, "CIXB-I: Combined Input-One-Cell-Crosspoint Buffered Switch," in Proceedings of IEEE Workshop ofhigh Performance Switches and Routers 2001, Dallas, TX, May [7] P. Giaccone, E. Leonardi, and D. Shah, "On the maximal throughput of networks with finite buffers and its application to buffered crossbars," in Proceedings of IEEE Infocom, Miami, FL, March [8] L. Tassiulas, "Linear complexity algorithms for maximum throughput in radio networks and input queued switches," in Proceedings of IEEE IFOCOM 1998, ew York, 1998, vol. 2, pp [9J P. Giaccone, B. Prabhakar, and D. Shah, "Toward simple, highperformance schedulers for high-aggregate bandwidth switches," in Proceedings of IEEE IFOCOM, ew York, [10] Y. Li, S. S. Panwar, and H.J. Chao, "Exhaustive service matching algorithms for input queued switches," in Proceedings ofieee Workshop on High Performance Switching and Routing, [11] Y. Shen, S. S. Panwar, and H. J. Chao, "Providing 100% throughput in a buffered crossbar switch," in Proceedings ofieee Workshop on High Performance Switching and Routing, May [12]. Chrysos and M. Katevenis, "Crossbars with minimally-sized crosspoint buffers," in Proc. IEEE Workshop on High Performance Switching and Routing (HPSR 2(07), Brooklyn, Y, USA, 30 May - 1 June [13] Mingjie Lin and ick McKeown, "The throughput ofa buffered crossbar switch," IEEE Communications Letters [14] L. Mhamdi and M. Hamdi, "MCBF: a high-performance scheduling algorithm for buffered crossbar switches," IEEE Communications Letters, vol. 7, no. 9, pp , Sept [15] A. Baranowska, G. Danilewicz, W. Kabacinski, J. Kleban, D. Parniewiez, and P. Dabrowski, "Performance evaluation of the multiple output queueing switch under different traffic patterns," in Proceedings of IEEE GLOBECOM 2005, Dec [16] C.S. Chang, D. Lee, and Y. Jou, " balanced Birkhoff-von eumann switches, part I: one-stage buffering," Computer Communications, vol. 25, pp , REFERECES [1] Y. P. Zhang, T. Jeong, F. Chen, H. Wu, R. itzsche, and G. R. Gao, "A study of the on-chip interconnection network for the ibm cyclops64 multi-core architecture," in Parallel and Distributed Processing Symposium, April [2] S-T. Chuang, S. Iyer, and. McKeown, "Practical algorithms for performance guarantees in buffered crossbars," in Proceedings ofieee Infocom, Miami FL, March [3] J. Turner, "Strong performance guarantees for asynchronous crossbar schedulers," in Proceedings of IEEE Infocom, Spain, April [4] M. Berger, "Delivering 100% throughput in a buffered crossbar with round robin scheduling," in Proceedings of IEEE High Performance Switch and Routing, Poznan, Poland, [5] T. Javidi, R. Magill, and T. Hrabik, "A high throughput scheduling algorithm for a buffered crossbar switch fabric," in Proceedings ofieee International Conference on Communications,

Efficient Queuing Architecture for a Buffered Crossbar Switch

Proceedings of the 11th WSEAS International Conference on COMMUNICATIONS, Agios Nikolaos, Crete Island, Greece, July 26-28, 2007 95 Efficient Queuing Architecture for a Buffered Crossbar Switch MICHAEL