IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 55, NO. 3, MARCH 2007

Resequencing Worst-Case Analysis for Parallel Buffered Packet Switches

Ilias Iliadis, Senior Member, IEEE, and Wolfgang E. Denzel

Abstract—This paper considers a general parallel buffered packet switch (PBPS) architecture which is based on multiple packet switches operating independently and in parallel. A load-balancing mechanism is used at each input to distribute the traffic to the parallel switches. The buffer structure of each of the parallel packet switches is based on either a dedicated, a shared, or a buffered-crosspoint output-queued architecture. As packets in such multipath PBPS switches may get out of order when they travel independently through the parallel switches, a resequencing mechanism is necessary at the output side. This paper addresses the issue of evaluating the minimum resequence-queue size required for a deadlock-free lossless operation. An analytical method is presented for the exact evaluation of the worst-case resequencing delay and the worst-case resequence-queue size. The results obtained reveal their relation, and demonstrate the impact of the various system parameters on resequencing.

Index Terms—Load balancing, parallel packet switching, resequencing, switching systems.

I. INTRODUCTION

THE CAPACITY demand of the early generations of high-performance packet switches or routers could be sufficiently covered by single-stage switch-fabric architectures. Even the majority of the current switch generations is still based on single-stage architectures, because the continued advances in very-large-scale integration (VLSI) technology made it possible, for the most part, to keep pace with the increasing capacity requirements. However, a further proliferation in capacity, in particular, an increase of the number of ports as opposed to an increase of the port bandwidth, will eventually require the transition to multistage packet switch architectures.
Owing to the complications associated with this transition, an intermediate solution that allows an easier migration is of great interest. One such solution is based on multiple single-stage switches, which provide multiplied capacity by operating independently in parallel. This concept has been referred to as parallel switch architecture in [1], parallel packet switch (PPS) in [2] and [3], and distributed packet switch in [4]. It addresses the problem arising when line rates run faster than the switch fabric does. Further advantages of the PPS concept include the possibility of a smooth migration from the existing single-stage switches and the flexibility regarding the number of parallel switches. The migration is facilitated because the individual switches can work fully independently and need not be synchronized with each other. Therefore, this concept can be applied to any switch by developing only a single new component, namely, a higher-speed switch adapter or line card that incorporates the plane load-balancing and plane server schemes. Second, using more switches than the minimum number required for accommodating the total bandwidth results in an effective speedup that improves the overall performance. In this paper, we consider the general parallel buffered packet switch (PBPS) architecture, which is based on multiple buffered packet switches operating independently and in parallel. The buffer structure of each of the parallel packet switches is based on either a dedicated, a shared, or a crosspoint output-queued architecture.

[Paper approved by A. Pattavina, the Editor for Switching Architecture Performance of the IEEE Communications Society. Manuscript received April 17, 2005; revised January 27, 2006 and August 2. The authors are with IBM Research, Zurich Research Laboratory, 8803 Rüschlikon, Switzerland (e-mail: ili@zurich.ibm.com). Digital Object Identifier /TCOMM]
Our particular interest is in the buffered-crosspoint architecture, such as the single-stage distributed packet-routing switch described in [5], which for practical reasons uses a round-robin column-scheduling mechanism. The operation is assumed to be lossless, which is achieved by employing either a credit- or a backpressure-type feedback flow-control mechanism both from the switch planes to the input adapters and from the output adapters to the switching planes. As in multistage multipath switches, a resequencing problem also arises in this context, as packets of a given flow may get out of sequence when they travel independently through the various buffered switch planes. Consequently, a resequencing mechanism is needed at the egress side. It is assumed to be based on sequence numbers rather than on time-based resequencing. The cost of resequencing is now tolerated, because it provides the possibility to obtain higher capacity with a minimum of lower-speed parallel hardware and without centralized control [3]. We address the issue of evaluating the worst-case resequence-queue size, as this determines the minimum output buffer size required for a deadlock-free lossless operation. A deadlock occurs when an egress output buffer is full of resequence cells; none of them can depart from the buffer, and no cells can enter the buffer because there is no free space left. Consequently, the knowledge of the worst-case resequence-queue size is a prerequisite for an appropriate buffer dimensioning and safe operation of the switching system. Both the disturbance of the cell sequence and the delay performance depend on how evenly the traffic load is balanced across the parallel switches. A good balance and a high efficiency are best achieved by dynamic load-balancing mechanisms that dispatch each cell independently. They can be state-independent or state-dependent.
In the former case, cells are dispatched to the planes without considering any information related to the switches, for example, in a round-robin [6], [7], random, or cyclic [1] manner. In the latter case, cells are dispatched to planes that are determined based on switch-related information, such as buffer (queue) occupancies [4], or

the knowledge of the traffic already arrived [2], [3]. Note that these mechanisms do not entail any throughput reduction, because they do not inhibit the flow of cells from the input adapters to the planes, provided there is free space. However, they cause different degrees of temporary load asymmetries, which, in turn, have an impact on the resequencing. Although the state-dependent mechanisms are likely to result in improved resequencing behavior owing to the minimization of the load asymmetries, in practice this improvement is reduced by the propagation delays (cells in flight) between adapters and switch planes. Clearly, the longer the propagation delay, the less representative and, therefore, useful the state information is. Also, simulations we conducted have shown that state-dependent load balancing is less effective when it is performed on cells rather than on variable-length packets. Moreover, as we consider the PPS concept applicable to any switch, we could not a priori assume that the switch has the capability of providing sufficient state information. We therefore focus here on the less complex, state-independent load-balancing mechanisms. Extensive work on resequencing analysis exists in various other contexts, e.g., [8]-[11] and the references therein. These results, however, are limited to the resequencing delay and the resequence-queue occupancy. To the best of our knowledge, none of the earlier studies has considered the issue of the maximum resequence-queue size, because it was assumed that if the resequence buffer is full, subsequently arriving out-of-sequence packets are dropped. In contrast, we do not allow packets or cells to be dropped in the architectures considered, because one intended use of the fabric is for server interconnect.
In this domain, the stringent latency requirements do not allow for retransmissions, which, in turn, translates into a lossless mode of operation. Moreover, early works dealt with the resequencing incurred by a single flow either in isolation [9], [10] or in the presence of multiple flows [8]. In contrast to previous works, here we consider the combined multiple-flow effect arising in the PBPS context, as at an output adapter there can be many independent flows simultaneously under resequencing, all of which should be jointly taken into account. More recent works [12] have only established upper bounds of the worst-case resequence-queue size, denoted by Q*, because, as this paper demonstrates, its exact derivation is far from straightforward. It turns out that a trivial upper bound is derived as

Q* <= K B,  (1)

where K denotes the number of parallel switch planes, and B the maximum buffer occupancy in a plane with cells of the highest priority that are destined to an output port associated with an output adapter. This paper addresses the following practical questions. How tight is this upper bound? If it is not tight, is it possible that a given switch configuration with an output buffer size that is much less than this upper bound can still operate in a deadlock-free lossless fashion? One of the contributions of this paper is that it provides the answers to these questions by deriving the actual worst-case resequence-queue size, given as follows:

Q* = (K - 1) B.  (2)

The essence of (2) is that the worst-case resequence-queue size corresponds to the largest delay cells can experience in the network, and that this largest delay is determined by the number of cell buffers that can be used by cells destined to a particular output in a single plane (B) multiplied by the number of remaining parallel planes (K - 1).
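The relation between the trivial bound (1) and the exact result (2) can be checked with a few lines of arithmetic. The helper names below are illustrative; K is the number of parallel switch planes and B the maximum per-plane buffer occupancy, as defined above.

```python
def trivial_bound(K: int, B: int) -> int:
    """Loose upper bound of (1): all K planes full of cells
    that could precede a tagged cell."""
    return K * B

def worst_case_queue(K: int, B: int) -> int:
    """Exact worst case of (2): B cells stuck ahead of a tagged cell
    in one plane, while each of the remaining K - 1 planes delivers
    one bypassing cell per arbitration cycle for B cycles."""
    return (K - 1) * B
```

For instance, `worst_case_queue(6, 1024)` gives 5120 cells against a loose bound of `trivial_bound(6, 1024)` = 6144.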
This corresponds to the case in which some cells (possibly just one) are located in a full plane and, while they are working their way through this plane, the remaining planes are left free to allow subsequent cells to bypass them. What is not at all evident, and in fact seems counterintuitive, is that the scenario described above can ever arise, in which the loading asymmetry between the full plane and any of the remaining planes is so extreme, given that a scheme that load-balances the traffic is deployed. This issue will be addressed in great detail. A range of additional contributions covers the following aspects. The worst-case resequence-queue sizes for the dynamic state-independent load-balancing mechanisms considered (random, cyclic, round-robin) are all equal to the absolute worst-case resequence-queue size. The worst-case resequence-queue sizes for the dedicated, shared, and crosspoint output-queued architectures are the same as given by (2), despite some strikingly different characteristics. In the presence of priority classes, the worst-case resequence-queue size is bounded for the highest-priority traffic only. Consequently, additional measures are needed to ensure a safe operation for the low-priority traffic classes. Although (2) turns out to apply to both a dedicated/shared architecture and a crosspoint architecture, there are distinct differences, such that the details of the underlying architecture matter. For instance, in the presence of multiple flows, the worst-case resequence-queue size in the former case is always achieved by a single flow, whereas in the latter case, it can also be achieved by multiple flows. Additional differences are given in Section III-B.3. Owing to the generality of the result obtained, one may be tempted to assume that there is a unified approach to arrive at this result.
This is, however, highly unlikely, because the analysis presented in this paper shows that the methodology developed for the dedicated/shared architecture does not apply to the crosspoint architecture, and vice versa. For small values of the number K of switch planes, the worst-case resequence-queue size established in (2) is substantially different from the trivial upper bound. For example, for a PBPS architecture based on six switch planes, each having a maximum buffering capability of 1024 cells per output port, Q* is equal to 5120 cells for the dedicated, shared, and crosspoint output-queueing. This value is about 17% lower than the loose upper bound of 6144 cells obtained as the product of the number of planes (6) and the maximum buffering capability (1024). The gap becomes even more pronounced for a smaller number of switch planes: reducing the number of switch planes to four, Q* is reduced to 3072 cells, which is 25% less than the loose upper bound of 4096 cells. The design conditions for a deadlock-free lossless operation now follow from (2), where the minimum output buffer size required depends only on the number of planes K and the maximum per-plane buffer occupancy B. It therefore indirectly depends on the switch size through B. Furthermore, it is independent of the load, because the worst-case resequence-queue size can also occur in a lightly loaded system. For example, we have conducted simulations for a server-interconnect fabric based on a buffered-crosspoint architecture with eight switch planes, each having 128 KB of buffer space per output port. Using real-application traffic based on traces of a scientific computing application, and although the average load was only 1%, the maximum resequence-queue occupancy measured (299 KB) was already 33% of the worst case (896 KB). This maximum was reached during one of the rare activity periods, and is attributed to the highly bursty nature of this traffic and the long-tailed packet-length distribution. Also, simulations we conducted using geometrically distributed packet lengths have shown that the maximum resequence-queue occupancy increases with the load, as well as with the mean packet length. It has reached values as high as 55% of the worst case. This means that if the resequencing queue provided were half of that dictated by the worst-case analysis, which is still significant, a deadlock would have occurred. This implies that, in general, reducing the resequence-queue size, even by a small factor, compared with the worst-case requirement could potentially lead to a deadlock situation.

This paper is organized as follows. In Section II, our model of a PBPS based on buffered-crosspoint switches is presented. Detailed descriptions of the state-independent load-balancing mechanisms considered, as well as of the operation of the resequence queue, are provided. Section III presents a general analytical method for the derivation of the worst-case resequencing delay and resequence-queue occupancy under the joint effect of multiple resequence flows.

Fig. 1. PBPS/buffered-crosspoint switch.
The circumstances under which the worst-case resequence-queue size arises are identified. The issue of multiple priorities is also addressed. Concluding remarks follow in Section IV.

II. PRELIMINARIES

A. Switch Model

A PBPS architecture based on a buffered-crosspoint switch is illustrated in Fig. 1. There are K parallel switch planes, each consisting of an output-queued switch running at a nominal speed of one. The ingress side comprises the input adapters, each performing the plane load-balancing function, i.e., the distribution of incoming packets or cells across the lower-capacity parallel switches (planes). At the egress side, the output traffic of all parallel switches is merged back by the reverse function, the plane server function, performed by each of the output adapters. Both the input and output adapters have a buffering capability. As shown in Fig. 1, the input and output adapters consist of individual ingress-input and egress-output ports, respectively, running at a relative speed of S. Note that the proposed architecture allows us to realize both a fast switch (by choosing a high port speed S with an appropriate number of planes K), as well as a larger switch size. Owing to the deployment of multiple planes, the internal switch speed need not be higher than the ingress/egress port speed. The speedup, or expansion factor, of the configuration considered is then given by K/S. For stability reasons, the expansion factor should be greater than or equal to one. The architecture considered provides the capability of switching variable-length packets by segmenting the packets at the ingress into fixed-length segments, and subsequently transmitting them through the switch via fixed-size switch cells. In the following, a buffer-space unit is taken to be equal to the fixed size of a switch cell. We also refer to the transmission time of a switch cell at the nominal switch port speed as a cell cycle.
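The segmentation step described above can be sketched as follows. The cell size, the padding convention, and the function names are illustrative assumptions, not taken from the paper; the point is that each fixed-size cell carries a per-flow sequence number, which is what later enables resequencing at the egress.

```python
CELL_SIZE = 64  # bytes per fixed-size switch cell (illustrative value)

def segment(payload: bytes, cell_size: int = CELL_SIZE):
    """Split a variable-length packet into fixed-size cells, each
    tagged with a per-flow sequence number; the last cell is padded."""
    cells = []
    for seq, start in enumerate(range(0, len(payload), cell_size)):
        chunk = payload[start:start + cell_size]
        cells.append((seq, chunk.ljust(cell_size, b"\x00")))
    return cells

def reassemble(cells, length: int) -> bytes:
    """Concatenate cells in sequence-number order and strip padding."""
    body = b"".join(chunk for _, chunk in sorted(cells))
    return body[:length]
```

A 150-byte packet thus becomes three 64-byte cells, and `reassemble` recovers it even if the cells are handed back out of order.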
It is, moreover, assumed that the round-trip times between input adapters and planes, and between planes and output adapters, are negligible. The system supports a number of priority classes. Priorities are obeyed at all points of queueing, i.e., at the virtual output queues (VOQs) of the input adapters, in the buffers of the switch planes, and at the output adapters. A flow is the set of fixed-length segments traveling between a pair of ingress-input and egress-output ports and transported by switch cells through the system. A given flow may contain multiple streams of individual internet protocol (IP) sessions [e.g., transmission control protocol (TCP) or user

datagram protocol (UDP)]. The switch-related information of cells comprises their incoming and outgoing ports, their sequence number, and their priority. At the output adapters, however, the cells (and the corresponding segments) are associated with the ingress-input/egress-output flows so that resequencing can be performed at the flow level. The worst-case resequencing is evaluated over all packet sizes. This, in conjunction with the fact that resequence-queue-size requirements increase with increasing packet size, implies that the results obtained apply to the case where packet sizes exceed a minimum value. In particular, in the case of a crosspoint architecture, we shall show that the worst-case resequence-queue size can be achieved by multiple flows; the minimum packet size then follows from the requirement that the cells waiting for resequencing and associated with any given one of these flows all belong to the same packet. Within the switch planes, there is a buffer for every input-output pair, also referred to as the crosspoint buffer. Crosspoint buffers of a given column correspond to the same output port and, for ease of implementation, are assumed to be served in a round-robin manner. Each crosspoint buffer distinguishes priorities and has a fixed size, which is shared among the priority classes. Let b denote the maximum occupancy of a crosspoint buffer with cells of the highest priority. Each input adapter contains VOQs corresponding to the switch output ports. Each VOQ can be fed at the aggregate rate of the ingress ports of the adapter. VOQs are served according to priorities, and according to a round-robin scheme within a priority. In one cell cycle, the plane load-balancing scheme is assumed to have the capability to dispatch one cell to each plane, i.e., up to K cells to the corresponding K planes, as shown in Fig. 1. Furthermore, these cells can be taken from the same VOQ if the remaining VOQs are empty (this is reflected by the thick lines in the figure).
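The dispatch capability just described (at most one cell per plane per cell cycle, with planes skipped when flow control withholds them) can be sketched as below. The class name and the boolean eligibility vector are illustrative stand-ins for the credit/backpressure mechanism of the paper.

```python
class PlaneDispatcher:
    """Round-robin plane load balancing: in each cell cycle, dispatch
    at most one cell to each eligible plane (illustrative sketch)."""

    def __init__(self, num_planes: int):
        self.K = num_planes
        self.ptr = 0  # round-robin pointer over the planes

    def dispatch_cycle(self, cells, eligible):
        """cells: mutable queue of cells awaiting dispatch; eligible:
        per-plane booleans reflecting flow control. Returns the list
        of (cell, plane) pairs dispatched in this cell cycle."""
        sent, used = [], set()
        for cell in list(cells):
            # advance the pointer until the first eligible, unused plane
            for _ in range(self.K):
                p = self.ptr
                self.ptr = (self.ptr + 1) % self.K
                if eligible[p] and p not in used:
                    used.add(p)
                    sent.append((cell, p))
                    cells.remove(cell)
                    break
            else:
                break  # no eligible plane left in this cycle
        return sent
```

With four eligible planes and five queued cells, one cycle dispatches four cells (one per plane) and leaves the fifth queued, matching the "up to K cells per cell cycle" capability.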
Each output adapter contains an egress output buffer of M units shared among its egress-output ports and fed from the planes. In one cell cycle, the plane server scheme can transfer one cell from each plane to an output adapter, implying a maximum rate of K cells per cell cycle, K being the number of planes. As mentioned earlier, either a credit- or a backpressure-type feedback flow-control mechanism is employed from the switch planes to the input adapters, as well as from the output adapters to the switching planes, ensuring a lossless operation. Therefore, a specific scheme has to be specified and applied when an output buffer is full, which, in turn, causes the feedback flow control to be activated. More specifically, we consider a round-robin scheme in which each of the planes periodically gets an opportunity to send a cell every K slots. In Section II-B, we comment on the efficiency of this mechanism. In addition to the buffered-crosspoint architecture depicted in Fig. 1, this paper also considers the dedicated and shared output-queued architectures. The dedicated output-queued architecture assumes a single buffer per output port that corresponds to all crosspoint buffers of the column associated with this port in the buffered-crosspoint architecture. The shared output-queued architecture assumes a single buffer for all output ports that corresponds to all crosspoint buffers. Let B denote the maximum buffer occupancy in a plane with cells of the highest priority that are destined to an output port associated with an output adapter. In the case of the buffered-crosspoint architecture, B is the sum of the maximum highest-priority occupancies of the crosspoint buffers in the corresponding column. Note that cells can join this buffer (of a given plane) at a maximum rate of one cell per input per cycle. According to the scheme described above, cells are transferred from the input to the output adapters over multiple switch planes.

Fig. 2. Maximum resequence-queue size versus output buffer size.
As a result, cells (and segments) corresponding to a given flow may arrive at an output adapter in an order different from the one in which they originally departed from the input adapter. As first-in first-out (FIFO) delivery is required, out-of-sequence cells must wait at the output adapter in a virtual resequence queue for the missing cells to arrive, so that they can be put back into proper sequence. When a missing cell arrives, it releases the corresponding resequence cells from the resequence queue. Consider an arbitrary output adapter and let R(M) denote the maximum resequence-queue size in the output buffer of size M. Clearly, it holds that R(M) <= M. For small values of M, we expect R(M) to be equal to M, as the first few out-of-sequence cells can easily fill the output buffer, leading to a deadlock. However, for sufficiently large values of M, R(M) is smaller than M and, in fact, converges to the worst-case resequence-queue size Q*, as shown in Fig. 2. Clearly, the minimum output buffer size required for safe operation is equal to the worst-case resequence-queue size incremented by one cell, i.e., Q* + 1. The objective of this work is the evaluation of this value.

B. Load-Balancing Mechanisms

Each input adapter contains a plane load-balancing function that dynamically dispatches the outgoing cells to the available planes. A cell can be dispatched to a given switch plane if the corresponding flow control permits it. A cell is said to be eligible if there is a plane that can accept it. We also refer to the planes to which an eligible cell can be dispatched as eligible planes. Within a cell cycle, there can be at most one cell dispatched to any given plane. Therefore, a high efficiency is achieved when cells are dispatched to all planes within each cell cycle. In this paper, we focus on the less complex state-independent load-balancing mechanisms, and, in particular, on the following schemes.
a) Round-Robin: A round-robin mechanism is applied to the aggregate stream of eligible cells; each cell is dispatched to the plane pointed to by a counter, which is incremented until the first eligible plane is found. b) Random: Cells are randomly dispatched to the eligible planes. c) Cyclic: This scheme is applied in switch input adapters that operate in a slotted fashion.

Fig. 3. Resequence queue.

According to this scheme, successive slots are cyclically preassigned to planes, regardless of the arrival pattern of the cells. Note that these schemes cannot exclude the possibility that an arbitrary number of successive cells of a given flow are dispatched to the same plane. Owing to this property, severe load asymmetries between planes can occur, as discussed in detail in Appendix A. At the egress side of the switch, we have considered various mechanisms for the plane server scheme. In contrast to the load-balancing schemes at the ingress side, simulations we conducted on similar schemes at the egress side revealed that more complex state-dependent plane server schemes did not result in any advantage over the round-robin scheme. Consequently, in the remainder, we assume that the plane server scheme operates in a round-robin fashion at the egress side of the switch.

C. The Resequence Queue

In the PBPS context, many flows can be simultaneously under resequencing at an arbitrary output adapter. Let us consider a specific flow and, in particular, a typical sequence of cells belonging to this flow. Fig. 3 shows a situation in which a number of cells have already arrived and been stored at the output buffer of the corresponding destination adapter, forming the resequence queue while waiting for an earlier missing cell to arrive. Let us first consider the case where the output buffer is not full, and suppose that within one cell cycle one cell was transferred to the output buffer from each of the planes, including plane 1, while the missing cell still remains queued in plane 1. The cells stemming from the other planes then join the resequence queue. The same effect occurs when the output buffer is full, albeit on a different time scale. The longest period arises when all cells in the output buffer are destined to a particular egress-output port, so that the output buffer is emptied at the rate of a single egress-output port.
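The per-flow behavior of the resequence queue described in Section II-C can be sketched as follows; the class and method names are illustrative, and a real adapter would keep one such structure per ingress-input/egress-output flow.

```python
class ResequenceQueue:
    """Hold out-of-sequence cells of one flow; release an in-order run
    as soon as the missing (principal) cell arrives."""

    def __init__(self):
        self.next_seq = 0  # sequence number the flow is waiting for
        self.held = {}     # seq -> cell: the resequence queue itself

    def arrive(self, seq, cell):
        """Accept one cell; return the cells released in order."""
        self.held[seq] = cell
        released = []
        while self.next_seq in self.held:
            released.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return released

    def occupancy(self) -> int:
        """Number of cells currently waiting for resequencing."""
        return len(self.held)
```

If cells 1 and 2 arrive before cell 0, they are held; the arrival of cell 0 (the principal missing cell) releases all three at once, which is exactly the release behavior the worst-case analysis exploits.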
In this case, the waiting cells will be transferred to the output buffer in successive cell cycles. This period constitutes an arbitration cycle, during which the plane server scheme transfers one cell from each of the planes to the output buffer. Clearly, the resequence-queue size increases by the same amount in both cases, although the arbitration cycle lasts one cell cycle in the former and several cell cycles in the latter case. This indicates that, as far as the resequence-queue size is concerned, time should be measured in relative terms (arbitration cycles) rather than in absolute terms (cell cycles). More formally, an arbitration cycle is the period it takes for the plane server scheme to check and serve (if possible) all planes, provided they are not all empty. In the remainder, unless otherwise indicated, the term cycle will refer to an arbitration cycle. Remark 1: From the above, it follows that a delay of one arbitration cycle corresponds to a resequencing delay of a bounded number of cell cycles, the longest occurring when the egress output buffer is constantly filled with cells that are all destined to the same egress-output port.

III. ANALYSIS OF MULTIPLE FLOWS

A. Worst-Case Resequencing Delay

We consider the crosspoint output-queued architecture, and describe a scenario that maximizes the resequencing delay. Let us consider multiple flows stemming from the input adapters and destined to the first output adapter, as depicted in Fig. 4 by the shaded and black cells, and let us focus on a typical tagged flow arriving at, say, the first adapter, with its cells indicated in black. At the instant a cell of the tagged flow is dispatched to plane 1, the buffers of the first plane corresponding to the first output adapter are full, as indicated in Fig. 4 by the shaded cells, which are all destined to the first egress-output port of the first output adapter. At the same time, a subsequent cell of the tagged flow is dispatched to an empty plane, say plane 2.
In the next cycle, this subsequent cell and a shaded cell will be dispatched to the output buffer. The subsequent cell will arrive at the output buffer after all shaded cells have arrived, i.e., after B cycles, B being the maximum buffer occupancy in a plane with highest-priority cells destined to an output port of the output adapter. Thus, the worst-case resequencing delay D*, experienced by this cell and expressed in arbitration cycles, is given by

D* = B,  (3)

or, by virtue of Remark 1, the corresponding number of cell cycles. Note that (3) also applies to the case of a dedicated/shared output-queued architecture. Remark 2: The worst-case resequencing delay does not depend on the number of individual ports per adapter. Note that in obtaining these results, no assumptions were made regarding the load-balancing scheme. Let us now address the following issue. Is it possible, under the deployment of a plane load-balancing scheme, to arrive at the scenario described above, where the loading asymmetry between plane 1 and any of

Fig. 4. Worst-case resequencing delay for a PBPS/crosspoint-queued switch configuration.

the remaining planes is so extreme? Note that the plane load-balancing scheme ensures that the load is equally distributed among the planes. Consequently, at first glance, this scenario, in which all the shaded cells reside in one of the planes, seems rather unlikely. Nevertheless, in Appendix A, it is shown that this scenario is feasible under any of the state-independent load-balancing schemes considered.

B. Worst-Case Resequence-Queue Size

Let us now consider a snapshot of the output resequence queue at a random cycle d, and let us assume that there are several flows under resequencing, indexed by f. A cell is characterized by the plane k it entered, its entry time t, the cycle in which it is dispatched to the egress-output buffer, its flow index f, and its sequence number. To distinguish resequence cells residing in the resequence queue from the rest of the cells, they are marked with an overline in Fig. 5. A resequence cell waits for a missing cell of the same flow; the missing cells will arrive after cycle d and are, therefore, indicated in subsequent columns of the figure. From the set of missing cells corresponding to a given flow, the one with the smallest sequence number is called the principal missing cell. As several flows are under resequencing, there is one principal missing cell per flow. These are indicated in Fig. 5 with a sequence number equal to one. Remark 3: Let a resequence cell wait for a missing cell of the same flow. This implies that the missing cell has a smaller sequence number but a later dispatch cycle. Furthermore, the two cells cannot stem from the same plane, as in all output-queued architectures considered, a cell of a flow cannot bypass, within a plane, another cell of the same flow.
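Under the reconstructed result Q* = (K - 1)B, the scenario of Fig. 4 can be replayed in a few lines: one plane holds B cells ahead of the tagged cell, the tagged flow's next cell goes to an empty plane, and each arbitration cycle moves one cell out of every nonempty plane. The function name and the fully synchronous timing are simplifying assumptions.

```python
def replay_worst_case(K: int, B: int):
    """Plane 1 holds B shaded cells ahead of a tagged cell x, while
    the remaining K - 1 planes are free and carry subsequent cells
    that bypass x. Count the bypassing cells that accumulate in the
    resequence queue before x leaves plane 1, and the delay of x."""
    ahead = B        # shaded cells queued in front of x in plane 1
    resequence = 0   # out-of-order cells waiting at the output adapter
    delay = 0        # arbitration cycles until x arrives
    while ahead > 0:
        ahead -= 1           # plane 1 delivers one shaded cell
        resequence += K - 1  # each other plane delivers a later cell
        delay += 1
    return resequence, delay
```

With K = 6 planes and B = 1024 cells, the replay accumulates 5120 resequence cells over a delay of 1024 arbitration cycles, matching (2) and (3).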
Remark 4: The dedicated/shared output-queued architecture ensures FIFO delivery because cells maintain their positions relative to one another. Thus, cells depart from a plane and enter the output adapter in the same order in which they entered the plane. In contrast, a crosspoint-queued architecture

Fig. 5. Snapshot of the output resequence queue.

with round-robin column scheduling does not necessarily guarantee a FIFO delivery. The FIFO property is guaranteed only among cells stemming from the same crosspoint buffer. Consider two cells arriving at the output adapter from a given plane, with the first having entered the plane before the second, but each entering a different crosspoint buffer. The second cell will depart prior to the first if, at the instant it enters its crosspoint buffer, the number of cells in that buffer is smaller than the number of cells in front of the first cell in its crosspoint buffer. Thus, a cell that enters a plane earlier is also dispatched to the output adapter earlier (4) for a dedicated/shared output-queued architecture and, for the crosspoint output-queued architecture considered, only for cells stemming from the same crosspoint buffer. We now proceed by defining the resequencing degree at cycle d as the maximum of the resequencing-delay cycles over the resequence cells present at cycle d, i.e.,

rho(d) = d - e(d) + 1,  (5)

where e(d) denotes the earliest dispatch cycle among the resequence cells located in the resequence queue at cycle d. Consider such a resequence cell dispatched at cycle e(d), and its corresponding principal missing cell. Clearly, at cycle e(d), the principal missing cell is stored at the buffer of its plane and remains there throughout the interval [e(d), d]. Note that the resequencing degree is bounded above by the worst-case resequencing delay (expressed in arbitration cycles), with equality holding iff two successive cells of a flow are simultaneously transferred to two planes, the buffer of the first of them being full of cells destined to the output adapter, and the buffer of the other being empty. From the above, it follows that the resequence-queue size is equal to the number of resequence cells located inside the resequence rectangle depicted in Fig. 5.
The remaining empty slots do not contribute to the resequence-queue size because they correspond either to no cell arrivals or to arrivals of cells that are not out of sequence and are, therefore, not shown. Note that the resequence-queue size at a given cycle is trivially bounded above by the maximum possible number of cells inside the resequence rectangle. Similarly, the worst-case resequence-queue size is trivially bounded above by the maximum possible number of cells inside the resequence rectangle when the length of the rows is the maximum possible, which follows by virtue of (3) and (5). This type of bound has been applied in previous works, such as [1] and [12]. We now proceed to derive tight upper bounds determining the worst-case resequence-queue size.
1) Dedicated/Shared Output-Queued Architecture:
Lemma 1: For the dedicated and shared output-queued architectures, the resequence-queue size at a given cycle is bounded above as in (6), with the equality holding iff, throughout the corresponding interval, at each cycle there are resequence cells stemming from the same set of planes.
Proof: Let us consider the missing cells and, in particular, a missing cell that has entered its plane at the earliest time compared with the others; formally, its entry time is minimal (7). It follows that it is a principal missing cell. Suppose that, of the missing cells depicted on the right-hand side of the resequence rectangle in Fig. 5, it is the one that arrived the earliest. We now consider the row of the resequence rectangle corresponding to its plane, and establish the following proposition.
Proposition 1: The slots corresponding to this row (plane) of the resequence rectangle do not contain any resequence cell.
Proof: For the purpose of contradiction, suppose that there exists a slot that contains a resequence cell which waits for this missing cell. These correspond to the cells shown

Fig. 6. Snapshot of the output resequence queue (crosspoint-queued architecture).

in Fig. 5. According to Remark 3, the corresponding ordering relations hold. The resequence cell arrives at the output adapter prior to the other cell, which by virtue of (4) implies the corresponding ordering of entry times. From definition (7), it follows that the entry time of the earliest missing cell satisfies the reverse inequality. Combining these inequalities implies that, during the same cycle, the first two cells were dispatched to their planes, and then the third cell was dispatched to its plane. This, however, leads to a contradiction, because it implies that the latter cell entered its plane earlier than the earliest missing cell. Consequently, the shaded row of Fig. 5 does not contain resequence cells.
Proposition 1 implies that the number of resequence cells in the resequence rectangle can be at most the number contributed by the remaining planes, with the equality holding iff, throughout the corresponding interval, there are resequence cells at each cycle stemming from the remaining planes. This completes the proof of the lemma.
Theorem 1: For the dedicated and shared output-queued architectures, the worst-case resequence-queue size is given by (8), achieved by a set of flows corresponding to a single input/output pair of adapters, and with a corresponding resequencing degree equal to the worst-case resequencing delay.
Proof: From (3), (5), and (6), it follows that the bound holds, with the equality holding iff equalities hold in both (5) and (6). The plane with the full buffer described in the condition of (5) corresponds to the row of Proposition 1. On the other hand, the equality in (6) implies that, throughout the corresponding interval, at each cycle there are resequence cells stemming from the remaining planes, which, in turn, excludes having a second plane with a full buffer.
This implies that all the resequence cells of the first cycle, as well as of subsequent cycles, correspond to the same flow (in terms of input/output ports), which potentially corresponds to several flows from the various ingress-input ports of an input adapter to the various egress-output ports of the output adapter. Consequently, this leads us to the unique worst-case resequence-queue-size scenario in which cells of such flows are dispatched to a plane having a full buffer, and at the same time, subsequent cells are dispatched to the remaining planes, where the buffers are empty. Note that in this scenario, the resequencing degree is equal to the worst-case resequencing delay.
Remark 5: Note that the worst-case resequence-queue size implies a maximum resequencing degree.
2) Buffered-Crosspoint Output-Queued Architecture: We begin by noting that Lemma 1, established in the case of a dedicated/shared output-queued architecture, does not necessarily hold in the case of the crosspoint output-queued architecture considered, as (4) no longer implies the required ordering, owing to the failure of the FIFO property. We therefore proceed by following a different methodology, which in the end reveals that indeed Lemma 1 does not hold. Let us consider a snapshot of the worst-case output resequence-queue size at a given cycle, as shown in Fig. 6. Consider the principal missing cells, and let us group them based on the input adapter from which they were dispatched to the planes. Suppose that a number of such groups are formed and, without loss of generality, assume that they correspond to the first input adapters. Let us also consider the principal missing cells of a given group and, in particular, the cell that was the first of these cells to enter its plane. This allows us to assign all resequence cells of the flows corresponding to this group to a single flow corresponding to that cell, without affecting the number of resequence cells.
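The unique worst-case scenario of Theorem 1 can be mimicked in a toy simulation. The parameters below (K planes, a per-plane buffer holding B cells for the tagged output) are our illustrative assumptions, not the paper's notation: one plane already holds B cells ahead of the tagged cell, the other K-1 planes are empty, and every arbitration cycle each empty plane delivers one subsequent cell of the flow, which must wait in the resequence queue.

```python
def worst_case_peak(K, B):
    """Toy model of the worst-case resequence-queue scenario.

    Cell #0 of the tagged flow joins a plane whose buffer already holds B
    cells destined to the same output, so it departs only after B
    arbitration cycles. Meanwhile, in each of those cycles the K-1 empty
    planes each deliver one subsequent cell of the flow, and every such
    cell must wait for cell #0 in the resequence queue."""
    reseq = peak = 0
    for _ in range(B):          # cycles until cell #0 leaves the full plane
        reseq += K - 1          # out-of-order arrivals from the empty planes
        peak = max(peak, reseq)
    return peak                 # equals (K - 1) * B in this toy model
```

In this sketch the peak occupancy grows linearly both in the number of planes and in the buffer size, in line with the qualitative behavior the theorem describes.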
Consequently, applying this assignment to each of the groups results in a worst-case scenario in which there is a single principal missing cell for each such group. Consider the set of these principal missing cells, and denote, for each plane, the number of cells of this set that are dispatched to it; these numbers form a vector, and the number of planes containing such cells follows accordingly. Let us now consider a plane that contains such cells. From the above, we deduce that its principal missing cells reside in the corresponding crosspoint buffers of the column of that plane. Furthermore,

the worst-case resequence-queue-size assumption implies that they reside in that plane throughout the corresponding interval. As a consequence, these crosspoints contribute cells to the corresponding row of the resequence rectangle, none of which is a resequence cell. The proof is similar to the proof given for the dedicated/shared case, with (4) now being repeatedly applied to each of the crosspoint buffers, and is therefore omitted. These cells are indicated by a dedicated symbol in Fig. 6, where particular parameter values are assumed. The number of resequence cells in the corresponding row of the region is maximized when the remaining crosspoint buffers dispatch resequence cells (also indicated in Fig. 6). Moreover, the worst-case resequence-queue-size assumption implies a high resequencing degree, which translates into a maximum length of the rows of the rectangle. This is achieved when the remaining crosspoint buffers also constantly dispatch (nonresequence) cells (likewise indicated in Fig. 6), and when, at the arrival epochs of the cells, the corresponding crosspoints are full. Owing to the round-robin service mechanism of the crosspoint buffers, each such row contains identical segments followed by the last segment, which contains the principal missing cell(s), as depicted in Fig. 6. Clearly, for a given vector, both the resequencing degree and the number of resequence cells are maximized when, for instance, the resequence cell(s) are at the beginning and the missing cell(s) are at the end of the last segment. It will later be shown that a vector that maximizes the resequence-queue size does not necessarily maximize the resequencing delay.
Remark 6: Depending on the parameter values, it may not always be possible that in all of the first rows of the last segment the cells are arranged in the form shown in Fig. 6.
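The segment structure above stems from round-robin service over the crosspoint buffers of a column. A minimal sketch (function name and setup are ours) shows the interleaving this produces and why FIFO delivery holds only within a single crosspoint buffer:

```python
from collections import deque

def round_robin_drain(crosspoints):
    """Drain a column of crosspoint FIFOs toward one output, one cell per
    arbitration slot, visiting the crosspoints in round-robin order
    (illustrative sketch). Returns the global departure order."""
    order = []
    i = 0
    n = len(crosspoints)
    while any(crosspoints):          # some crosspoint still holds cells
        if crosspoints[i]:
            order.append(crosspoints[i].popleft())  # FIFO within a crosspoint
        i = (i + 1) % n              # round-robin pointer advance
    return order
```

With two crosspoints holding ['a1', 'a2'] and ['b1'], the departure order is ['a1', 'b1', 'a2']: cell b1 overtakes a2 even if it entered its (shorter) buffer later, whereas a1 and a2, stemming from the same crosspoint, keep their relative order.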
Consequently, the expressions subsequently derived constitute upper bounds, which are tight for some special combinations of the parameters, as will be discussed below. The number of resequence cells in the corresponding row of the last segment cannot exceed the quantity depicted in Fig. 6. Furthermore, by considering all the planes, a corresponding bound follows (9). Let us now consider a plane for which the number of principal missing cells is maximal. The worst-case resequence-queue size implies that the first crosspoints constantly provide resequence cells corresponding to the resequence flows, and that the remaining crosspoints are empty, so that the corresponding row is filled with resequence cells. This is illustrated in the top two rows in Fig. 6. From the above, it follows that the resequence-queue size is bounded above as in (10), with the auxiliary quantities defined in (11) and (12). From (10), it follows that the worst-case resequence-queue size is bounded above as in (13). In order to derive this function, we first evaluate the following two tight upper bounds (14). We begin with the evaluation of the first bound. Conditioning first on one quantity and then on the other, (14) yields (15) and (16). We now establish the following lemmas.
Lemma 2: Consider the vector whose leading elements take the maximum value and whose remaining elements are equal to zero. This vector is maximal and also results in the largest resequencing degree.
Proof: See Appendix C.
Corollary 1: The bound (17) holds.
Proof: Immediate from Lemma 2 and (12).
Lemma 3: The first bound is attained under either of two parameter conditions.
Proof: See Appendix C.
Lemma 4: The second bound is attained under either of two parameter conditions.
Proof: See Appendix C.
Theorem 2: The worst-case resequence-queue size is given by (18)

for either of the conditions of Lemmas 3 and 4, each with its corresponding resequencing degree.
Proof: From (10), (14), and by combining the results and conditions given by Lemmas 3 and 4, (13) yields the stated bound. We now demonstrate that for each of these conditions, the equality holds. According to Remark 6, a sufficient condition for equality is that the cells in the first rows of the last segment are arranged in the form shown in Fig. 6. For the first condition, this is possible when the cells of each group reside in consecutive crosspoints in the corresponding plane, and, at the beginning of a segment, the round-robin mechanism serves the appropriate (modulo the number of ports) crosspoint. For the second condition, this is possible when, at the beginning of a segment, the round-robin mechanism of the plane in which the only missing cell is located serves the crosspoint following the one in which the missing cell is located. A detailed description of the corresponding worst-case scenario is provided in Appendix B. In conclusion, the derived function does indeed represent the worst-case resequence-queue size. The resequencing degree corresponding to each of the conditions is obtained from (9) by considering the maximal vector as defined by Lemma 2.
Remark 7: In contrast to the dedicated/shared output-queued architecture case, here the worst-case resequence-queue size does not necessarily imply a maximum resequencing degree. This is because when the worst-case resequence-queue size is obtained by the first of the two vectors, the corresponding resequencing degree is less than the worst-case resequencing delay. However, the resequencing degree corresponding to the worst-case resequence-queue size obtained by the second vector is equal to the worst-case resequencing delay. Note also that in the former case, Lemma 1, which was established for a dedicated/shared output-queued architecture, does not hold.
3) Comparison of the Output-Queued Architectures: From (8) and (18), it follows that the same expression applies to both a crosspoint architecture and a dedicated/shared architecture. There are, however, two differences between the dedicated/shared and the crosspoint architectures considered. First, the worst-case resequence-queue size in the former case is always achieved by a single flow, whereas in the latter case, it can also be achieved by multiple flows. Second, according to Remark 7, the worst-case resequence-queue size implies the worst-case resequencing delay only in the former case. In the latter case, the worst-case resequence-queue size does not necessarily imply a maximum resequencing delay.
Remark 8: The worst-case resequence-queue size does not depend on the remaining system parameters. Note again that in obtaining these results, no assumptions were made regarding the load-balancing scheme. As in the evaluation of the worst-case resequencing delay, it turns out that these results also apply in the case of the state-independent plane load-balancing schemes considered. The scenarios are constructed using the technique presented in Appendices A and B. This technique involves two steps: first, the creation of backpressure and backlog in the output adapter, the switch buffers, and the input adapters, by considering a hot-spot scenario; second, the consideration of appropriately synchronized arrival patterns.
C. Priorities
We consider here a system operating under the presence of multiple priorities. In this case, the results obtained in the preceding Sections III-A and III-B apply to the highest priority class. Let us now consider a lower priority flow of cells. The worst-case resequence-queue size and resequencing delay arise when a cell is dispatched to a plane and stays there indefinitely, because it is always preempted by traffic of higher priority, while at the same time, cells subsequent to it are transferred to the remaining planes, and from there to the output adapter.
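The unbounded growth this preemption causes can be seen in a toy strict-priority model. All names and parameters below are illustrative assumptions, not the paper's notation: one low-priority cell is held in a plane for as long as the high-priority busy period lasts, while its successors keep arriving via the other planes.

```python
def preempted_resequence_size(busy_cycles, K):
    """Toy strict-priority model of the low-priority worst case.

    A low-priority cell sits in one of the K planes while high-priority
    traffic preempts it for `busy_cycles` arbitration cycles; in each such
    cycle, the other K-1 planes deliver one subsequent low-priority cell
    each, and all of them wait in the resequence queue for the preempted
    cell."""
    waiting = 0
    for _ in range(busy_cycles):
        waiting += K - 1   # successors accumulate behind the missing cell
    # grows without bound as the high-priority busy period grows
    return waiting
```

Since the high-priority busy period can be arbitrarily long, the returned occupancy is unbounded regardless of any fixed output-buffer size, matching the deadlock risk described in the text.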
In this case, both the worst-case resequencing delay and the worst-case resequence-queue size are unbounded. Note also that this holds independently of the number of priorities. Consequently, the model considered so far is prone to deadlock situations under priority traffic, regardless of the size of the output buffers. From the above, it follows that additional measures are needed to ensure safe operation. One solution, for instance, could be to devise a mechanism that identifies and releases lower priority missing cells that have been blocked for a long period of time. This amounts to overriding priorities at these release instants. As a first attempt, we have simulated numerous such schemes, and found that a reduction of the maximum measured resequence-queue sizes could only be achieved by schemes of high complexity. The worst-case resequence-queue sizes associated with these schemes are the subject of further study.
IV. CONCLUSIONS
The general class of PBPSs based on a dedicated, shared, or buffered-crosspoint output-queued architecture has been considered. The issue of evaluating the minimum resequence-queue size required for a deadlock-free lossless operation in the presence of multiple flows was addressed. An analytical method was developed for evaluating the absolute worst-case resequence-queue size, independently of the load-balancing mechanism used, and it was proved that this is also the worst case for some typical load-balancing mechanisms, such as random, cyclic, and round-robin. It was demonstrated that for these mechanisms, the worst-case resequence-queue size depends only on the characteristics of the parallel switches and, in particular, on the number of their ports, the size of the internal switch buffers, and the number of parallel planes. The circumstances under which the worst-case resequence-queue size arises were identified.
Whereas earlier works derived only upper bounds of the worst-case resequence-queue size, we derived the exact values, which for a small number of planes are significantly lower than these loose upper bounds. Interestingly, in the case of the crosspoint-queued architecture with round-robin column scheduling, it is found that the resequencing delay corresponding to the exact worst-case

resequence-queue size is not necessarily the maximum possible one. Furthermore, despite some strikingly different characteristics of the output-queued architectures considered, such as the lack of the FIFO property of the crosspoint-queued architecture, the worst-case resequence-queue size is the same for all three architectures. The results obtained apply to the highest priority class only. In systems having multiple priority classes, however, additional measures are needed to ensure a safe operation. Their nature has been outlined, and is a subject of future work. A further reduction of the required resequence buffer size is possible by adopting more sophisticated (state-dependent) load-balancing mechanisms. The analysis of these schemes builds upon the analytical methods developed and the results obtained in this paper, and is a subject of ongoing investigation.
APPENDIX A
SCENARIO FOR EXTREME ASYMMETRY
We show here that the scenario with the extreme plane asymmetries considered in Section III-A is feasible under any of the state-independent load-balancing schemes considered.¹ It is based on the property that none of these schemes can exclude the possibility that successive cells of a given flow are dispatched to the same plane. While for the random scheme this is apparent, for the other two schemes, this is less so. Let us, for instance, assume a round-robin mechanism. We begin the construction of the scenario by initially considering an empty system and flows arriving at all ingress-inputs at full rate, destined to the first egress-output of the first output adapter. This hot-spot scenario eventually causes the egress-output buffer, as well as the crosspoint buffers of the first column of each plane, to fill up.²
Owing to the assumption of an expansion factor greater than or equal to one, the potential input rate is larger than the rate at which the egress-output buffer is emptied. Therefore, the egress-output buffer remains filled and exercises constant backpressure to the planes. Consequently, a period of one arbitration cycle lasts a fixed number of cell cycles, which bounds the rate at an ingress-input per arbitration cycle. We now show how one arrives at the scenario described by considering appropriately synchronized arrival patterns. Suppose, without loss of generality, that each of the round-robin input load-balancers points to the first plane. The plane asymmetry is now created by changing the arrival pattern as follows. Let us consider the arbitration cycle that begins at the instant when one cell place becomes available in the first-column buffer of the first plane, owing to a cell transfer to the output adapter.³ All flows stemming from the first input adapter are stopped, except the one between the first ingress-input port and the first egress-output port. The rate of this flow is changed to one (shaded) cell per arbitration cycle, and its first cell is transferred to the corresponding buffer of the first plane.
¹ Although this scenario refers to a crosspoint-queued architecture, a similar scenario can be constructed in the case of a dedicated/shared output-queued architecture.
² Note that when the crosspoint buffers of the switches are filled before the egress-output buffer, the input VOQs also start filling up. One then arrives at this state by allowing the egress-output buffer to fill up as well, and subsequently stopping the incoming traffic until the input VOQs become empty.
³ In the case of a cyclic load-balancing mechanism, we assume that the flow out of the output adapter is momentarily interrupted until the cell transfer occurs at a slot preassigned to the first plane.
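The state-independent round-robin dispatcher assumed throughout this construction can be sketched minimally (class and attribute names are ours). The property the scenario relies on, namely that after dispatching one cell to each of the K planes the pointer returns to the first plane, falls out directly:

```python
class RoundRobinBalancer:
    """State-independent round-robin load balancer over K planes
    (illustrative sketch)."""

    def __init__(self, K):
        self.K = K
        self.ptr = 0           # plane that receives the next cell

    def dispatch(self):
        """Return the plane for the next cell and advance the pointer."""
        plane = self.ptr
        self.ptr = (self.ptr + 1) % self.K
        return plane
```

Because the pointer state repeats with period K, repeating the same K-cycle arrival pattern reproduces the same dispatch decisions in every interval, which is exactly what the scenario exploits.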
Consider now the instant (within the same arbitration cycle) when one cell place becomes available in the corresponding buffer of the second plane, owing to a cell transfer to the output adapter. A new flow between the first ingress-input port and the last egress-output port is initiated at a matching rate. According to the dynamic state-independent schemes considered, the first cell of this flow is transferred to the corresponding buffer of the second plane, and the subsequent cells to the subsequent planes. It is now evident that at the end of the arbitration cycle, the buffer of the first plane is full, whereas the buffers of the remaining planes are no longer full (one cell place is empty). This is the first instance of asymmetry between the planes. The above sequence of arrival patterns is now successively repeated in the subsequent arbitration cycles at the remaining input adapters, such that the buffers of the first plane are full, in contrast to the buffers of the remaining planes, which are no longer full (one cell place is empty). Note also that at the end of this interval, each of the round-robin input load-balancers has dispatched cells to all planes, and therefore points to the first plane, just as it did at the beginning of the interval. The same holds in the case of the cyclic load-balancing mechanism. Repeating the scenario described above in the subsequent intervals will, therefore, result in the buffers of the first plane being always full, in contrast to the buffers of the remaining planes, which are continuously depleted until they become empty.
APPENDIX B
WORST-CASE RESEQUENCE-QUEUE SIZE
Theorem 2 states that the worst-case resequence-queue size can be achieved either by multiple flows or by a single flow. We choose here to present a scenario that builds upon the one presented in Appendix A and is based on a single flow, recognizing that a similar one can be constructed for the case of multiple flows. The corresponding condition of Theorem 2 implies that all resequence cells belong to a single flow.
This flow is assumed to correspond to the pair between the first ingress-input port and the first egress-output port, with its rate now changed to the maximum rate of (black) cells. The first, second, third, and subsequent cells of this flow are assumed to join the buffers of the first, second, third, and subsequent planes, respectively. In the following arbitration cycle, the second and subsequent cells of the flow (along with a cell from the buffer of the first plane) will be transferred to the egress-output buffer and join the resequence queue. Note also that the next cell of the flow will be dispatched to the second plane, because the buffer of the first plane is full. In this way, at each arbitration cycle, cells of the flow will be transferred from the remaining planes to the egress-output buffer. All these cells join the resequence queue, as they have to wait for the first cell of the flow, which is stored in the buffer of the first plane. This cell will be dispatched to the egress-output buffer after a number of arbitration cycles corresponding to the number of cells in the first plane upon its arrival. Consequently, the resequence-queue size will grow to the value given by (18), which is the worst case.

APPENDIX C
PROPERTIES RELATED TO THE BUFFERED-CROSSPOINT ARCHITECTURE
Proof of Lemma 2: First, we show that the vector results in the largest resequencing degree. This follows immediately from (9). To show that the vector is also maximal, it suffices to show that there is a maximal vector of this form. Suppose there is a maximal vector in which some pair of elements violates this form. We will show that the vector obtained by shifting a unit from one element of the pair to the other is also maximal. Clearly, repeatedly applying this procedure results in a maximal vector of the desired form. The following cases are considered.
1) In the first case, it follows from (12) that the resulting quantity does not decrease, and, by virtue of definition (16), the modified vector is also maximal.
2) In the second case, it likewise follows from (12) that the resulting quantity does not decrease, and, by virtue of definition (16), the modified vector is also maximal.
Proof of Lemma 3: Substituting (17) into (15) yields the first expression. We proceed by observing that the first inequality holds with equality iff the corresponding boundary condition is satisfied. Furthermore, the second inequality holds with equality iff an equivalent boundary condition is satisfied. Consequently, the combined inequality holds with equality iff either of the two conditions of the lemma is satisfied, from which the result follows.
Proof of Lemma 4: Conditioning first on one quantity and then on the other, (14) and (11) yield the stated bound, where the two terms of the maximum are equal and are obtained under the two respective conditions.
REFERENCES
[1] T. Aramaki, H. Suzuki, S.-I. Hayano, and T. Takeuchi, "Parallel ATOM switch architecture for high-speed ATM networks," in Proc. IEEE Int. Conf. Commun., Chicago, IL, Jun. 1992, vol. 1.
[2] S. Iyer, A. Awadallah, and N. McKeown, "Analysis of a packet switch with memories running slower than the line rate," in Proc. IEEE INFOCOM, Tel Aviv, Israel, Mar. 2000, vol. 2.
[3] S. Iyer and N.
McKeown, "Making parallel packet switches practical," in Proc. IEEE INFOCOM, Anchorage, AK, Apr. 2001, vol. 3.
[4] W. Wang, L. Dong, and W. Wolf, "Distributed switch architecture with dynamic load-balancing and parallel input-queued crossbars for terabit switch fabrics," in Proc. IEEE INFOCOM, New York, NY, Jun. 2002, vol. 1.
[5] F. Abel, C. Minkenberg, R. P. Luijten, M. Gusat, and I. Iliadis, "A four-terabit packet switch supporting long round-trip times," IEEE Micro, vol. 23, no. 1, Jan./Feb.
[6] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, "Low-cost scalable switching solutions for broadband networking: The ATLANTA architecture and chipset," IEEE Commun. Mag., vol. 35, no. 3, Mar.
[7] M. A. Henrion, G. J. Eilenberger, G. H. Petit, and P. H. Parmentier, "Multipath self-routing switch," IEEE Commun. Mag., vol. 31, no. 4, Apr.
[8] I. Iliadis and Y.-C. Lien, "Resequencing in distributed systems with multiple classes," in Proc. IEEE INFOCOM, New Orleans, LA, Mar. 1988.
[9] I. Iliadis and L. Y.-C. Lien, "Resequencing delay for a queueing system with two heterogeneous servers under a threshold-type scheduling," IEEE Trans. Commun., vol. 36, no. 6, Jun.
[10] I. Iliadis and L. Y.-C. Lien, "Resequencing control for a queueing system with two heterogeneous servers," IEEE Trans. Commun., vol. 41, no. 6, Jun.
[11] N. Gogate and S. S. Panwar, "On a resequencing model for high speed networks," in Proc. IEEE INFOCOM, Toronto, ON, Canada, Jun. 1994, vol. 1.
[12] H. J. Chao and J. S. Park, "Architecture designs of a large-capacity Abacus ATM switch," in Proc. IEEE GLOBECOM, Sydney, Australia, Nov. 1998, vol. 1.
Ilias Iliadis (S'84-M'88-SM'99) received the B.S. degree in electrical engineering in 1983 from the National Technical University of Athens, Athens, Greece, the M.S. degree in 1984 from Columbia University, New York, NY, as a Fulbright Scholar, and the Ph.D. degree in electrical engineering in 1988, also from Columbia University.
He has been with the IBM Zurich Research Laboratory, Rüschlikon, Switzerland, where he was responsible for the performance evaluation of IBM's PRIZMA switch. His research interests include performance evaluation, optimization and control of computer communication networks and storage systems, traffic control and engineering for networks, switch architectures, and stochastic systems. He holds several patents.
Dr. Iliadis is a member of IFIP Working Group 6.3, Sigma Xi, and the Technical Chamber of Greece. He served as a Technical Program Co-Chair for the IFIP Networking 2004 Conference.

Wolfgang E. Denzel received the M.S. and Ph.D. degrees in electrical engineering from Stuttgart University, Stuttgart, Germany, in 1979 and 1986, respectively.
Since 1985, he has been a Researcher at the IBM Zurich Research Laboratory, Rüschlikon, Switzerland, where he was responsible for the architectural design and performance evaluation of IBM's PRIZMA switch. He has worked on system aspects of ATM-based corporate networks and corporate optical networks; in these fields, he participated in several European RACE projects and coordinated the ACTS COBNET project. His recent interests are in server interconnection networks, congestion control in lossless networks, and end-to-end simulation techniques for high-performance computing systems. He holds several patents.


More information

The GLIMPS Terabit Packet Switching Engine

The GLIMPS Terabit Packet Switching Engine February 2002 The GLIMPS Terabit Packet Switching Engine I. Elhanany, O. Beeri Terabit Packet Switching Challenges The ever-growing demand for additional bandwidth reflects on the increasing capacity requirements

More information

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 8, NO. 6, DECEMBER 2000 747 A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks Yuhong Zhu, George N. Rouskas, Member,

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Analysis of Binary Adjustment Algorithms in Fair Heterogeneous Networks

Analysis of Binary Adjustment Algorithms in Fair Heterogeneous Networks Analysis of Binary Adjustment Algorithms in Fair Heterogeneous Networks Sergey Gorinsky Harrick Vin Technical Report TR2000-32 Department of Computer Sciences, University of Texas at Austin Taylor Hall

More information

13 Sensor networks Gathering in an adversarial environment

13 Sensor networks Gathering in an adversarial environment 13 Sensor networks Wireless sensor systems have a broad range of civil and military applications such as controlling inventory in a warehouse or office complex, monitoring and disseminating traffic conditions,

More information

Unit 2 Packet Switching Networks - II

Unit 2 Packet Switching Networks - II Unit 2 Packet Switching Networks - II Dijkstra Algorithm: Finding shortest path Algorithm for finding shortest paths N: set of nodes for which shortest path already found Initialization: (Start with source

More information

Shared-Memory Combined Input-Crosspoint Buffered Packet Switch for Differentiated Services

Shared-Memory Combined Input-Crosspoint Buffered Packet Switch for Differentiated Services Shared-Memory Combined -Crosspoint Buffered Packet Switch for Differentiated Services Ziqian Dong and Roberto Rojas-Cessa Department of Electrical and Computer Engineering New Jersey Institute of Technology

More information

DiffServ Architecture: Impact of scheduling on QoS

DiffServ Architecture: Impact of scheduling on QoS DiffServ Architecture: Impact of scheduling on QoS Abstract: Scheduling is one of the most important components in providing a differentiated service at the routers. Due to the varying traffic characteristics

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Stop-and-Go Service Using Hierarchical Round Robin

Stop-and-Go Service Using Hierarchical Round Robin Stop-and-Go Service Using Hierarchical Round Robin S. Keshav AT&T Bell Laboratories 600 Mountain Avenue, Murray Hill, NJ 07974, USA keshav@research.att.com Abstract The Stop-and-Go service discipline allows

More information

Optical Packet Switching

Optical Packet Switching Optical Packet Switching DEISNet Gruppo Reti di Telecomunicazioni http://deisnet.deis.unibo.it WDM Optical Network Legacy Networks Edge Systems WDM Links λ 1 λ 2 λ 3 λ 4 Core Nodes 2 1 Wavelength Routing

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Ch 4 : CPU scheduling

Ch 4 : CPU scheduling Ch 4 : CPU scheduling It's the basis of multiprogramming operating systems. By switching the CPU among processes, the operating system can make the computer more productive In a single-processor system,

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) The Internet was originally designed for best-effort service without guarantee of predictable performance. Best-effort service is often sufficient for a traffic that is not sensitive

More information

From Static to Dynamic Routing: Efficient Transformations of Store-and-Forward Protocols

From Static to Dynamic Routing: Efficient Transformations of Store-and-Forward Protocols SIAM Journal on Computing to appear From Static to Dynamic Routing: Efficient Transformations of StoreandForward Protocols Christian Scheideler Berthold Vöcking Abstract We investigate how static storeandforward

More information

Multicast Traffic in Input-Queued Switches: Optimal Scheduling and Maximum Throughput

Multicast Traffic in Input-Queued Switches: Optimal Scheduling and Maximum Throughput IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 11, NO 3, JUNE 2003 465 Multicast Traffic in Input-Queued Switches: Optimal Scheduling and Maximum Throughput Marco Ajmone Marsan, Fellow, IEEE, Andrea Bianco,

More information

Experimental Extensions to RSVP Remote Client and One-Pass Signalling

Experimental Extensions to RSVP Remote Client and One-Pass Signalling 1 Experimental Extensions to RSVP Remote Client and One-Pass Signalling Industrial Process and System Communications, Darmstadt University of Technology Merckstr. 25 D-64283 Darmstadt Germany Martin.Karsten@KOM.tu-darmstadt.de

More information

Topic 4a Router Operation and Scheduling. Ch4: Network Layer: The Data Plane. Computer Networking: A Top Down Approach

Topic 4a Router Operation and Scheduling. Ch4: Network Layer: The Data Plane. Computer Networking: A Top Down Approach Topic 4a Router Operation and Scheduling Ch4: Network Layer: The Data Plane Computer Networking: A Top Down Approach 7 th edition Jim Kurose, Keith Ross Pearson/Addison Wesley April 2016 4-1 Chapter 4:

More information

EXTREME POINTS AND AFFINE EQUIVALENCE

EXTREME POINTS AND AFFINE EQUIVALENCE EXTREME POINTS AND AFFINE EQUIVALENCE The purpose of this note is to use the notions of extreme points and affine transformations which are studied in the file affine-convex.pdf to prove that certain standard

More information

Scaling Internet Routers Using Optics Producing a 100TB/s Router. Ashley Green and Brad Rosen February 16, 2004

Scaling Internet Routers Using Optics Producing a 100TB/s Router. Ashley Green and Brad Rosen February 16, 2004 Scaling Internet Routers Using Optics Producing a 100TB/s Router Ashley Green and Brad Rosen February 16, 2004 Presentation Outline Motivation Avi s Black Box Black Box: Load Balance Switch Conclusion

More information

A Reduction of Conway s Thrackle Conjecture

A Reduction of Conway s Thrackle Conjecture A Reduction of Conway s Thrackle Conjecture Wei Li, Karen Daniels, and Konstantin Rybnikov Department of Computer Science and Department of Mathematical Sciences University of Massachusetts, Lowell 01854

More information

Title: Proposed modifications to Performance Testing Baseline: Throughput and Latency Metrics

Title: Proposed modifications to Performance Testing Baseline: Throughput and Latency Metrics 1 ATM Forum Document Number: ATM_Forum/97-0426. Title: Proposed modifications to Performance Testing Baseline: Throughput and Latency Metrics Abstract: This revised text of the baseline includes better

More information

Algorithms for Provisioning Virtual Private Networks in the Hose Model

Algorithms for Provisioning Virtual Private Networks in the Hose Model IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 10, NO 4, AUGUST 2002 565 Algorithms for Provisioning Virtual Private Networks in the Hose Model Amit Kumar, Rajeev Rastogi, Avi Silberschatz, Fellow, IEEE, and

More information

Configuring QoS. Finding Feature Information. Prerequisites for QoS. General QoS Guidelines

Configuring QoS. Finding Feature Information. Prerequisites for QoS. General QoS Guidelines Finding Feature Information, on page 1 Prerequisites for QoS, on page 1 Restrictions for QoS, on page 2 Information About QoS, on page 2 How to Configure QoS, on page 10 Monitoring Standard QoS, on page

More information

588 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 4, AUGUST 1999

588 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 4, AUGUST 1999 588 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 7, NO 4, AUGUST 1999 The Deflection Self-Routing Banyan Network: A Large-Scale ATM Switch Using the Fully Adaptive Self-Routing its Performance Analyses Jae-Hyun

More information

Performance of UMTS Radio Link Control

Performance of UMTS Radio Link Control Performance of UMTS Radio Link Control Qinqing Zhang, Hsuan-Jung Su Bell Laboratories, Lucent Technologies Holmdel, NJ 77 Abstract- The Radio Link Control (RLC) protocol in Universal Mobile Telecommunication

More information

Technical Notes. QoS Features on the Business Ethernet Switch 50 (BES50)

Technical Notes. QoS Features on the Business Ethernet Switch 50 (BES50) Technical Notes QoS Features on the Business Ethernet Switch 50 (BES50) Version: NN70000-004 issue 1.00 Date: February 3 rd, 2009 Status: Released Copyright 2009 Nortel Networks. All rights reserved. The

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Shared-Memory Combined Input-Crosspoint Buffered Packet Switch for Differentiated Services

Shared-Memory Combined Input-Crosspoint Buffered Packet Switch for Differentiated Services Shared-Memory Combined -Crosspoint Buffered Packet Switch for Differentiated Services Ziqian Dong and Roberto Rojas-Cessa Department of Electrical and Computer Engineering New Jersey Institute of Technology

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

ACONCURRENT system may be viewed as a collection of

ACONCURRENT system may be viewed as a collection of 252 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 10, NO. 3, MARCH 1999 Constructing a Reliable Test&Set Bit Frank Stomp and Gadi Taubenfeld AbstractÐThe problem of computing with faulty

More information

Latency on a Switched Ethernet Network

Latency on a Switched Ethernet Network FAQ 07/2014 Latency on a Switched Ethernet Network RUGGEDCOM Ethernet Switches & Routers http://support.automation.siemens.com/ww/view/en/94772587 This entry is from the Siemens Industry Online Support.

More information

H3C S9500 QoS Technology White Paper

H3C S9500 QoS Technology White Paper H3C Key words: QoS, quality of service Abstract: The Ethernet technology is widely applied currently. At present, Ethernet is the leading technology in various independent local area networks (LANs), and

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

FIRM: A Class of Distributed Scheduling Algorithms for High-speed ATM Switches with Multiple Input Queues

FIRM: A Class of Distributed Scheduling Algorithms for High-speed ATM Switches with Multiple Input Queues FIRM: A Class of Distributed Scheduling Algorithms for High-speed ATM Switches with Multiple Input Queues D.N. Serpanos and P.I. Antoniadis Department of Computer Science University of Crete Knossos Avenue

More information

Chapter III. congestion situation in Highspeed Networks

Chapter III. congestion situation in Highspeed Networks Chapter III Proposed model for improving the congestion situation in Highspeed Networks TCP has been the most used transport protocol for the Internet for over two decades. The scale of the Internet and

More information

Providing Flow Based Performance Guarantees for Buffered Crossbar Switches

Providing Flow Based Performance Guarantees for Buffered Crossbar Switches Providing Flow Based Performance Guarantees for Buffered Crossbar Switches Deng Pan Dept. of Electrical & Computer Engineering Florida International University Miami, Florida 33174, USA pand@fiu.edu Yuanyuan

More information

36 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 1, MARCH 2008

36 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 1, MARCH 2008 36 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 1, MARCH 2008 Continuous-Time Collaborative Prefetching of Continuous Media Soohyun Oh, Beshan Kulapala, Andréa W. Richa, and Martin Reisslein Abstract

More information

Lecture (08, 09) Routing in Switched Networks

Lecture (08, 09) Routing in Switched Networks Agenda Lecture (08, 09) Routing in Switched Networks Dr. Ahmed ElShafee Routing protocols Fixed Flooding Random Adaptive ARPANET Routing Strategies ١ Dr. Ahmed ElShafee, ACU Fall 2011, Networks I ٢ Dr.

More information

Latency on a Switched Ethernet Network

Latency on a Switched Ethernet Network Page 1 of 6 1 Introduction This document serves to explain the sources of latency on a switched Ethernet network and describe how to calculate cumulative latency as well as provide some real world examples.

More information

1 Introduction

1 Introduction Published in IET Communications Received on 17th September 2009 Revised on 3rd February 2010 ISSN 1751-8628 Multicast and quality of service provisioning in parallel shared memory switches B. Matthews

More information

Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks

Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks Dr. Vinod Vokkarane Assistant Professor, Computer and Information Science Co-Director, Advanced Computer Networks Lab University

More information

Chapter 24 Congestion Control and Quality of Service 24.1

Chapter 24 Congestion Control and Quality of Service 24.1 Chapter 24 Congestion Control and Quality of Service 24.1 Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 24-1 DATA TRAFFIC The main focus of congestion control

More information

However, this is not always true! For example, this fails if both A and B are closed and unbounded (find an example).

However, this is not always true! For example, this fails if both A and B are closed and unbounded (find an example). 98 CHAPTER 3. PROPERTIES OF CONVEX SETS: A GLIMPSE 3.2 Separation Theorems It seems intuitively rather obvious that if A and B are two nonempty disjoint convex sets in A 2, then there is a line, H, separating

More information

QUALITY of SERVICE. Introduction

QUALITY of SERVICE. Introduction QUALITY of SERVICE Introduction There are applications (and customers) that demand stronger performance guarantees from the network than the best that could be done under the circumstances. Multimedia

More information

Crossbar - example. Crossbar. Crossbar. Combination: Time-space switching. Simple space-division switch Crosspoints can be turned on or off

Crossbar - example. Crossbar. Crossbar. Combination: Time-space switching. Simple space-division switch Crosspoints can be turned on or off Crossbar Crossbar - example Simple space-division switch Crosspoints can be turned on or off i n p u t s sessions: (,) (,) (,) (,) outputs Crossbar Advantages: simple to implement simple control flexible

More information

Thwarting Traceback Attack on Freenet

Thwarting Traceback Attack on Freenet Thwarting Traceback Attack on Freenet Guanyu Tian, Zhenhai Duan Florida State University {tian, duan}@cs.fsu.edu Todd Baumeister, Yingfei Dong University of Hawaii {baumeist, yingfei}@hawaii.edu Abstract

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information

Network Working Group Request for Comments: 1046 ISI February A Queuing Algorithm to Provide Type-of-Service for IP Links

Network Working Group Request for Comments: 1046 ISI February A Queuing Algorithm to Provide Type-of-Service for IP Links Network Working Group Request for Comments: 1046 W. Prue J. Postel ISI February 1988 A Queuing Algorithm to Provide Type-of-Service for IP Links Status of this Memo This memo is intended to explore how

More information

Dynamic Scheduling Algorithm for input-queued crossbar switches

Dynamic Scheduling Algorithm for input-queued crossbar switches Dynamic Scheduling Algorithm for input-queued crossbar switches Mihir V. Shah, Mehul C. Patel, Dinesh J. Sharma, Ajay I. Trivedi Abstract Crossbars are main components of communication switches used to

More information

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 9, NO. 3, APRIL

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 9, NO. 3, APRIL IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 9, NO. 3, APRIL 2007 629 Video Packet Selection and Scheduling for Multipath Streaming Dan Jurca, Student Member, IEEE, and Pascal Frossard, Senior Member, IEEE Abstract

More information

UNIT-II OVERVIEW OF PHYSICAL LAYER SWITCHING & MULTIPLEXING

UNIT-II OVERVIEW OF PHYSICAL LAYER SWITCHING & MULTIPLEXING 1 UNIT-II OVERVIEW OF PHYSICAL LAYER SWITCHING & MULTIPLEXING Syllabus: Physical layer and overview of PL Switching: Multiplexing: frequency division multiplexing, wave length division multiplexing, synchronous

More information

An Approach to Task Attribute Assignment for Uniprocessor Systems

An Approach to Task Attribute Assignment for Uniprocessor Systems An Approach to ttribute Assignment for Uniprocessor Systems I. Bate and A. Burns Real-Time Systems Research Group Department of Computer Science University of York York, United Kingdom e-mail: fijb,burnsg@cs.york.ac.uk

More information

Scheduling Unsplittable Flows Using Parallel Switches

Scheduling Unsplittable Flows Using Parallel Switches Scheduling Unsplittable Flows Using Parallel Switches Saad Mneimneh, Kai-Yeung Siu Massachusetts Institute of Technology 77 Massachusetts Avenue Room -07, Cambridge, MA 039 Abstract We address the problem

More information

Generic Architecture. EECS 122: Introduction to Computer Networks Switch and Router Architectures. Shared Memory (1 st Generation) Today s Lecture

Generic Architecture. EECS 122: Introduction to Computer Networks Switch and Router Architectures. Shared Memory (1 st Generation) Today s Lecture Generic Architecture EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California,

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

PERSONAL communications service (PCS) provides

PERSONAL communications service (PCS) provides 646 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 5, NO. 5, OCTOBER 1997 Dynamic Hierarchical Database Architecture for Location Management in PCS Networks Joseph S. M. Ho, Member, IEEE, and Ian F. Akyildiz,

More information

Implementing Access Lists and Prefix Lists

Implementing Access Lists and Prefix Lists An access control list (ACL) consists of one or more access control entries (ACE) that collectively define the network traffic profile. This profile can then be referenced by Cisco IOS XR softwarefeatures

More information

RECHOKe: A Scheme for Detection, Control and Punishment of Malicious Flows in IP Networks

RECHOKe: A Scheme for Detection, Control and Punishment of Malicious Flows in IP Networks > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < : A Scheme for Detection, Control and Punishment of Malicious Flows in IP Networks Visvasuresh Victor Govindaswamy,

More information

3186 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 9, SEPTEMBER Zero/Positive Capacities of Two-Dimensional Runlength-Constrained Arrays

3186 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 9, SEPTEMBER Zero/Positive Capacities of Two-Dimensional Runlength-Constrained Arrays 3186 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 51, NO 9, SEPTEMBER 2005 Zero/Positive Capacities of Two-Dimensional Runlength-Constrained Arrays Tuvi Etzion, Fellow, IEEE, and Kenneth G Paterson, Member,

More information

Introduction to Real-Time Communications. Real-Time and Embedded Systems (M) Lecture 15

Introduction to Real-Time Communications. Real-Time and Embedded Systems (M) Lecture 15 Introduction to Real-Time Communications Real-Time and Embedded Systems (M) Lecture 15 Lecture Outline Modelling real-time communications Traffic and network models Properties of networks Throughput, delay

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Resource Reservation Protocol

Resource Reservation Protocol 48 CHAPTER Chapter Goals Explain the difference between and routing protocols. Name the three traffic types supported by. Understand s different filter and style types. Explain the purpose of tunneling.

More information

Sections Describing Standard Software Features

Sections Describing Standard Software Features 30 CHAPTER This chapter describes how to configure quality of service (QoS) by using automatic-qos (auto-qos) commands or by using standard QoS commands. With QoS, you can give preferential treatment to

More information

Sections Describing Standard Software Features

Sections Describing Standard Software Features 27 CHAPTER This chapter describes how to configure quality of service (QoS) by using automatic-qos (auto-qos) commands or by using standard QoS commands. With QoS, you can give preferential treatment to

More information

1188 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 13, NO. 5, OCTOBER Wei Sun, Student Member, IEEE, and Kang G. Shin, Fellow, IEEE

1188 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 13, NO. 5, OCTOBER Wei Sun, Student Member, IEEE, and Kang G. Shin, Fellow, IEEE 1188 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 13, NO. 5, OCTOBER 2005 End-to-End Delay Bounds for Traffic Aggregates Under Guaranteed-Rate Scheduling Algorithms Wei Sun, Student Member, IEEE, and Kang

More information

Network Layer Enhancements

Network Layer Enhancements Network Layer Enhancements EECS 122: Lecture 14 Department of Electrical Engineering and Computer Sciences University of California Berkeley Today We have studied the network layer mechanisms that enable

More information

CS 556 Advanced Computer Networks Spring Solutions to Midterm Test March 10, YOUR NAME: Abraham MATTA

CS 556 Advanced Computer Networks Spring Solutions to Midterm Test March 10, YOUR NAME: Abraham MATTA CS 556 Advanced Computer Networks Spring 2011 Solutions to Midterm Test March 10, 2011 YOUR NAME: Abraham MATTA This test is closed books. You are only allowed to have one sheet of notes (8.5 11 ). Please

More information

Resource Sharing for QoS in Agile All Photonic Networks

Resource Sharing for QoS in Agile All Photonic Networks Resource Sharing for QoS in Agile All Photonic Networks Anton Vinokurov, Xiao Liu, Lorne G Mason Department of Electrical and Computer Engineering, McGill University, Montreal, Canada, H3A 2A7 E-mail:

More information

Wireless Networks (CSC-7602) Lecture 8 (15 Oct. 2007)

Wireless Networks (CSC-7602) Lecture 8 (15 Oct. 2007) Wireless Networks (CSC-7602) Lecture 8 (15 Oct. 2007) Seung-Jong Park (Jay) http://www.csc.lsu.edu/~sjpark 1 Today Wireline Fair Schedulling Why? Ideal algorithm Practical algorithms Wireless Fair Scheduling

More information

Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide

Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide Dell PowerVault MD3600f/MD3620f Remote Replication Functional Guide Page i THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT

More information

Precomputation Schemes for QoS Routing

Precomputation Schemes for QoS Routing 578 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 11, NO. 4, AUGUST 2003 Precomputation Schemes for QoS Routing Ariel Orda, Senior Member, IEEE, and Alexander Sprintson, Student Member, IEEE Abstract Precomputation-based

More information

Application Layer Multicast Algorithm

Application Layer Multicast Algorithm Application Layer Multicast Algorithm Sergio Machado Universitat Politècnica de Catalunya Castelldefels Javier Ozón Universitat Politècnica de Catalunya Castelldefels Abstract This paper presents a multicast

More information

Chapter 4 Network Layer: The Data Plane

Chapter 4 Network Layer: The Data Plane Chapter 4 Network Layer: The Data Plane A note on the use of these Powerpoint slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you see

More information

THE TRANSPORT LAYER UNIT IV

THE TRANSPORT LAYER UNIT IV THE TRANSPORT LAYER UNIT IV The Transport Layer: The Transport Service, Elements of Transport Protocols, Congestion Control,The internet transport protocols: UDP, TCP, Performance problems in computer

More information

NOTE: The S9500E switch series supports HDLC encapsulation only on POS interfaces. Enabling HDLC encapsulation on an interface

NOTE: The S9500E switch series supports HDLC encapsulation only on POS interfaces. Enabling HDLC encapsulation on an interface Contents Configuring HDLC 1 Overview 1 HDLC frame format and frame type 1 Enabling HDLC encapsulation on an interface 1 Configuring an IP address for an interface 2 Configuring the link status polling

More information

Configuring Fabric Congestion Control and QoS

Configuring Fabric Congestion Control and QoS CHAPTER 1 Fibre Channel Congestion Control (FCC) is a Cisco proprietary flow control mechanism that alleviates congestion on Fibre Channel networks. Quality of service () offers the following advantages:

More information

4.2 Virtual Circuit and Datagram Networks

4.2 Virtual Circuit and Datagram Networks 4.2 VIRTUAL CIRCUIT AND DATAGRAM NETWORKS 313 Available bit rate (ABR) ATM network service. With the Internet offering socalled best-effort service, ATM s ABR might best be characterized as being a slightly-better-than-best-effort

More information

On The Complexity of Virtual Topology Design for Multicasting in WDM Trees with Tap-and-Continue and Multicast-Capable Switches

On The Complexity of Virtual Topology Design for Multicasting in WDM Trees with Tap-and-Continue and Multicast-Capable Switches On The Complexity of Virtual Topology Design for Multicasting in WDM Trees with Tap-and-Continue and Multicast-Capable Switches E. Miller R. Libeskind-Hadas D. Barnard W. Chang K. Dresner W. M. Turner

More information

Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS

Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS Chapter -5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS Chapter 5 QUALITY OF SERVICE (QOS) PLATFORM DESIGN FOR REAL TIME MULTIMEDIA APPLICATIONS 5.1 Introduction For successful

More information