A CLASS OF SHAPED DEFICIT ROUND-ROBIN (SDRR) SCHEDULERS


The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering

A CLASS OF SHAPED DEFICIT ROUND-ROBIN (SDRR) SCHEDULERS

A Thesis in Computer Science and Engineering
by
Soranun Jiwasurat

© 2005 Soranun Jiwasurat

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2005

The thesis of Soranun Jiwasurat has been reviewed and approved* by the following:

George Kesidis, Associate Professor of Electrical Engineering and Computer Science and Engineering, Thesis Adviser, Chair of Committee
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering
Guohong Cao, Associate Professor of Computer Science and Engineering
Peng Liu, Assistant Professor of Information Sciences and Technology, Director, Cyber Security Lab
Natarajan Gautam, Associate Professor of Industrial Engineering

*Signatures are on file in the Graduate School.

Abstract

Scheduling mechanisms are deployed in the egress linecards of modern Internet routers to control dequeue from packet memory. Scheduling is required because the egress link fed by packet memory may be channelized (partitioned) and individual flows within each channel may have different priorities or explicit bandwidth allotments. At extremely high data rates, schedulers must have extremely low computational complexity and be able to handle flows of variable-length packets. Existing examples of such schedulers include Deficit Round-Robin (DRR). In addition, it may be desirable to limit the variation about the peak rate (jitter or burstiness) of certain egress flows (e.g., those under expedited forwarding (EF) service) so that their packets are not deemed out-of-profile downstream and possibly dropped as a result.

In this research, a scheduler called Shaped Deficit Round-Robin (SDRR) is proposed based on a multiple time-scale decomposition using Shaped Weighted Round-Robin (SWRR) and Deficit Round-Robin (DRR) ideas, wherein the deficit counter itself is used for shaping instead of a separate leaky bucket, resulting in O(1) per-packet computational complexity. SDRR is proved to have a strict and precisely defined per-flow shaping property, making it unique among existing DRR schedulers. As such, the SDRR scheduler is suitable for very high speed multiplexing of flows of variable-length packets.

Although SDRR is implementable at high speeds and has a strict and precisely defined per-flow shaping property, flows, especially those with larger bandwidth allocations, tend to receive less bandwidth than allocated. To mitigate this bandwidth under-allocation while maintaining the ability to limit per-flow burstiness, a Hierarchical Shaped Deficit Round-Robin (HSDRR) is defined. HSDRR is a hybrid round-robin/time-stamp scheduler that employs SDRR scheduling in the first stage and Shaped Virtual Clock (SVC) in the second and final stage. Moreover, feasible varieties of SDRR that we have considered, to achieve better packet latency performance or to be partially work-conserving, are described.

Table of Contents

List of Tables
List of Figures
List of Acronyms
Acknowledgments

Chapter 1. Introduction
  1.1 Motivation
  1.2 Traffic Descriptor
  1.3 Overview of Deficit Round-Robin (DRR)
  1.4 Review of Previous Work on Per-flow Scheduling
    1.4.1 Time-stamps per packet based schedulers
    1.4.2 Round-robin based schedulers
  1.5 Problem Statement
  1.6 Organization of thesis

Chapter 2. Background on Shaped Weighted Round-Robin (SWRR) scheduling
  2.1 Shaped Virtual Clock
  2.2 Non-work-conserving Shaped WRR (SWRR)

Chapter 3. Shaped Deficit Round-Robin (SDRR)
  3.1 Introduction
  3.2 Shaped Deficit Round-Robin algorithm
  3.3 Performance Analysis
    3.3.1 Output jitter performance of SDRR
    3.3.2 Worst-case latency performance of SDRR
  3.4 Alternate versions of SDRR
    3.4.1 SDRR without the knowledge of transmission queue occupancy
    3.4.2 SDRR without backpressure from the transmission queue
  3.5 Simulation results
    3.5.1 Weighted throughput performance
    3.5.2 Output jitter performance
    3.5.3 Worst-case packet latency performance
  3.6 Conclusions

Chapter 4. Shaped Deficit Round-Robin with Declared Output Jitter Bounds per flow (SDRR-σ)
  4.1 Introduction
  4.2 SDRR-σ Algorithm
  4.3 Output jitter performance
  4.4 Simulation results
  4.5 Conclusions

Chapter 5. Hierarchical Shaped Deficit Round-Robin (HSDRR)
  5.1 Introduction
  5.2 Performance of Shaped VirtualClock on Flows of Variable-Length Packets
    5.2.1 Output-jitter performance of SVC
  5.3 Hierarchical Shaped Deficit Round-Robin
    5.3.1 Choice of flow clustering parameters
    5.3.2 Flow clustering
  5.4 Performance comparison of HSDRR with SDRR and SRR
  5.5 Conclusions

Chapter 6. Conclusions and Future Work
  6.1 Conclusions
  6.2 Future work
    6.2.1 Feasible varieties of SDRR
      6.2.1.1 A hierarchical version of SDRR
      6.2.1.2 A partially work-conserving version of SDRR
    6.2.2 Flow clustering strategy

References

List of Tables

2.1 Example of flow weight parameters with no overbooking condition
4.1 Traffic parameters for SDRR-σ simulations
5.1 Example of per-flow output jitter bound parameters for HSDRR simulations

List of Figures

1.1 A token bucket traffic shaper
1.2 Multiplexing N packet flows
2.1 A buffered multiplexer
2.2 Timing diagram of a SWRR scheduler for a high-speed implementation
2.3 Example of an f = 35-slot SWRR frame
3.1 A buffered multiplexer and a transmission queue for variable-length packets
3.2 A description of the SDRR algorithm
3.3 Rule for scheduling out the next packet at time r′ under SDRR
3.4 Tributary queue j's packets departing the transmission queue
3.5 A description of the SDRR without the knowledge of transmission queue occupancy (SDRRb) algorithm
3.6 Plot of SDRRa weighted throughput
3.7 Plot of SDRRb weighted throughput
3.8 Plot of SDRRc weighted throughput
3.9 Plot of weighted throughput of different SDRR schedulers under the Internet packet length distribution [42]
3.10 Plot of SDRRa maximum output jitter
3.11 Plot of SDRRb maximum output jitter
3.12 Plot of SDRRc maximum output jitter for flows of constant-length 41-byte and 1500-byte packets
3.13 Plot of SDRRc maximum output jitter for flows of the Internet packet length distribution
3.14 Plot of SDRRa packet latency
3.15 Plot of SDRRb packet latency
3.16 Plot of SDRRc packet latency
4.1 A description of the SDRR-σ algorithm
4.2 Plot of SDRR-σ's weighted throughput for flows of constant-length 1500-byte packets
4.3 Plot of SDRR-σ's maximum output jitter for flows of constant-length 1500-byte packets
4.4 Plot of SDRR-σ's packet latency for flows of constant-length 1500-byte packets
4.5 Plot of SDRR-σ's weighted throughput for set σ_4
4.6 Plot of SDRR-σ's weighted throughput for flows of constant-length 40-byte packets
5.1 Hierarchical SDRR (HSDRR): a basic building block of queues
5.2 Example of HSDRR scheduler acting on flows clusterized based on bandwidth allocation (ρ) parameters in Table 5.1
5.3 Example of HSDRR scheduler acting on flows clusterized based on the ratio of per-flow output jitter bound to weighted allocation (σ/ρ) parameters in Table 5.1
5.4 Plot of weighted throughput for flows of the Internet packet length distribution for SDRR, HSDRR-ρ, and HSDRR-σ/ρ
5.5 Plot of weighted throughput for flows of the Internet packet length distribution for SDRR, SRR and HSDRR
5.6 Plot of maximum output burstiness for flows of the Internet packet length distribution for SDRR, SRR and HSDRR

List of Acronyms

ATM        Asynchronous Transfer Mode
Bps        Bytes Per Second
BSFQ       Bin Sort Fair Queueing (algorithm)
DRR        Deficit Round-Robin (algorithm)
EF         Expedited Forwarding (service)
FCFS       First Come First Serve (a.k.a. FIFO)
FIFO       First In First Out
Gbps       Gigabits Per Second
HSDRR      Hierarchical Shaped Deficit Round-Robin (algorithm)
HSDRR-ρ    HSDRR acting on flows clusterized based on bandwidth allocation parameters
HSDRR-σ/ρ  HSDRR acting on flows clusterized based on ratio of per-flow output jitter bound to bandwidth allocation parameters
HOL        Head-of-Line
IETF       The Internet Engineering Task Force
IP         Internet Protocol
LER        Label-Edge Router
LFVC       Leap Forward Virtual Clock (algorithm)
LSP        Label-Switched Path
LSR        Label Switching Router
MBS        Maximum Burst Size
MPLS       Multi Protocol Label Switching
MSS        Monolithic Shaper-Scheduler
PIR        Peak Information Rate
QoS        Quality of Service
SCED       Service Curve Earliest Deadline
SDRR       Shaped Deficit Round-Robin (algorithm)
SLA        Service-Level Agreement
SRR        Smoothed Round-Robin (algorithm)
SVC        Shaped Virtual Clock (algorithm)
SWRR       Shaped Weighted Round-Robin (algorithm)
SIR        Sustained Information Rate
TDT        Target Departure Time
TM         Traffic Management
WRR        Weighted Round-Robin (algorithm)

Acknowledgments

I am extremely grateful to my advisor, Professor George Kesidis, for introducing me to this interesting research and giving me a good deal of guidance and encouragement. I am especially indebted for the financial support which he has provided to me over the years. It was a privilege to work with him and learn from him. I thank my other committee members, Professors Raj Acharya, Guohong Cao, Peng Liu, and Natarajan Gautam, for their insightful commentary on my work. Last, but not least, I am also thankful to Patcharee Busarawong for her understanding, unfailing love, and support. Most of all, my deepest gratitude goes to my parents, Sawang and Somjit Jiwasurat, for their unconditional love and encouragement.

Chapter 1
Introduction

1.1 Motivation

Over the last decade, traffic scheduling algorithms have gained considerable attention as a necessary part of packet-switched networks that support deterministic Quality-of-Service (QoS) guarantees. In packet-switched networks, packets from different flows (streams of packets that require the same service) compete with each other for the same output link of a switch. Using scheduling algorithms at switching nodes, we can control the interactions among different traffic flows according to their service requirements. A variety of scheduling algorithms were surveyed in [40]. Among these algorithms [1, 14, 36, 37, 11, 10, 22], schedulers based on per-packet time-stamps were shown to overcome the undesirable coupling between bandwidth and delay guarantees, which can limit the number of flows that can be accepted. These schedulers were designed to handle variable-length packets, but the per-packet computational complexity associated with time-stamps may be too high for modern very high-speed Internet routers, i.e., OC-192 (10 Gbps) and, in the future, OC-768 (40 Gbps). Typically, time-stamp-per-packet based algorithms have a time complexity of O(log₂ N), where N is the number of flows/queues being scheduled.

The O(log₂ N) per-packet complexity is due to the need to sort packets in the order of their deadlines. Thus, simpler round-robin schemes have been suggested for routers with very high-speed link capacities. Under round-robin schedulers [28, 35, 6, 26, 5, 34], time is divided into successive rounds and in each round each flow is given an opportunity to transmit packets in some round-robin fashion. By eliminating the sorting of packet time-stamps, they allow an O(1) per-packet scheduling decision. As a result, the general advantage of round-robin based schedulers is that they require a minimal amount of per-packet computation. Among round-robin schemes, Deficit Round-Robin (DRR) [35] (or a variation of it) is the most popular scheduling algorithm implemented in modern high-speed routers.

The IETF has considered mechanisms to limit the variation (or jitter) of the output rates about the allocated peak rates of certain aggregate flows [16, 17], e.g., those under expedited forwarding service. Because of the action of a router's ingress linecard and switch fabric, a flow arriving to a router with a certain peak rate may have a higher peak rate (or much greater variation about the arriving peak rate) when it reaches the router's egress TM part [32, 13, 33, 7]. Thus, the egress TM part will need to shape the egress peak rates of a potentially very large number of flows. In addition, it is desirable to limit the variation about the peak rate of certain egress flows so that their packets are not deemed out-of-profile downstream or possibly dropped as a result. This leads to the notion of a Service-Level Agreement (SLA) wherein the transmitting network agrees that its packet flow will conform to certain parameters. A preferable choice of flow parameters would be:

- significant from a queueing perspective,
- simple to ensure conformity by the sending network, and
- simple to police by the receiving network.

Per-flow shaping at each network node (i.e., combined shaper-schedulers) has been advocated previously in [13, 33]. In the case of a buffered multiplexer with flows of variable-length packets, the outputs of the (tributary) FIFO queues are fed into a common transmission queue (actually, there could be a hierarchy of transmission queues [2]). The departures of each tributary queue could be shaped by a leaky bucket which employs backpressure from the transmission queue (triggered by departures from the transmission queue of the tributary queue under consideration); the result is that the departures from the transmission queue that emanated from the tributary queue have bounded burstiness [16, 17]. Such a mechanism may, in fact, achieve very conservative bounds. Without the backpressure signal, it is not clear whether the jitter bounds ensured by the leaky buckets will hold after the transmission queue. This research explores a class of shaper-schedulers based on DRR and Shaped Weighted Round-Robin (SWRR) [25] ideas wherein the deficit counter itself is used for shaping instead of a separate leaky bucket, resulting in significantly less implementation complexity. The goal is to obtain a design of combined shaper-schedulers of variable-length packets with the following properties:

- the ability to allocate a minimum amount of bandwidth to a flow irrespective of the length distribution of its component packets or those of other flows.
- the ability to fully allocate (book) the available bandwidth resource.

- the ability to be work-conserving so as to disburse excess capacity to elastic flows that may not be peak-rate limited.
- the ability to impart low packet latency for certain flows.
- the ability to limit the variation (jitter) about the designated peak rate of output flows.
- implementability at high speeds, i.e., O(1) per-packet computational complexity.

1.2 Traffic Descriptor

This section discusses preferable parameters for Service-Level Agreements (SLAs). The mean arrival rate is generally useful in terms of predicting queueing behavior. However, it is difficult to police flows according to their mean arrival rate parameters, as a reliable estimate is only known after the flows have terminated. Instead of the mean arrival rate, we consider parameters that stipulate a maximal amount of flow (in bytes) on a sliding time-window basis. The burstiness of a flow is limited by such parameters. We now describe preferable traffic parameters which can be used to regulate and police the burstiness of traffic flows. One can define the burstiness σ (bytes) of a flow of packets as a function of the amount of bandwidth, say ρ bytes per second (Bps), used to service the flow: if the flow is serviced by a queue that drains at a constant rate of ρ Bps, the backlog of that queue will never exceed σ bytes. Such a definition of burstiness informs a node (for example, a label-edge router (LER) at the edge of a Multi Protocol Label Switching (MPLS) network) so that it can allocate both memory and bandwidth resources in order to accommodate such a regulated flow.

Moreover, by limiting the burstiness of a flow, one also limits the degree to which it can affect other flows with which it shares network resources. Indeed, such regulation was standardized by the ATM Forum and adopted by the IETF [16, 17], the latter in order to help specify SLAs at network boundaries.

Suppose that at some location there is a flow of packets arriving to the token bucket mechanism depicted in Figure 1.1. A token represents a byte, and tokens arrive at a constant rate of ρ tokens/s to the token bucket, which has a limited capacity of σ tokens. A head-of-line packet of size m leaves the packet queue when there are at least m tokens present in the token bucket; when the packet leaves, it consumes m tokens, i.e., they are removed from the bucket. Note that this mechanism requires that σ be larger than the largest packet length (again, in bytes) of the flow; otherwise the token bucket mechanism may deadlock. Let A_0 be the total number of bytes departing from the packet queue, and A be any arrival process to the packet queue. In [8], it is stated that any flow A_0 that satisfies the token bucket condition, i.e.,

    A_0(s, t] ≤ σ + ρ(t − s)  for all times s ≤ t,

is said to satisfy a (σ, ρ) constraint. In IETF jargon [16, 17], ρ could be a sustained information rate (SIR) and σ could be a maximum burst size (MBS). Alternatively, ρ could be a peak information rate (PIR), in which case σ would usually be taken to be a maximum packet size, i.e., σ ≈ 1500 bytes ≤ MBS. Clearly, the token bucket can delay packets of the arrival flow A so that the departure flow A_0 is (σ, ρ)-constrained [8]. This is known as traffic shaping. The receiving network of the exchange of flows described above may wish to:

- shape the flow using a (σ, ρ) token bucket; or

- simply identify (mark) any packets that are deemed out of the (σ, ρ) profile of the flow; or
- drop any out-of-profile packets.

Fig. 1.1. A token bucket traffic shaper

Note that the last two actions are known as traffic policing. Policing can be done on a packet-by-packet basis in the leaky bucket mechanism, which makes it feasible to implement.
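The (σ, ρ) token bucket just described can be sketched in a few lines of Python. This is only an illustrative model of the mechanism of Figure 1.1 under the assumptions stated in the comments, not an implementation from this thesis; the class and method names are hypothetical.

```python
class TokenBucket:
    """Minimal (sigma, rho) token bucket: one token corresponds to one byte."""

    def __init__(self, sigma, rho):
        self.sigma = sigma   # bucket capacity in bytes; assumed larger than the max packet size
        self.rho = rho       # token fill rate in bytes per second
        self.tokens = sigma  # assume the bucket starts full
        self.last = 0.0      # time of the last token update (seconds)

    def _refill(self, now):
        # Accumulate rho tokens per second since the last update, capped at sigma.
        self.tokens = min(self.sigma, self.tokens + self.rho * (now - self.last))
        self.last = now

    def conforms(self, now, size):
        """Policing check: is a packet of `size` bytes in-profile at time `now`?"""
        self._refill(now)
        return self.tokens >= size

    def consume(self, now, size):
        """Shaping: release the packet only if enough tokens are present."""
        self._refill(now)
        if self.tokens < size:
            return False     # caller must hold the packet and retry later
        self.tokens -= size
        return True
```

A policer would mark or drop a packet when conforms() returns False, whereas a shaper would instead hold the packet and retry consume() until enough tokens have accumulated, which is what delays the flow into a (σ, ρ)-constrained departure process.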

1.3 Overview of Deficit Round-Robin (DRR)

This section describes the Deficit Round-Robin (DRR) mechanism, which is among the basic building blocks of the schedulers described in this research. In [35], weighted DRR was formulated and was shown to have the property that the bandwidth allocated to a flow is proportional to its weight parameter (i.e., unity weighted throughput) irrespective of its packet length distribution or that of other queues. Under DRR, at the beginning of each round, a certain number of tokens is assigned to each nonempty flow, proportional to the weight of the flow. Each flow has a deficit counter to count the unused portion of the allocated tokens. Packets departing a flow consume their byte-length in tokens from the flow's allotment. Flows are serviced in a round until the sum of their deficit counter and token allotment becomes insufficient to transmit their head-of-line packet, and the currently unused tokens are carried over to the next round as the value of the deficit counter. If a flow has no packets at the end of a round, its deficit counter may be reset to zero. Note that the token allotments per round can be precomputed given the weight parameters, which change at the much slower connection-level time-scale rather than at the time-scale of a single packet transmission.

DRR has an implementation simplicity comparable to that of Weighted Round-Robin scheduling (cf. Section 1.4.2) and is able to handle variable-length packets by the use of per-queue deficit counters. Varieties of DRR are now being deployed in scheduling mechanisms in the traffic management (TM) components of modern egress Internet linecards [43, 44]. It is known that DRR suffers from a high worst-case packet latency [35, 34]; however, DRR+ [35] and DRR++ [27] employ a priority mechanism to expedite the service of packets belonging to latency-critical flows. These mechanisms will be described in detail in Section 6.2.1.1.
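The round structure just described can be illustrated with a short sketch of one DRR service round. This is a minimal, hedged rendering of the mechanism of [35] for illustration only; the function and parameter names (drr_round, quanta, send) are hypothetical.

```python
from collections import deque

def drr_round(queues, quanta, deficits, send):
    """One DRR round.
    queues[j]: deque of packet sizes (bytes) of flow j;
    quanta[j]: per-round token grant, proportional to flow j's weight;
    deficits[j]: deficit counter carried between rounds;
    send(j, size): hand flow j's head-of-line packet to the link."""
    for j, q in enumerate(queues):
        if not q:
            deficits[j] = 0            # optionally reset an idle flow's counter
            continue
        deficits[j] += quanta[j]       # grant this round's tokens
        # Serve flow j while its accumulated credit covers the head-of-line packet.
        while q and deficits[j] >= q[0]:
            size = q.popleft()
            deficits[j] -= size        # packets consume their byte-length in tokens
            send(j, size)
```

Because each packet transmission requires only a constant number of counter updates, the per-packet scheduling work stays O(1), which is the property exploited throughout this thesis.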

1.4 Review of Previous Work on Per-flow Scheduling

This section reviews schedulers that focus on providing per-flow performance guarantees and/or can be implemented on very high-speed links. Previous work on time-stamp-per-packet based schedulers is discussed first, in Section 1.4.1. Section 1.4.2 then reviews several round-robin based schedulers that were recently proposed to achieve good fairness and output burstiness control while maintaining low computational complexity.

1.4.1 Time-stamps per packet based schedulers

A hybrid of VirtualClock and Shaped VirtualClock (SVC) [37] supports all of the desirable shaper-scheduler properties except that it is a non-work-conserving algorithm and is not implementable at high speeds, as its per-packet computational complexity is O(log₂ N). It is known that the available link bandwidth cannot be fully utilized under Shaped VirtualClock because it is non-work-conserving in principle. The Monolithic Shaper-Scheduler (MSS) described in [12] was proposed to obtain work-conserving behavior while preserving minimum-bandwidth guarantees and output burstiness bounds. MSS uses per-flow dual-leaky-bucket profiles to relax the non-work-conserving behavior of the Shaped VirtualClock scheduler in order to disburse the available bandwidth among eligible flows. That is, it allows some in-profile flows to violate the eligibility condition whenever no eligible flows are available.

Even though it was shown that MSS is able to fully utilize the available bandwidth while maintaining the crucial properties of Shaped VirtualClock, its implementation complexity is considerably high, which makes it unsuitable for high-speed implementation. This is because MSS requires all eligible packets to be sorted into one monolithic queue. Leap Forward Virtual Clock (LFVC) [36] achieves a reduction in computational complexity from O(log₂ N) to O(log₂ log₂ N) by using coarser numerical precision to represent the packet time-stamps. However, the sorting mechanism of LFVC requires a complicated data structure that is impractical to implement at high speeds. A time-stamp-per-packet based scheduler that considers delay jitter bounds and low computational complexity appears in the SCED+ algorithm [9, 4, 3]. SCED+ uses a damper (traffic regulator) at the next network node to regulate the delay variation introduced at one network node. Also, SCED+ provides a representation of output burstiness that can be used to obtain and analyze delay jitter results. However, it is not clear whether SCED+ possesses a precisely defined per-flow output burstiness property. Shaped VirtualClock (SVC) scheduling [37] employs time-stamps to give some packets service priority over others [41] but restricts consideration only to packets that meet an eligibility criterion [8, 40, 18, 37, 36], the latter used to limit the jitter of the individual output flows (an ability that DRR does not possess).

1.4.2 Round-robin based schedulers

Bin Sort Fair Queueing (BSFQ) [6] uses an approximated priority scheme with a cyclic bin-sort mechanism. It assigns a virtual service deadline to each arriving packet, which is used to sort packets with similar deadlines into bins.

All packets in the active bin are transmitted in FIFO order until the bin is empty. Therefore, packets are scheduled in approximately the same order as their deadlines. Aliquem [26] addresses a known DRR problem, namely that a large frame length can cause poor fairness and high packet latency, where the frame length is the total size of packets that should be transmitted in each round when all flows are backlogged. In order to limit the frame size, Aliquem allows the quantum assigned to a flow to be scaled down to be smaller than the maximum packet size. Therefore, in each round, some flows may not be able to transmit any data. To manage the eligible flows in each round, Aliquem uses an Active List Queue mechanism to keep track of flows with enough accumulated quanta to transmit a packet. A similar approach, employing a flexible quantum assignment per flow to improve packet latency, is found in [23, 38]. Similarly, Stratified Round-Robin [31] alleviates the large frame size by grouping flows with close reserved bandwidths together into an aggregate flow. DRR is used to schedule flows in the same group, while time-stamp-per-packet based scheduling is used among aggregate flows to provide better packet latency. Smoothed Round-Robin (SRR) [5] uses a specially designed weight matrix to spread the quantum assigned to a flow over a round, so that a flow can send at most one maximum-sized packet at a time. In addition, in each round two consecutive visits to the same flow are spread as far apart as possible in order to smooth the output burstiness. These scheduling proposals were all shown to have low computational complexity and to achieve better fairness and lower packet latency than DRR. In addition, to the best of our knowledge, SRR is the only scheduler that attempts to address the per-flow output burstiness problem under tight implementation constraints [31].

However, as we will subsequently see, it is unclear whether SRR possesses the per-flow output burstiness (jitter) control property.

1.5 Problem Statement

In this research, we consider the problem of simultaneous bandwidth scheduling (multiplexing) and shaping of distinct flows of variable-length packets. Such shaper-schedulers are resident in egress linecards of Internet routers, operating on packet memories that feed output links. They are used for link channelization (partitioning a link into smaller channels) and at network boundaries where service-level agreements (SLAs), involving token bucket parameters [16, 17], are struck and policed. Flows are considered to be streams of packets that have the same service requirements in packet-switched networks. Distinct flows can be defined in a variety of ways, e.g., a two-flow system based on transport-layer protocol, flows corresponding to network-layer (diffserv) classes of service, or flows corresponding to individual provisioned label-switched paths (LSPs). Suppose that at some location, N distinct flows of variable-length packets are to be multiplexed into a single flow. The flows are indexed by n ∈ {0, 1, ..., N−1} in Figure 1.2. Suppose each flow n is assigned its own tributary queue, allocated a fraction ρ_n of the available link (output) bandwidth c bytes/s, and the output flows of the tributary queues are multiplexed into a transmission queue served at a rate of c bytes/s. We assume that the tributary queues are not overbooked, i.e., Σ_{n=0}^{N−1} ρ_n ≤ 1.

Fig. 1.2. Multiplexing N packet flows

Given the problem of multiplexing N distinct flows of variable-length packets into a single flow, we present a shaper-scheduler [20] based on DRR and SWRR ideas, wherein the deficit counter itself is used for shaping instead of a leaky bucket, resulting in O(1) per-packet computational complexity. Indeed, we show that our proposed Shaped DRR (SDRR) scheduler has a strict and precisely defined per-flow shaping property, making it unique among existing DRR schedulers. Our approach to the final goal, high-speed implementability, is to exploit the much longer time-scale of connection admission control; specifically, we will employ a Shaped Weighted Round-Robin (SWRR) mechanism [25] to govern queue eligibility for service and to maintain limits on the peak output rate. SWRR frames are recomputed infrequently, and hence their computation cost is not felt at the packet level. We also discuss several feasible varieties of SDRR that possess some of the above desirable scheduling properties.

1.6 Organization of thesis

This thesis is organized as follows. Chapter 2 reviews the definition and pertinent properties of SVC and SWRR operating on fixed-length packet segments. Using SWRR to increment deficit counters, a class of SDRR schedulers [20] is formulated in Chapter 3, and bounds on the variation about the peak rate of each output flow are derived. In Chapter 4 we present a version of SDRR (with declared output jitter bounds) acting on variable-length packets [19]. To overcome the bandwidth under-allocation problem of SDRR, a low-complexity Hierarchical SDRR (HSDRR) scheduler [21] is proposed and discussed in detail in Chapter 5. Finally, promising directions for future work are discussed and conclusions are drawn in Chapter 6.

Chapter 2
Background on Shaped Weighted Round-Robin (SWRR) scheduling

In this chapter, the Shaped VirtualClock (SVC) algorithm [37] is first described and its desirable ability to limit the variation about the designated peak rate of output flows is specified. We then review the properties of Shaped Weighted Round-Robin (SWRR) [25] acting on flows of fixed-length packets (which may be interpreted as segments of variable-length packets); proofs are given for completeness. Issues pertaining to variable-length packets are covered in Chapter 3.

2.1 Shaped Virtual Clock

For a time-slotted buffered multiplexer (mux) using Shaped Virtual Clock scheduling, let ρ_j be the bandwidth allocation of the j-th first-in-first-out (FIFO) queue. A buffered mux with N queues having explicit bandwidth allocations (even queues handling best-effort flows may have a small amount of bandwidth reserved as one measure to prevent starvation of service) is depicted in Figure 2.1. The no-overbooking condition is in effect: Σ_{j=0}^{N−1} ρ_j ≤ 1, i.e., the bandwidth quantities have the normalized dimension of packet segments per time-slot.

Fig. 2.1. A buffered multiplexer.

Under Shaped Virtual Clock (SVC), the k-th packet of the j-th queue is eligible to depart at and after time E_k^j, where

    E_k^j ≜ max{F_{k−1}^j, a_k^j},

a_k^j ∈ Z+ ≜ {0, 1, 2, 3, ...} is the arrival time of the k-th packet to the j-th queue, and F_k^j is its target departure time (TDT). Thus, for all j, k, the departure time d_k^j ∈ Z+ from the buffered mux of the k-th packet of the j-th queue satisfies

    d_k^j ≥ E_k^j.    (2.1)

For all j, k, the TDTs satisfy the following recursion:

    F_k^j = max{F_{k−1}^j, a_k^j} + 1/ρ_j = E_k^j + 1/ρ_j    (2.2)

where F_0^j ≜ 0. For any given slot, the SVC scheduler serves the head-of-line (head-of-queue) packet with the smallest TDT among all such eligible packets. Note that, as long as there is a packet in the buffered mux that is eligible for service, the scheduler is non-idling.

The following theorem bounds the variation about the peak rate of the packet departure process of an SVC queue (this variation is alternatively called the output jitter or output burstiness about the peak rate).

Theorem 2.1. For all times t ≥ s, the total number of departures from an SVC queue with bandwidth allotment ρ over the interval of time [s, t] = {s, s + 1, ..., t} is less than or equal to

    2 − ρ + ρ(t − s + 1),

i.e., the packet departure process is said to be (2 − ρ, ρ)-constrained.

Proof: First note that any interval of time of length t spans at most 1 + ⌈ρt⌉ consecutive intervals of time of length 1/ρ. Also recall that, for the queue under consideration,

    F_k − 1/ρ = E_k ≤ d_k ≤ F_k    (2.3)

where the first inequality is by Equation (2.1) and the second is the guaranteed-rate property of Shaped VirtualClock acting on flows of fixed-length packets [37, 25]. (The property of Shaped VirtualClock acting on flows of variable-length packets will be addressed in our future work.) For a positive rational ρ < 1, choose an interval of discrete time [s, t] where s, t are nonnegative integers; note that [s, t] contains t − s + 1 transmission epochs (integers). Let D be the cumulative packet departure process. The maximal value of D[s, t] − ρ(t − s + 1) is achieved under the following conditions:

- j ≜ ρ(t − s) ≥ 1 is an integer;
- there exists a packet index k such that F_{k−1} = E_k = s;
- d_{k−1} = s = F_{k−1} and d_k = s + 1 = E_k + 1;
- d_m = E_m = s + (m − k)/ρ for all k < m ≤ k + j (in particular, d_{k+j} = E_{k+j} = t).

Note that ρ < 1 implies E_k + 1 < E_{k+1} and, therefore, the final two conditions above are simultaneously possible. So, under these conditions,

    D[s, t] − ρ(t − s + 1) = 2 + ρ(t − s) − ρ(t − s + 1) = 2 − ρ

as desired.

Now assume that the bandwidth allotments ρ_j are all rational; that is, for each j, there are co-prime positive integers a_j ≤ b_j such that ρ_j = a_j/b_j. Let f be the least common multiple of the b_j and let w_j = a_j f/b_j ∈ Z+. Thus, ρ_j = w_j/f for all j. The queues of a buffered mux are said to be always backlogged if E_k^j = F_{k−1}^j for all j, k and E_1^j = a_1^j = 0 for all j. That is, F_k^j = a_1^j + k/ρ_j.

Theorem 2.2. The scheduling decisions made by SVC for an always-backlogged buffered mux with such rational bandwidth allotments are periodic with period f [25].

2.2 Non-work-conserving Shaped WRR (SWRR)

Consider a bandwidth resource disbursed by a non-work-conserving WRR scheduler. The frame size f is chosen so that the smallest allowable bandwidth (granularity) that can be offered to a single queue is 1/f times the total link bandwidth.

By Theorem 2.2, we can construct an idling WRR scheduler with frame size f that has exactly the same service decisions as always-backlogged SVC; we call this scheduler Shaped Weighted Round-Robin (SWRR). Note that SWRR scheduling enjoys the same output jitter property of SVC as described in Theorem 2.1 above.

Theorem 2.3. Under SWRR, E_k^j ≤ d_k^j ≤ F_k^j + 1/ρ_j − 2 for all j, k [25].

For a high-speed implementation, the timing diagram of SWRR operation is depicted in Figure 2.2. Time can be split up into consecutive intervals, each of duration, say, y seconds, where y is on the order of 10 ms. There are two SWRR arrays: one is active and the other is under construction. That is, connection-level processing occurs every y seconds. Consider the n-th such interval of length y. The newly granted rates (of new or renegotiated connections) and expired rates (of recently terminated connections) over the interval of time [ny − x, (n + 1)y − x) are collected, where x represents the time required to compute the SWRR frame; here, of course, x < y is required. The under-construction frame is then computed over the interval [(n + 1)y − x, (n + 1)y) and then swapped with the active frame at time (n + 1)y, i.e., the under-construction frame becomes the active frame and vice-versa. This process then repeats. The complete details of how SWRR could be implemented, and a specific example of a SWRR frame, are described in [25]. Again, we stress that the computation of the SWRR frame occurs on a much longer time-scale than the transmission time of a packet. Thus, in terms of servicing queues, this computation is performed at the connection-level time-scale, resulting in absolutely minimal (O(1) worst-case) per-packet computational complexity.

Figure 2.3 shows an example of an f = 35-slot SWRR frame (assuming a_1^j = 0 for all j) for a buffered multiplexer with N = 10 queues with the weights shown in Table 2.1.

Table 2.1. Example of flow weight parameters with no overbooking condition

    Queue index j:   0      1      2      3      4      5      6      7      8      9
    ρ_j          : 1/35   1/35   2/35   2/35   2/35   4/35   4/35   4/35   5/35  10/35
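One way to precompute such a frame, suggested by Theorem 2.2, is to simulate SVC on an always-backlogged system with allotments w_j/f for f slots and record the resulting service order. The Python sketch below is only illustrative; the function name and the tie-breaking rule are assumptions, and the construction actually used in [25] may differ in detail.

```python
from fractions import Fraction

def build_swrr_frame(weights, f):
    """Build an f-slot SWRR frame from integer weights w_j (rho_j = w_j / f)
    by simulating Shaped VirtualClock on always-backlogged queues."""
    assert sum(weights) <= f, "no-overbooking condition violated"
    rho = [Fraction(w, f) for w in weights]
    # Always-backlogged SVC: E_1 = a_1 = 0, F_k = E_k + 1/rho_j, and E_{k+1} = F_k.
    eligible = [Fraction(0) for _ in weights]   # next eligibility time E (in slots)
    finish = [1 / r for r in rho]               # next target departure time F
    frame = []
    for slot in range(f):
        # Serve, among queues eligible by this slot, the one with the smallest TDT.
        ready = [j for j in range(len(weights)) if eligible[j] <= slot]
        if not ready:
            frame.append(None)                  # idle slot (the scheduler is non-work-conserving)
            continue
        j = min(ready, key=lambda i: finish[i])
        frame.append(j)
        eligible[j] = finish[j]                 # E_{k+1} = F_k under always-backlogged operation
        finish[j] = eligible[j] + 1 / rho[j]    # F_{k+1} = E_{k+1} + 1/rho_j
    return frame

# The Table 2.1 example: a 35-slot frame for weights w_j with f = 35.
# frame = build_swrr_frame([1, 1, 2, 2, 2, 4, 4, 4, 5, 10], 35)
```

Because tie-breaking among equal TDTs is left unspecified here, the resulting slot order may differ from Figure 2.3 while still allocating exactly w_j slots per frame to queue j.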

Fig. 2.2. Timing diagram of a SWRR scheduler for a high-speed implementation

Fig. 2.3. Example of an f = 35-slot SWRR frame

Chapter 3
Shaped Deficit Round-Robin (SDRR)

3.1 Introduction

In this chapter, we present a shaper-scheduler of variable-length packet flows, called Shaped Deficit Round-Robin (SDRR). Our proposed SDRR is a multiple time-scale mechanism employing Shaped Weighted Round-Robin (SWRR) to maintain its deficit counters in a manner similar to the leaky bucket mechanism, where the SWRR frame is computed at the slower connection-level time-scale, resulting in O(1) per-packet computational complexity. In order to limit the jitter about the designated peak rate of output flows, it employs a backpressure signal, triggered by departures from the transmission queue of the flow under consideration.

The organization of this chapter is as follows. First, our proposed SDRR is formulated in Section 3.2. Next, Section 3.3 presents the performance analysis of SDRR, in particular a bound on the jitter about the peak rate of each output flow. Then, lower-complexity variants of SDRR are described in Section 3.4. A simulation study of the performance of SDRR is given in Section 3.5. Finally, conclusions are summarized in Section 3.6.

Fig. 3.1. A buffered multiplexer and a transmission queue for variable-length packets

3.2 Shaped Deficit Round-Robin algorithm

Under SDRR, each queue in the buffered multiplexer shown in Figure 3.1 has a deficit counter (D_j, as with DRR [35]) and a backpressure signal (B_j). Let β be the number of bytes in a memory block, which is chosen to be the smallest possible packet length; a typical modern value for β is 40. Let C bytes/s be the total output bandwidth (the service rate of the transmission queue). A time-slot is therefore β/C seconds, and the bandwidth allocation of the j-th queue is ρ_j C bytes/s.

Let W(t) represent the number of bytes in the transmission queue, including the remains of any partially transmitted packet, at time t; i.e., W is the workload in the transmission queue. When a packet of size m arrives at time t (measured in time-slots), W(t) = W(t − 1) + m − β, and when no packet arrives at time t, W(t) = max{W(t − 1) − β, 0}.

A decision to schedule a head-of-line (HOL) packet out of one of the N tributary queues and into the transmission queue is made only once each time-slot. SDRR employs SWRR, operating on a time-slot basis, to schedule an HOL packet out of one of the N tributary queues and into the transmission queue. Packet departures from the transmission queue, however, do not necessarily occur on time-slot boundaries; again, departures from the transmission queue occur according to a dedicated service rate of C. In fact, constructing such a scheduler to move packets from tributary queues to the transmission queue requires only the movement of packet pointers referencing packets stored in packet memory. Thus, the cost of moving packets from queue to queue is only the cost of moving their pointers, which preserves O(1) per-packet computational complexity.

The SDRR algorithm is summarized in Figure 3.2. Note that the quantity β/ρ_j can be precomputed when the SWRR frame itself is computed, and the quantity m/ρ_j can be computed prior to the arrival of a packet at the head of the transmission queue, i.e., the cost of this computation can be amortized over more than one time-slot. Also note that, at all times, D_j ≥ −β/ρ_j and that D_j ≤ 0 implies W ≥ D_j. The deficit counters are allowed to become negative in order to accommodate packet lengths that are not integer multiples of memory blocks (β).

Initially, for every queue j:
1. all of the N tributary queue backpressure signals are deactivated (B_j ← OFF).
2. all of the deficit counters are set to zero (D_j ← 0).

For every time-slot (of the SWRR frame) assigned to queue j:
1. if D_j > 0 then D_j ← D_j − β/ρ_j.
2. if W ≥ D_j and B_j == OFF:
   - send the HOL packet of the j-th queue to the transmission queue;
   - activate backpressure on queue j (B_j ← ON).
3. shift the SWRR pointer to the next queue index in the frame.

When a packet from queue j of size m bytes moves to the head of the transmission queue:
1. deactivate backpressure on queue j (B_j ← OFF).
2. D_j ← D_j + m/ρ_j.

Fig. 3.2. A description of the SDRR algorithm
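For concreteness, the per-slot and head-of-transmission-queue rules of Figure 3.2 can be rendered as a small Python class. This is an illustrative sketch under simplifying assumptions (the SWRR frame is precomputed, packet arrivals and the events at the head of the transmission queue are driven externally, and one memory block β of workload drains per slot); the class and method names are not from the thesis.

```python
from collections import deque

class SDRRSketch:
    """Sketch of the SDRR rules of Figure 3.2.
    rho[j]: bandwidth fraction of tributary queue j; beta: memory-block size (bytes);
    frame: precomputed SWRR frame mapping slots to queue indices (None = idle slot)."""

    def __init__(self, rho, beta, frame):
        self.rho, self.beta, self.frame = rho, beta, frame
        self.slot = 0                            # SWRR frame pointer
        self.D = [0.0] * len(rho)                # deficit counters
        self.B = [False] * len(rho)              # backpressure signals (True = ON)
        self.queues = [deque() for _ in rho]     # tributary FIFO queues of packet sizes
        self.W = 0.0                             # transmission-queue workload (bytes)

    def on_slot(self):
        """Rules applied on every time-slot of the SWRR frame."""
        j = self.frame[self.slot]
        if j is not None:
            if self.D[j] > 0:
                self.D[j] -= self.beta / self.rho[j]
            if self.queues[j] and self.W >= self.D[j] and not self.B[j]:
                size = self.queues[j].popleft()  # release the HOL packet
                self.W += size                   # it joins the transmission queue
                self.B[j] = True                 # activate backpressure on queue j
        # One memory block's worth of bytes drains onto the output link per slot.
        self.W = max(self.W - self.beta, 0.0)
        self.slot = (self.slot + 1) % len(self.frame)

    def on_head_of_transmission_queue(self, j, size):
        """Rules applied when a queue-j packet of `size` bytes reaches the head of
        the transmission queue and begins transmission on the output link."""
        self.B[j] = False                        # deactivate backpressure on queue j
        self.D[j] += size / self.rho[j]
```

Driving on_head_of_transmission_queue at the correct times requires a model of the transmission queue itself, which this sketch deliberately leaves out.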

We now explain why the comparison is made between W and D_j to decide whether to forward the HOL packet to the transmission queue; the explanation is made more precise in Section 3.3. Let t be the time (in seconds) at which a packet of size m bytes is transferred from tributary queue j to the transmission queue. This packet will arrive at the head of the transmission queue and begin to be transmitted onto the output link at time s ≜ t + W(t)/C (at time s + m/C this packet has completely departed the transmission queue). To maintain a limit of ρ_j C bytes/s on the flow emanating from the j-th tributary queue on the output link, we would like the next packet of the j-th queue to appear at the head of the transmission queue no sooner than time s + m/(ρ_j C). If the next packet in the j-th queue is released into the transmission queue at time r′ ≥ s, it will get to the head of the transmission queue at time r′ + W(r′)/C. Therefore, to limit the peak output rate of flow j, we want to choose r′ so that

    r′ + W(r′)/C ≥ s + m/(ρ_j C).    (3.1)

According to the mechanism of SDRR as described above, at time s the deficit counter D_j(s) (deficit counter j evaluated at time s) is set to m/ρ_j. Note that both t and r′ are integer multiples of time-slots of duration β/C. Over the interval of time [s, r′], SWRR will visit the j-th queue roughly

    ρ_j (r′ − s)/(β/C) times.    (3.2)

Therefore,

    D_j(r′) ≈ m/ρ_j − ρ_j [(r′ − s)/(β/C)] (β/ρ_j).    (3.3)

Finally, note that the inequality W(r′) ≥ D_j(r′), using this approximation for D_j(r′), is the desired condition (3.1); see Figure 3.3.

3.3 Performance Analysis

3.3.1 Output jitter performance of SDRR

An arrival process of variable-length packets is said to be weakly (σ, ρC)-constrained if, for all times d that are start-times of the arrival of a packet and all times s ≤ d,

    A[s, d) ≤ σ + ρC(d − s)

where A[s, d) is the cumulative number of bytes arriving over the interval of time [s, d). Note that such a process with a maximum packet size of M (M = 1500 bytes or, if jumbo packets are used, M = 9600 bytes) is (σ + M(1 − ρ), ρ)-constrained in the ordinary sense.

Theorem 3.1. Under SDRR, the j-th flow emanating from the transmission queue will be weakly (3β, ρ_j C)-constrained.

Proof: Consider the scenario of Figure 3.4 depicting departures from the transmission queue of packets that emanated from the j-th tributary queue. The weak (σ, ρ)-constraint of this flow is achieved at times d and s that are start-times of transmission of packets (the former by definition). So, consider two such times s_i and d_k for i ≤ k, and let m_n be the size of the n-th packet of the j-th queue (departing the transmission queue over the interval of time (s_n, s_n + m_n/C)). Let n_j[s, r] be the number of visits to the j-th queue by SWRR over the interval of time [s, r] with r ≥ s. Also let r′ be the time at which the k-th queue-j packet is sent to the transmission queue, i.e.,

    d_k = r′ + W(r′)/C ≥ r′ + D_j(r′)/C    (3.4)

where this last inequality is a rule of SDRR. Also, under SDRR, for all i,

    m_i/ρ_j ≥ D_j(s_i) ≥ (m_i − β)/ρ_j

and, for all i ≤ k,

    D_j(r′) ≥ D_j(s_i) + (1/ρ_j) Σ_{n=i+1}^{k} m_n − (β/ρ_j) n_j[s_i, r′].

Thus,

    D_j(r′) > (1/ρ_j)(−β + Σ_{n=i}^{k} m_n) − (β/ρ_j) n_j[s_i, r′]
            = (1/ρ_j)(−β + A_j[s_i, r′)) − (β/ρ_j) n_j[s_i, r′]
            ≥ (1/ρ_j)(−β + A_j[s_i, r′)) − (β/ρ_j)(2 − ρ_j + ρ_j((r′ − s_i)/(β/C) + 1))

where A_j is the cumulative departure process from the transmission queue that emanated from the j-th tributary queue, and the last inequality is by Theorem 2.1. Substituting this lower bound for D_j(r′) into (3.4), we get that

    d_k > s_i + (A_j[s_i, d_k] − 3β)/(C ρ_j),

which gives the desired result because the quantity 3β (= σ) does not depend on i or k.

3.3.2 Worst-case latency performance of SDRR

Under DRR, a worst-case scenario for packet latency occurs when a packet arrives to an empty queue shortly after the beginning of a long service cycle in which the queue under consideration is not served (as it must wait until the end of the service cycle to be put on the list of active queues). In this context, a service cycle may be long because of a possibly large number (N − 1) of active queues and because of the particularly high common grant of credits to the deficit counters at the start of the cycle.

These grants of credits are sufficient to allow the queue with the smallest bandwidth allocation to transmit a maximum-sized packet. Similar to DRR, SDRR imparts a high worst-case packet latency. Under SDRR, a worst-case scenario for packet latency occurs simply when a packet arrives at the transmission queue while it is occupied by maximum-sized packets (M) from all of the other queues. This is possible because the SWRR frame may cluster a large number of different queues consecutively and each such queue may have a maximum-sized packet to transmit. We identify the latency of a packet as the difference between its arrival time at the head of its tributary queue and the time it departs from the transmission queue. So, for N tributary queues, the worst-case delay under SDRR is O(NM).

3.4 Alternate versions of SDRR

In this section, we study feasible varieties of SDRR that we have considered in order to achieve lower computational complexity while maintaining reasonable per-flow output jitter control. In other words, we explore simplified versions of SDRR that do not use the transmission queue occupancy information or backpressure from the transmission queue. For clarity, the version of SDRR described above will be called SDRR version a (SDRRa) in the following.

3.4.1 SDRR without the knowledge of transmission queue occupancy

In light of the discussion in Section 3.2, we can specify SDRR version b (SDRRb), a less computationally complex version of SDRRa that does not use knowledge of the transmission queue occupancy, W. Figure 3.5 illustrates the SDRRb mechanism.

3.4.2 SDRR without backpressure from the transmission queue

We now describe a simplification of SDRRa that does not use backpressure from the transmission queue. Under SDRR version c (SDRRc), backpressure from the transmission queue is not used, but the deficit counter is updated as in SDRRa (using the backlog information W of the transmission queue). That is, we do not check whether queue j's backpressure is deactivated in step 2) of the per-time-slot rules of SDRRa as stated above.

3.5 Simulation results

The simulations were set up using the following traffic parameters: one transmission queue with capacity 10,000 bytes/s (the value of the transmission queue's service rate was not important), N = 10 tributary queues with weights sorted in non-decreasing order 1/35, 1/35, 2/35, 2/35, 2/35, 4/35, 4/35, 4/35, 5/35, and 10/35 as used in Section 2.2, and memory blocks of β = 40 bytes as commonly used in practice. We considered simulations involving three types of packet-length distributions for the flows involved. Two of these were packets of constant size: either M = 41 bytes or M = 1500 bytes. Note that these two sizes are part of standard industry test suites wherein the router is subjected to sustained bursts of maximum-sized packets, and of packets whose lengths are all one byte more than the memory block size.

Indeed, the objective of a test flow consisting of constant 41-byte packets is to precipitate maximally wasteful packet memory I/O, i.e., two 40-byte memory blocks are required to send a single 41-byte packet, thereby wasting 39 bytes of I/O. The third packet-length distribution is representative of realistic Internet traffic [42]. To obtain worst-case results consistent with the discussion of Section 3.3.2, we assumed that the tributary queues are always nonempty (active). Each simulation was terminated after 3,500 packets had been serviced by the transmission queue, giving very low relative error for all quantities estimated (the sample standard deviation over the estimated mean was always 0.10 or less). We present the performance results in terms of weighted throughput (i.e., the throughput of flow j divided by ρ_j), output peak-rate shaping, and worst-case packet latency.

3.5.1 Weighted throughput performance

Figures 3.6, 3.7, and 3.8 respectively depict the per-flow/queue weighted throughput under SDRRa, SDRRb, and SDRRc. The weighted throughput of queue n is defined to be the observed mean rate out of the transmission queue, in bytes/s, of packets that emanated from queue n, divided by ρ_n C. We therefore refer to the value 1 as the reference value in the graphs. As can be seen from these figures, the observed throughputs were clearly less than the corresponding allocated amounts of bandwidth ρ_n C. In particular, this discrepancy reaches a maximum of about 75% for queue 10 (the queue with the highest bandwidth allocation) for the Internet and constant-1500-byte packet-length distributions. In general, we observed that flows with the highest bandwidth allocations received the least weighted throughput under SDRRa and SDRRb. This phenomenon is due in