Modeling a shared medium access node with QoS distinction


Matthias Gries, Jonas Greutert
Computer Engineering and Networks Laboratory (TIK)
Swiss Federal Institute of Technology Zürich
CH-8092 Zürich, Switzerland
{gries,greutert}@tik.ee.ethz.ch

TIK-Report No. 86
March 30, 2000

Abstract

This report describes a high-level design space exploration for the implementation of an IP over ATM shared-medium access node with Quality of Service (QoS) distinction and routing functionality. The different blocks of the architecture are modeled with different optimization criteria in mind. The models provide worst-case bounds on throughput, delay, costs, and memory space depending on a variety of parameters and implementation alternatives. We then reveal different optimal designs for the main memory system of the access node and point out the hot spots of the architecture and implementation. As a result, a node supporting a 155 Mbit/s line rate, a state-of-the-art fair queuing scheduler with up to $2^{13}$ concurrent connections, a buffer for at least $2^{15}$ packets, and a backbone router with up to routing entries can be implemented using a single moderately clocked general purpose CPU core with two memory buses and four to eight memory chips. The resources show a worst-case utilization of 75 % in terms of latency, so that there is room for further building blocks and higher line rates.

Contents

1 System overview
2 Memory Model
  2.1 Model for a single chip with bank interleave
  2.2 Several chips forming a wide memory bus
  2.3 Wide memory bus with several modules
  2.4 Several separate memory buses
  2.5 Memory model parameters derived from simulations
3 Common parameters for the building blocks of the access node
4 The wheel framing mechanism
  4.1 Requirements for the frame storage
  4.2 Requirements for determining the destination queue
  4.3 Example
5 Destination queues and IP-ATM header conversion
  5.1 Requirements of the destination queues
  5.2 Requirements of the header conversion
  5.3 Example
6 Packet scheduler
  6.1 Fixed Priority scheduler
  6.2 WFQ scheduler
  6.3 WFQ approximation MD-SCFQ
7 IP payload - IP header split and storage
  7.1 Payload and address queue RAM
  7.2 Example
8 VPI/VCI lookup for cell removal from line
  8.1 VPI/VCI lookup
  8.2 Example
9 ATM payload-header split
  9.1 ATM cell split and storage
10 Reassembly
  10.1 Reassembly queues
  10.2 Example
11 Enhanced IP router
  11.1 IP router / forwarding stage
  11.2 Example
12 Design space exploration of the memory subsystem
13 Conclusion

1 System overview

The modeled network access node is part of a network which uses a line topology (see Fig. 1). ATM connections are established for bundles of IP flows with the same source and destination node on the line. Thus, the ATM line forms a transparent backbone for IP flows. Incoming UBR traffic is always taken off the line at the input of a node and fed back into the ATM traffic at the output of the node, provided that the node is not the destination of the UBR traffic. All other kinds of traffic can pass through an ATM node on a direct data path with highest priority to minimize the latency introduced by the node for line traffic. The line access arbitration is implemented with a frame slot reservation scheme. In order to keep the management reasonably simple and the access latencies short, the number of slots and the frame length are fixed. That is, a 100 % usage of the line with non-UBR traffic can only be achieved using particular rates. For an arbitrary flow rate, line bandwidth must be overbooked, which reduces the usage of the line. However, a 100 % usage of the line can still be reached by stuffing unused bandwidth with UBR traffic at the line access level.

Figure 1: System overview (ATM line consisting of line segments and ATM nodes; bundled flows for the same destination node; IP-based source and destination devices).

The inner architecture of a line node is displayed in Fig. 2. IP flows are classified, routed, and scheduled on a packet basis. The ATM cell segmentation is performed afterwards. For each destination node there is a scheduler which bundles IP flows according to either fixed priority or individual QoS classes. The modeled architecture only considers components for a static network scenario, i.e., ATM connections are established and frame slots have been assigned to connections. The models in particular underpin the influence of the main memory. Therefore, models for recent RAM technologies (SDRAM, RDRAM, SRAM) have also been taken into account.
Modeling methodology

The characteristics of the building blocks of the access node, such as memory space and memory throughput requirements, calculations, and introduced delay, have been determined by the analysis of suitable algorithms. The following assumptions have been made: only arithmetic operations have been counted; no branch or loop instructions have been counted. Instructions are found in on-chip cache memory and thus do not generate additional delay due to fetching. Moreover, allocation and deallocation of memory segments only need one further operation. Since only memory segments of fixed size are required, the free memory list can be implemented by a linked list together with distinct memory areas for the list and the memory segments. Moreover, memory fragmentation does not affect system performance. In this configuration, only a couple of additional registers are required to keep track of the FILO-organized free list.

The delay analysis considers three kinds of causes: delay introduced by memory accesses, by calculations, and by algorithm characteristics such as priority inversion if packet scheduling is performed. In the worst-case, data dependencies prevent the concurrent execution of calculations and memory accesses. Therefore, the delay bounds for the three cases have been added.

Figure 2: ATM node architecture: IP layer + ATM plane / ATM layer. Blocks shown: IP input, enhanced IP router, IP payload - IP header split and packet storage (WFQ tag calculation if a WFQ-type scheduler is used; an IP packet is virtually stored as a linked list of 48 byte units, i.e., ATM payloads), one FIFO queue per flow grouped per destination node, one packet scheduler per destination node, ATM header generation (flow id -> VPI/VCI; an IP packet is mapped onto several ATM cells), one FIFO queue per destination node, the "wheel" of assigned frame slots, VPI/VCI lookup at the ATM line input, ATM payload - header split and payload storage, reassembly (one FIFO queue per flow), and IP output. OAM / CAC cells are exchanged with the control / management plane.
In the given examples, two different memory configurations have been investigated: an SDRAM running at 100 MHz with an average access overhead of four cycles and an SRAM running at 166 MHz with two cycles access overhead. Both memory subsystems use a 32 bit data bus. Finally, the throughput analysis shows whether the chosen memory technology is suitable for the corresponding task. The size of an ATM cell (or IP packet, respectively) divided by the delay bound must not be less than the rate that has to be supported. Otherwise, the system can be overloaded. The results of this report can thus be used to estimate the impact of a variety of parameters such as ATM rate, number of connections, packet sizes, etc. on the requirements of the memory subsystem and the needed computing power without the help of any special hardware functional units. The analysis can be performed independently for the building blocks of the node, e.g. for a pipelined implementation of the system by a chain of processor modules. The case study in this report, however, shares a single processor module in order to implement all building blocks of the access node.

This report is organized as follows: in the next section, different main memory models are developed taking simulations of real-world applications into account. Section 3 summarizes the common modeling parameters used in the succeeding sections. Sections 4 to 11 describe the models for the different building blocks of the access node. Each of these sections provides examples for typical system settings. Section 12 finally shows optimal configurations of the main memory system in terms of costs and latency for an access node considering the models of the preceding sections. Section 13 concludes the report.

2 Memory Model

The models for main memory accesses derived in this section are used in subsequent sections for the determination of realistic throughput and delay bounds for the building blocks of the access node.

Nomenclature

$g$ [bit]: granularity of data, i.e., smallest amount of data to be accessed and transferred by a burst
$f$ [MHz]: clock frequency of the memory bus
$cap$ [bit]: capacity of a single memory chip
$db$ [bit]: data bus width of the memory chip
$b$: number of memory banks within a chip
$gd$ [bit]: size of the data structure to be stored
$c$ [$]: costs per chip
$bb$ [bit]: width of the memory bus
$cmc$ [$]: costs of the memory controller
$m$: number of memory modules
$bc$: number of independent memory buses
$ov$ [cycles]: overhead for processing a read or write request, dependent on the memory type; e.g., an SDRAM may need up to six extra clock cycles for preparing a burst transfer
$ps$ [bit]: Rambus RAM specific parameter: size of a data packet

Simplifications

- No distinction is made between read and write accesses, and the internal state of the RAM is not considered. That is, all transfers need the same latency.
- Refresh cycles are ignored.
- The smallest amount of data to be accessed fits in a memory row.
- An ideal interleave of memory banks and modules is assumed. In reality, data dependencies usually prevent this behavior (see subsec. 2.5).
2.1 Model for a single chip with bank interleave

A single memory transfer without interleaving is shown in Fig. 3 a). The SDRAM needs $g/db$ clock cycles for the transfer of the data. Moreover, $ov$ cycles must be spent for preparing the transfer. Thus, a single burst transfer needs $ov + g/db$ cycles for completion. However, the $ov$ cycles can be hidden if more than one memory bank is available within the chip. Fig. 3 b) shows two transfers using two memory banks, hiding some of the overhead cycles. Finally, if enough memory banks are available, the overhead cycles can be hidden completely as shown in Fig. 3 c).

Figure 3: Memory bank interleave using an SDRAM: a) single transfer, b) two banks available, c) four banks available. Each transfer consists of $ov$ overhead cycles followed by $g/db$ data cycles; the time axis is given in clock cycles.

throughput $th$:

Parameter   SDRAM    RDRAM a)                                  SRAM
$X$         $g/db$   $\lceil g/ps \rceil \cdot ps/(2 \cdot db)$   $g/db$

a) The RDRAM uses both edges of the clock.

$$th = \frac{\#\text{bits transferred}}{\text{cycles needed}} = \frac{b \cdot g}{\max(b \cdot X,\; ov + X) \cdot \frac{1}{f}} \qquad (1)$$

costs: $\lceil gd/cap \rceil \cdot c + cmc$

2.2 Several chips forming a wide memory bus

In this model, several memory chips form a wide memory bus. All chips are controlled by the same signals. A single SIMM or DIMM memory module is an example of this configuration. RDRAMs are usually not used in this configuration.

throughput $th$ for SDRAMs / SRAMs:

$$th = \frac{b \cdot g}{\max\left(b \cdot \frac{g}{bb},\; ov + \frac{g}{bb}\right) \cdot \frac{1}{f}} \qquad (2)$$

costs: $\left\lceil \frac{gd}{\frac{bb}{db} \cdot cap} \right\rceil \cdot \frac{bb}{db} \cdot c + cmc$

2.3 Wide memory bus with several modules

In this model, several memory chips form a wide memory bus. All chips are controlled by the same signals. Several memory modules can be controlled concurrently sharing the same memory bus. Using several SIMMs or DIMMs on a single memory bus is an example of this configuration when using SDRAMs or SRAMs. For RDRAMs ($bb = db$), several memory chips are controlled concurrently using the same bus. The number of needed memory modules can be determined from the size of the data structure $gd$ or be given by the parameter $m$ to enforce a higher degree of parallelism. Basically, increasing the number of memory modules provides more individual banks in this model.

throughput $th$:

Parameter   SDRAM    RDRAM a)                                  SRAM
$X$         $g/bb$   $\lceil g/ps \rceil \cdot ps/(2 \cdot db)$   $g/bb$

a) The RDRAM uses both edges of the clock and $bb = db$.

$$th = \frac{\max\left(m,\; \left\lceil \frac{gd}{\frac{bb}{db} \cdot cap} \right\rceil\right) \cdot b \cdot g}{\max\left(\left\lceil \frac{gd}{cap \cdot \frac{bb}{db}} \right\rceil \cdot b \cdot X,\; m \cdot b \cdot X,\; ov + X\right) \cdot \frac{1}{f}} \qquad (3)$$

costs:

$$\max\left(\left\lceil \frac{gd}{\frac{bb}{db} \cdot cap} \right\rceil,\; m\right) \cdot \frac{bb}{db} \cdot c + cmc \qquad (4)$$

2.4 Several separate memory buses

Using several memory buses concurrently simply means multiplying the cost and throughput measures of subsection 2.3 by the number of buses $bc$. Note that a separate memory controller is needed for each bus.

2.5 Memory model parameters derived from simulations

In this subsection, average values for the memory access overhead ($ov$) and the influence of bank interleaving are determined. Simulations described in [13] have shown that, due to data dependencies and the number of branches in real-world applications, bank interleaving speeds up the memory system by clearly less than a factor of two. The average overhead $ov$ for the application mix and CPU configuration used in [13] is listed in Tab. 1. These results suggest simplifying the formulas given in subsections 2.1 and 2.2. The effect of using several modules as described in subsection 2.3 has not been investigated in [13]. Equation 1 reduces to

$$th = \frac{g}{(ov + X) \cdot \frac{1}{f}}$$

by removing the influence of the number of banks $b$ in favor of using the corresponding value of the memory access overhead $ov$ given in Tab. 1. Similarly, equation 2 reduces to

$$th = \frac{g}{\left(ov + \frac{g}{bb}\right) \cdot \frac{1}{f}}$$
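The throughput model can be evaluated numerically to see when the overhead is hidden. The following is a minimal sketch, not part of the original report: it assumes the reading th = b·g / max(b·X, ov + X) · f for equations (1) and (2) and th = g / (ov + X) · f for the reduced form, covers only the SDRAM/SRAM case X = g/width, and uses hypothetical function names. The example values (32 bit bus, 100 MHz, ov = 4.15 cycles for the closed-page SDRAM entry of Tab. 1) are chosen for illustration.

```python
def interleaved_throughput(b, g, width, ov, f_mhz):
    """Throughput in Mbit/s with b ideally interleaved banks (equations 1 and 2).

    b: number of banks, g: access granularity [bit],
    width: db for a single chip or bb for a wide bus [bit],
    ov: overhead cycles per access, f_mhz: memory bus clock [MHz].
    """
    x = g / width                  # data cycles of one burst (SDRAM/SRAM case)
    cycles = max(b * x, ov + x)    # overhead is hidden once b * x >= ov + x
    return b * g / cycles * f_mhz  # bits per cycle times cycles per microsecond


def reduced_throughput(g, width, ov, f_mhz):
    """Reduced model of subsection 2.5: no interleaving term, averaged ov."""
    return g / (ov + g / width) * f_mhz


if __name__ == "__main__":
    for banks in (1, 2, 4):
        print(banks, interleaved_throughput(banks, g=128, width=32, ov=4, f_mhz=100))
    print(round(reduced_throughput(g=32, width=32, ov=4.15, f_mhz=100), 1))
```

With these assumed values, two banks already saturate the 32 bit bus at 3200 Mbit/s, while the reduced model with the averaged overhead stays far below that for single-word accesses.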

Table 1: Memory access overhead dependent on DRAM and controller policy used.

SDRAM, four internal banks, 100 MHz bus     average overhead ov [ns] / [cycles]
closed-page policy                          41.5 / 4.15
open-page, no bank interleaving             36.8 / 3.68
open-page, bank interleaving                30.1 / 3.01

RDRAM, 32 internal banks, 400 MHz bus       average overhead ov [ns] / [cycles]
closed-page policy                          51.1 /
open-page, no bank interleaving             46.7 /
open-page, bank interleaving                31.3 /

3 Common parameters for the building blocks of the access node

The following parameters will be used in the succeeding sections to model the different building blocks of the access node as shown in Fig. 2.

$N$: number of nodes along the ATM line
$cs$ [bit]: ATM cell size
$hs$ [bit]: ATM cell header size
$r$ [bit/s]: rate of the ATM line
$clk$ [MHz]: CPU clock frequency
$ipps_{max}$ [bit]: maximal IP packet size
$ipps_{min}$ [bit]: minimal IP packet size
$vis$ [bit]: VPI/VCI field size
$con$ [bit]: space needed for storing the context information of a connection, i.e., the max. packet length and the reserved (minimal) rate
$nc$: max. number of connections per ATM node

4 The wheel framing mechanism

It is assumed that connections with the same destination node share a frame slot.

Specific parameters

$fs$ [slots]: number of cell slots in a frame
$add_{fs}$ [bit]: number of address bits needed for random accesses in the memory used for the frame implementation (can be calculated from the size of the data structure $gd$ and the memory chip organization)
$add_{hd}$ [bit]: address bits for header queue accesses; the memory for frame and header data structures may be shared
$lat_{wh}$ [s]: average memory access latency for wheel setting

4.1 Requirements for the frame storage

Each frame entry consists of a source node id and a destination node id. By the source id the node identifies reserved frames. A comparison of $\lceil \log_2 N \rceil$ bit values is needed for this task in each cell clock cycle. The destination id then determines the corresponding FIFO queue from which a cell can be put on the ATM line.

In the following, $r$ denotes the rate of the ATM line.

- space required: $fs \cdot 2 \cdot \lceil \log_2 N \rceil$ bits of memory space for frame storage (source + destination node id).
- throughput needed: $\frac{r}{cs} \cdot 2 \cdot \lceil \log_2 N \rceil$ for reading out frame positions in each cell cycle.
- calculations for a cell: two operations, a comparison of $\lceil \log_2 N \rceil$ bit values and a frame address increment.
- delay for a cell: $2 \cdot (lat_{wh} + clk^{-1})$ with $g = \max(\lceil \log_2 N \rceil, add_{fs}, add_{hd})$.

4.2 Requirements for determining the destination queue

p-trie implementation

A complete multiway trie [18] with $p$ pointers in each node needs $\lceil \log_p N \rceil$ address offset calculations for determining the destination queue of a frame slot.

- space worst-case: $\frac{1}{p-1} \left(p^{\lceil \log_p N \rceil} - 1\right) \cdot p \cdot add_{fs} + N \cdot add_{hd}$. The first term considers the space needed for the pointers at the next nodes in the trie and the second one stands for the pointers to the FIFO queues.
- throughput needed for traversing the trie: $\frac{r}{cs} \cdot \left(\lceil \log_p N \rceil \cdot add_{fs} + add_{hd}\right)$
- calculations: $\lceil \log_p N \rceil$ address offset calculations per cell clock cycle
- delay for a cell: $(\lceil \log_p N \rceil + 1) \cdot lat_{wh} + \lceil \log_p N \rceil \cdot clk^{-1}$

Table lookup implementation

- space required: $N \cdot add_{hd}$
- throughput needed for reading out the table: $\frac{r}{cs} \cdot add_{hd}$
- calculations: an offset address calculation per cell clock cycle
- delay for a cell: $lat_{wh} + clk^{-1}$

4.3 Example

See Tab. 2.

5 Destination queues and IP-ATM header conversion

The ATM cell header data and the corresponding payload addresses are stored after they have passed the schedulers.
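For the wheel framing mechanism of section 4, the bounds for the frame storage and for the two destination queue lookups can be compared with a few lines of code. This is a minimal sketch under stated assumptions, not the authors' tooling: it assumes the formulas read space = (p^d - 1)/(p - 1) · p · add_fs + N · add_hd with d = ceil(log_p N) for the trie and N · add_hd for the table, uses hypothetical function names, takes the clock in Hz and latencies in seconds, and the example values (N = 16, fs = 32, lat_wh = 50 ns) are illustrative.

```python
def ceil_log(n, base):
    """Smallest integer d with base**d >= n (avoids floating-point logarithms)."""
    d, value = 0, 1
    while value < n:
        value *= base
        d += 1
    return d


def frame_storage(n_nodes, fs, r, cs, lat_wh, clk_hz):
    """Return (space [bit], throughput [bit/s], delay [s]) for the frame storage."""
    id_bits = ceil_log(n_nodes, 2)
    space = fs * 2 * id_bits              # source + destination id per slot
    throughput = r / cs * 2 * id_bits     # one frame entry read per cell cycle
    delay = 2 * (lat_wh + 1 / clk_hz)
    return space, throughput, delay


def trie_lookup(n_nodes, p, add_fs, add_hd, r, cs, lat_wh, clk_hz):
    """Return (worst-case space, throughput, delay) for a complete p-ary trie."""
    depth = ceil_log(n_nodes, p)
    inner_nodes = (p ** depth - 1) // (p - 1)
    space = inner_nodes * p * add_fs + n_nodes * add_hd
    throughput = r / cs * (depth * add_fs + add_hd)
    delay = (depth + 1) * lat_wh + depth / clk_hz
    return space, throughput, delay


def table_lookup(n_nodes, add_hd, r, cs, lat_wh, clk_hz):
    """Return (space, throughput, delay) for a flat lookup table."""
    return n_nodes * add_hd, r / cs * add_hd, lat_wh + 1 / clk_hz


if __name__ == "__main__":
    common = dict(r=155e6, cs=424, lat_wh=50e-9, clk_hz=200e6)
    print(frame_storage(16, 32, **common))
    print(trie_lookup(16, 4, 20, 20, **common))
    print(table_lookup(16, 20, **common))
```

For a small number of nodes the flat table is cheaper in both space and delay; the trie only pays off when the table would have to be sparse.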

Table 2: Wheel framing example.
Fixed model parameters: $add$ = 20 bit (addresses for header and frame data structures), $cs$ = 424 bit (cell size), $clk$ = 200 MHz (CPU clock).
Model parameters: $N$, $fs$, $r$ [Mbit/s], $lat_{wh}$ [ns].
Results, given for the 4-trie implementation and for the lookup implementation: $sp$ [Kbit], $th$ [Mbit/s], $ops$ [$10^6$], $del$ [ns].

Specific parameters

$add_{pl}$ [bit]: number of address bits needed for random accesses in the memory used for the payload storage
$add_{hd}$ [bit]: address bits for header queue accesses
$lat_{pl}$ [s]: average memory access latency for payload RAM
$lat_{hd}$ [s]: average memory access latency for ATM header queue RAM
$lat_{ip}$ [s]: average memory access latency for packet header queue RAM

5.1 Requirements of the destination queues

Since the queues are positioned after the WFQ schedulers and overbooking is thus not possible, at most one packet must be buffered for each destination, i.e., $\lceil ipps_{max} / (cs - hs) \rceil$ ATM cells must be stored per destination in the worst-case. The framing scheme, introducing a maximal delay of two ATM cell periods, requires that at least two ATM cells must be buffered per destination node. This condition is always satisfied if the minimal packet size is larger than 48 byte. In the following, $r$ denotes the rate of the ATM line.

- space required: $(N - 1) \cdot \lceil ipps_{max} / (cs - hs) \rceil \cdot (hs + add_{pl})$. However, the space required for UBR traffic may be arbitrarily chosen. The IP packet payload remains unchanged in the shared memory.
- throughput required: $\frac{r}{cs} \cdot (add_{pl} + hs)$ is needed for the read out of the headers by the wheel. The throughput $r \cdot \frac{cs - hs}{cs}$ needed for reading out the payload (using separate chips for header and payload) will be considered in section 7.
- calculations: the memory entry for the header is freed.
- delay for a cell: $lat_{hd} + clk^{-1}$ for a cell header read from header RAM. The granularity $g$ is $add_{pl} + hs$. The payload RAM will be considered in section 7.

5.2 Requirements of the header conversion

The packet schedulers determine which IP packet has to be converted next. The corresponding information (VPI/VCI and payload addresses) is read from the packet queues before the schedulers and copied several times (dependent on the packet length) into the corresponding destination queue.

- throughput demand of the packet queues (considered in section 6): cs cs hs (vis + add pl)
- throughput demand of the destination queues: $\frac{r}{cs} \cdot (hs + add_{pl})$
- memory space needed: during IP-ATM header conversion only a single ATM header must be stored, which is used as template for the ATM header generation. Usually, registers are used for this purpose. The space needed in the packet and destination queues is already considered in the corresponding subsections.
- calculations: a memory entry for the header must be allocated. No additional calculations are needed because the assignment of schedulers to destination queues is fixed.
- delay: $lat_{hd} + clk^{-1}$ per cell for writing the header into a header queue, $lat_{ip}$ per packet for reading out a packet queue entry (considered in section 6). For the packet queues, $g$ is $\max(ts + add_{pl} + vis + con,\; add_{ip})$ in the WFQ case and $add_{pl} + vis$ for the fixed priority scheduler, respectively.

5.3 Example

See Tab. 3.

6 Packet scheduler

Due to quality of service (QoS) distinction requirements in access and core networks, QoS must be distinguished for groups of flows (DiffServ approach [16]) classified by traffic characteristics, applications, organizations, or protocols, and even for individual flows (IntServ approach [16]). Scheduling algorithms for output link sharing must therefore be able to assign link bandwidth to classes of flows independently of the behavior of other flows.
The idea of adapting the behavior of an ideal fluid server to the time multiplex in packet networks has been the basis for a variety of packet scheduling algorithms and was first used by Weighted Fair Queuing (WFQ) [9, 23]. An ideal fluid server is able to serve several flows concurrently according to weights. Excess bandwidth not reserved by any flow is distributed in a fair manner among backlogged flows, again according to the assigned weights. A WFQ server then schedules packets according to the order in which they would finish service in a fluid system. For this purpose, a WFQ system labels each incoming packet with the time or amount of normalized service at which the packet would leave the fluid server. The arriving packets are sorted according to these scheduling tags in increasing order, and the scheduler chooses the packet with the minimum scheduling tag for transmission over the output link. It has been shown that by applying WFQ the packet system can only be a packet length behind the service of the ideal fluid system in the worst-case. However, there is a tremendous overhead for the calculation of the scheduling tags in WFQ because a WFQ scheduler must internally emulate a fluid server in order to calculate finish times or service levels. Moreover, events can appear frequently in the fluid system since virtually all backlogged flows may finish service at the same time. In order to find a suitable trade-off between the complexity of the scheduling algorithm, the fairness of the distribution of excess bandwidth, and the provision of sharp delay bounds, different approaches have been applied in order to implement WFQ.

Table 3: Destination queues and IP-ATM header conversion example.
Fixed model parameters: $cs$ = 424 bit (cell size), $hs$ = 40 bit (cell header size), $ipps_{max}$ = 1536 byte (maximal packet size), $ipps_{min}$ = 40 byte (minimal packet size), $vis$ = 28 bit (VPI/VCI field size), $add_{pl}$ = 22 bit (payload addresses), $clk$ = 200 MHz (CPU clock).
Model parameters: $N$, $r$ [Mbit/s], $lat_{hd}$ [ns].
Results for destination queues and header conversion: $sp_{hd}$ [Kbit], $th_{hd}$ [Mbit/s], $ops$ [$10^6$], $del$ [ns].

Self-clocked fair schedulers

So-called self-clocked methods no longer emulate a fluid system but estimate scheduling tags from the tags of packets which are currently queued in the packet system. Self-clocked fair queueing (SCFQ) [11] uses the finish time of the packet currently in service for the estimation of the state of the fluid server. In contrast, start-time fair queuing (SFQ) as described in [12] uses the start time of the packet currently in service for the estimation and serves packets in increasing order of their start times. Minimum starting-tag fair queueing (MSFQ) [6] serves packets in increasing order of their finish times. The state of the fluid server is estimated by the minimum of the start times of backlogged flows. A second priority queue is therefore required. The same approach has been independently published in [7] under the name time-shift scheduling. Self-clocked algorithms decrease the implementation complexity of WFQ but usually provide worse delay bounds.
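The self-clocking idea can be illustrated with a short sketch. It is not taken from the report or from [11]; it only assumes the rule described above, namely that the state of the fluid server is estimated by the finish tag of the packet currently in service, and that a tag is computed as max(estimate, previous tag of the flow) + length / rate. The class and method names are hypothetical, and resetting the estimate when the system becomes idle is omitted.

```python
import heapq
import itertools


class ScfqScheduler:
    """Self-clocked fair queueing sketch: no fluid server is emulated."""

    def __init__(self):
        self.virtual_time = 0.0         # finish tag of the packet in service
        self.last_tag = {}              # flow id -> finish tag of its previous packet
        self.heap = []                  # (finish tag, arrival order, flow id, length)
        self.order = itertools.count()  # tie-breaker for equal tags

    def enqueue(self, flow, length_bits, rate):
        """Tag an arriving packet and insert it into the priority queue."""
        start = max(self.virtual_time, self.last_tag.get(flow, 0.0))
        tag = start + length_bits / rate
        self.last_tag[flow] = tag
        heapq.heappush(self.heap, (tag, next(self.order), flow, length_bits))
        return tag

    def dequeue(self):
        """Serve the packet with the minimum finish tag."""
        tag, _, flow, length_bits = heapq.heappop(self.heap)
        self.virtual_time = tag         # self-clocking step
        return flow, length_bits


if __name__ == "__main__":
    s = ScfqScheduler()
    s.enqueue("a", 2000, 1000.0)
    s.enqueue("b", 2000, 2000.0)
    s.enqueue("a", 2000, 1000.0)
    while s.heap:
        print(s.dequeue())
```

The only per-packet state besides the priority queue is one tag per flow and a single estimate, which is where the reduced complexity compared with an emulated fluid server comes from.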
Approximation by potential functions

More methodical approaches to fair queuing designs are based on the theory of rate-proportional servers (RPS) [31]. A scheduler of the type RPS keeps track of the state of the fluid system by a system potential function (often also called virtual time or service). Scheduling tags can then be seen as the level of the system potential function, normalized to the weight of the corresponding flow, at which the packet would depart the fluid system. Algorithms which do not emulate the fluid server precisely use a so-called base potential function in order to recalibrate the system potential at certain points of time. The system potential then increases linearly between recalibrations (assuming the system is not idle). Different base potential functions as well as different recalibration time intervals can be chosen in order to find a suitable trade-off between fairness and complexity of the scheduler. RPS based schedulers achieve the same worst-case delay bounds as WFQ.

Starting potential-based fair queueing (SPFQ) and frame-based fair queueing (FFQ) have been presented in [30]. SPFQ recalibrates the system potential at every packet departure. The base potential is updated at every packet arrival and is set to the minimum start potential of all backlogged flows, that is, the potential at which the corresponding packet would begin getting service in the fluid system. Thus, SPFQ is very similar to MSFQ. MSFQ, however, does not use a system potential and recalibrates at packet arrivals. FFQ uses a simpler base potential than SPFQ and larger intervals between recalibrations at the expense of fairness. Minimum delay self-clocked fair queueing (MD-SCFQ) [4] uses the same recalibration intervals as SPFQ together with a simplified base potential which does not need to maintain a second priority queue in order to manage start potentials. However, MD-SCFQ can achieve better fairness than SPFQ for certain flow settings.

Eligible packet selection

Although the amount of service by which a WFQ system can be behind a fluid system is bounded, the WFQ system schedule can be quite ahead of the fluid system. This behavior shows undesired properties if feedback congestion control is used, e.g. for the regulation of best-effort traffic [2]. In order to retain fairness between the flows sharing a link not only on the average but also on a fine time granularity, there are scheduling algorithms which use two distinct priority queues for sorting.
Worst-case fair weighted fair queueing (WF²Q) [2] and its more efficient implementation WF²Q+ in [1] sort arriving packets according to their start time in the fluid system. Only packets which are eligible for transmission, that is, for which service would have been started in the fluid system, are then transferred to the second priority queue, which is sorted according to the finish times. WF²Q+ does not need to emulate a fluid server but uses an approximation similar to SPFQ and MSFQ. However, WF²Q+ considers only eligible packets for transmission. Leap Forward Virtual Clock [33] transfers packets from backlogged but oversubscribed flows to a second priority queue. Packets residing in this queue are not eligible for transmission yet. Care is taken that packets are written back to the first priority queue before any delay bound may be missed. Both algorithms have in common that the full contents of one priority queue must be copied to the other priority queue between two scheduling decisions in the worst-case.

Round-Robin variants

There are scheduling algorithms with very low complexity which enhance the concept of a Round-Robin scheduler with virtual service ideas. However, the algorithms presented in [5] and [27] cannot provide delay or fairness bounds as sharp as those of schedulers which use a fluid server as reference model.

Hierarchical grouping

If a single level of flows or flow classes is not detailed enough in order to distinguish QoS, one may think about using several levels of schedulers within a hierarchy [1, 10, 32]. However, the delay and fairness properties of the schedulers are accumulated through the levels of the hierarchy. Moreover, in the latter case, where a scheduler based on service curves is used [26], one should be aware of the complexity overhead involved in applying service curves which model more complex behavior than leaky bucket constrained sources.

Modeled schedulers

The scheduler stage consists of the scheduling tag calculation, enqueue and dequeue operations on a priority queue, as well as context information updates of the virtual service measure. Special hardware such as [3, 24, 20], which accelerates the operations of priority queues under specific constraints, has not been considered since we are interested in an implementation on general purpose, programmable hardware. We therefore focus on the influence of a data structure for priority queues proposed in [29] on the performance of the schedulers. The original weighted fair queuing (WFQ) algorithm [23], a self-clocked variant MD-SCFQ [4], and a fixed priority scheduler are considered. We restrict our study to algorithms which only need a single priority queue. It is assumed that the bandwidth of the output link is not overbooked.

Specific parameters

$r_{min}$ [bit/s]: minimally reservable rate in a WFQ system
$r_{max}$ [bit/s]: maximally reservable rate in the WFQ approximation system
$pcs$: number of priority classes (fixed priority scheduler only)
$npc_{p_i,d_i}$: number of elements currently stored in the corresponding priority class queue with priority $p_i$ and destination node $d_i$ (fixed priority scheduler only)
$add_{ip}$ [bit]: number of address bits needed for random accesses in the memory used for the IP header storage
$add_{ss}$ [bit]: number of address bits needed for random accesses in the memory used for the scheduling tag storage
$add_{pl}$ [bit]: number of address bits needed for random accesses in the memory used for the payload storage
$lat_{ip}$ [s]: average memory access latency for packet header RAM
$lat_{ss}$ [s]: average memory access latency for scheduling tag RAM
$ts$ [bit]: scheduling tag precision
$pn_{d_i}$: number of elements currently stored in the corresponding QoS queues for destination node $d_i$ (WFQ / WFQ approximation scheduler only)
$cp$ [bit]: backlog counter precision (WFQ / WFQ approximation scheduler only)
$nsic$: number of service interval classes (WFQ approximation only)
$x$: number of interval subdivisions (WFQ approximation only)

6.1 Fixed Priority scheduler

Enqueue

Since priority classes are statically determined for each IP flow, packet scheduling tag calculations are not required for this scheduler.
However, the packet must still be appended to the corresponding FIFO queue, which collects packets of all flows of a particular priority class with the same destination node. Two possible implementations for this lookup are considered here. Simplification: additional operations and data structures required if a queue becomes backlogged or empty are not considered.

lookup implemented as 2D-array: the lookup table is organized as a 2D-array of pointers using the destination node and the priority class as indices. The pointers direct to the end of the corresponding FIFO queue.

- space required: $(N - 1) \cdot pcs \cdot add_{ip}$
- throughput: $add_{ip}$ per packet
- calculations: one 2D address offset calculation for every packet
- delay per packet: $clk^{-1} + lat_{ip}$

lookup implemented as 2D-vector: the lookup table is organized as a vector of pointers. The destination node is used as index. Each pointer may point to another vector of pointers that is indexed by the priority class. The pointers in these vectors direct to the end of the corresponding FIFO queue.

- worst-case space required: $(N - 1) \cdot add_{ip} \cdot (1 + pcs)$
- throughput: $2 \cdot add_{ip}$ per packet
- calculations: two address offset calculations for the vector accesses are needed for every packet.
- delay per packet: $2 \cdot (clk^{-1} + lat_{ip})$ with $g = add_{ip}$

FIFO queues for the different priority classes (linked lists):

- space: $\sum_{p_i, d_i} npc_{p_i,d_i} \cdot (add_{ip} + add_{pl} + vis)$
- throughput: $(2 \cdot add_{ip} + add_{pl} + vis)$ per packet
- calculations: allocate memory for a new node and adjust two pointers for every packet.
- delay per packet: $clk^{-1} + lat_{ip}$ with $g = add_{ip} + add_{pl} + vis$

Dequeue

Simplification: additional operations and data structures required if a queue becomes backlogged or empty are not considered. That is, all backlogged queues remain backlogged after a dequeue operation.

- space: already considered at the enqueue operation
- throughput: $(2 \cdot add_{ip} + add_{pl} + vis)$ per packet
- calculations: deallocate memory, adjust a pointer for every packet
- delay per packet:
  - introduced by the scheduling algorithm:

    $$d_m = \frac{ipps_{max} \cdot \left(1 + \sum_{p=1}^{m} \sum_{j \in S_p} 1\right)}{r \cdot \left(1 - \sum_{p=1}^{m-1} \mu_p\right)}$$

    is the worst-case delay for a packet in the class of priority $m$ using the bounds given in [37], assuming a constant inter-arrival time for packets of variable length (i.e. $X_{ave} = X_{min}$ in [37]), with $S_p$ the set of flows at priority level $p$, $r$ the rate of the ATM line, $r_i$ the minimal rate of flow $i$, $\mu_p = \frac{ipps_{max}}{ipps_{min}} \cdot \frac{\sum_{j \in S_p} r_j}{r}$, and $\sum_{p=1}^{pcs} \mu_p \le 1$ (schedulable setting).
  - introduced by calculations: $clk^{-1}$
  - introduced by memory accesses for every packet: $lat_{ip}$ with $g = add_{ip} + add_{pl} + vis$

Example

See Tab. 4. The column del.fix summarizes the delay introduced by calculations and memory accesses. The column del.variation stands for the worst-case delay bounds provided by fixed priority scheduling for the highest priority IP flow compared with the lowest priority flow.
The column $sp_{vec}$ only considers the amount of memory needed for the vector lookup. The memory space required for IP headers depends on the amount of UBR traffic and the characteristics of the IP flows such as burst lengths. For instance, supporting up to $2^{13}$ IP flows with an average of two maximum sized packets queued in the system requires 1.2 Mbit of IP header storage.
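The space figures of the fixed priority scheduler are easy to check by hand or with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, N = 16 and pcs = 4 are assumed values, and the field widths (add_ip = 22 bit, add_pl = 22 bit, vis = 28 bit) are the fixed parameters of the fixed priority example.

```python
def array_lookup_space(n_nodes, pcs, add_ip):
    """2D-array of tail pointers: one pointer per (destination, priority class)."""
    return (n_nodes - 1) * pcs * add_ip


def vector_lookup_space(n_nodes, pcs, add_ip):
    """Vector of vectors, worst case: one first-level pointer plus pcs
    second-level pointers per destination node."""
    return (n_nodes - 1) * add_ip * (1 + pcs)


def fifo_space(queued_packets, add_ip, add_pl, vis):
    """Linked-list FIFO entries: next pointer, payload address, VPI/VCI."""
    return queued_packets * (add_ip + add_pl + vis)


if __name__ == "__main__":
    print(array_lookup_space(16, 4, 22))        # bits
    print(vector_lookup_space(16, 4, 22))       # bits
    print(fifo_space(2 ** 13 * 2, 22, 22, 28))  # bits
```

With $2^{13}$ flows and two queued packets each, the FIFO entries take 1,179,648 bit, i.e. roughly the 1.2 Mbit of IP header storage quoted above; the lookup tables are negligible in comparison.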

Fixed model parameters
  $ipps_{max}$   1536 byte   maximal packet size
  $ipps_{min}$   40 byte     minimal packet size
  $vis$          28 bit      VPI/VCI field size
  $add_{pl}$     22 bit      payload addresses
  $add_{ip}$     22 bit      packet header addresses
  $clk$          200 MHz     CPU clock
  $nc$           $2^{13}$    max. number of connections per ATM node

Model parameters | vector lookup, FPS
  $r$ [Mbit/s] | $N$ | $lat_{ip,add} + lat_{ip}$ [ns] | $pcs$ | $sp_{vec}$ [Kbit] | $th$ [Mbit/s] | $ops$ [$10^6$] | del.fix [ns] | del.variation [ms]

Table 4: Fixed priority scheduler example.

WFQ scheduler

Determination of the scheduling tag

A WFQ scheduler [23, 36] uses a tag calculation of the form $V_i = \max(V(a_i), V_{i-1}) + \frac{ipps_i}{r_i}$ with $V(a_i)$: global virtual time measure at the packet arrival time $a_i$, $V_{i-1}$: tag of the preceding packet of the same connection, $ipps_i$, $r_i$: packet size of the current packet and the reserved rate of the connection, respectively (the connection context).
space: the global virtual time should be stored in a register or on-chip RAM. A global virtual time register is needed for every active scheduler in the node, i.e., up to $(N-1)$ registers may be needed. Moreover, the context information is already saved in registers, since its value was determined during the IP router lookup. That is, only the tags of the preceding packet must be stored for every connection.
  tag lookup implemented as table: $nc \cdot ts$

  tag lookup implemented as p-trie, $p$ pointers per node: $\frac{1}{p-1} \cdot (p^{\lceil \log_p nc \rceil} - 1) \cdot p \cdot add_{ss} + nc \cdot ts$
throughput:
  table lookup: $2 \cdot ts$
  trie implementation: $((\lceil \log_p nc \rceil \cdot add_{ss} + ts) + ts)$ for reading and storing tag values
calculations: a maximum calculation of the global virtual time and the tag of the preceding packet, an addition of the context information for every packet (max. [...] times per second).
  table lookup: one address offset calculation for reading out the previous tag. For storing the new tag, the calculated address can be reused.
  trie implementation, $p$ pointers per node: $\lceil \log_p nc \rceil$ address offset calculations for reading out the previous tag. For storing the new tag, the last jump address can be reused.
delay for a packet:
  table lookup: $3 \cdot clk^{-1} + 2 \cdot lat_{ss}$ with $g = ts$
  trie implementation: $(\lceil \log_p nc \rceil + 2) \cdot clk^{-1} + (\lceil \log_p nc \rceil + 2) \cdot lat_{ss}$ with $g = \max(add_{ss}, ts, cp)$

Enqueue operation

The IP header must be inserted into a sorted priority queue according to its scheduling tag. The packets of connections with the same destination node on the ATM line use the same priority queue. A worst-case situation could arise if all $nc$ connections were set up to the same destination node on the ATM line and therefore had to share a single priority queue. The size of the priority queue can be reduced by only looking at the head elements of the connection queues, that is, only the head elements must be sorted according to their priority tags.

A heap organized binary tree [18] is used for the implementation of the priority queue. We have chosen this simple, mature data structure since we do not depend on quick list merge operations, and since heaps can be analyzed well and have symmetric complexity for enqueue and dequeue operations and therefore still compete reasonably well with more sophisticated priority queue implementations [25, 17]. Moreover, since insert and delete operations occur frequently, a heap is preferred over a balanced tree implementation, since additional operations for maintaining a balanced data structure can cause noticeable overhead. However, a tree implementation is chosen instead of an array, since the number of elements to be stored cannot be statically determined.

The binary tree consists of nodes with the following fields: two pointers to the left and right child and a value field. Globally, two values must be stored: the number of elements stored in the tree and a pointer to the root node. The binary coding of the number of nodes is used to traverse the tree in order to find the last inserted element or the new position for insertions. While traversing the tree, pointers to nodes along the path are temporarily saved (max. $\lceil \log_2 nc \rceil$ pointers) in registers in order to quickly find the nodes in the backward direction. Note that the exchange of two tree nodes is done by copying the value fields and not by adjusting pointers. The copying only requires two memory accesses. Adjusting pointers, however, would need up to five memory accesses.
space:
  lookup for destination node: $(N-1) \cdot add_{ip}$ for determining the corresponding priority queue with a simple table lookup.
  heap organized binary tree (p-queue): $\sum_{d_i} (2 \cdot add_{ip} + (ts + add_{pl} + vis + con)) \cdot pn_{d_i}$

throughput:
  lookup: $add_{ip}$ for determining the priority queue.
  heap organized binary tree: $((\lceil \log_2 (nc+1) \rceil + 1) \cdot add_{ip} + \lceil \log_2 (nc+1) \rceil \cdot 2 \cdot (ts + add_{pl} + vis + con) + (ts + add_{pl} + vis + con))$ (find the insertion point, adjust pointer, worst-case number of entry exchanges, write value field of new entry)
calculations (per packet):
  lookup: an address offset calculation for determining the priority queue.
  heap organized binary tree: increment the node counter for $d_i$'s priority queue, allocate memory for the new node, determine the position for insertion (almost no overhead, since a bit in the number of nodes decides whether the left or right child node must be chosen), max. $\lceil \log_2 (nc+1) \rceil$ comparisons of $ts$-bit values
delay: $(\lceil \log_2 (nc+1) \rceil + 2) \cdot clk^{-1} + (3 \cdot \lceil \log_2 (nc+1) \rceil + 2) \cdot lat_{ip}$ with $g = ts + add_{pl} + vis + con$. Accesses may be divided into small address field accesses ($g = add_{ip}$) and large information field accesses. Then, the formula can be refined to $(\lceil \log_2 (nc+1) \rceil + 2) \cdot clk^{-1} + (\lceil \log_2 (nc+1) \rceil + 1) \cdot lat_{ip,add} + (2 \cdot \lceil \log_2 (nc+1) \rceil + 1) \cdot lat_{ip}$

Dequeue operation

The schedulers of the priority queues all together must schedule packets at most [...] times per second. A WFQ scheduler needs to find the packet with the highest priority in the priority queue, which usually means finding the packet with the smallest scheduling tag.
space: the space requirements are already considered at the enqueue operation.
throughput (needed by the p-queue): $((\lceil \log_2 nc \rceil + 1) \cdot add_{ip} + \lceil \log_2 (nc-1) \rceil \cdot 2 \cdot (ts + add_{pl} + vis + con) + (ts + add_{pl} + vis + con) + (add_{pl} + vis + con))$ (find the deletion point, reset pointer, worst-case number of entry exchanges, write value field, read value field of top element)
calculations (needed by the p-queue): the packet entry with the smallest scheduling tag can be found quickly at the root of the binary tree. However, the entry at the last inserted position is exchanged with the root and enqueued again in the binary tree to maintain the heap organization. The last inserted position can be determined from the number of nodes in the tree. The needed operations for every packet are: $\lceil \log_2 nc \rceil$ comparisons of $ts$-bit values, decrement the node counter, deallocate the memory of the last inserted node.
delay for the head-of-line packet of a flow (this corresponds to the total delay if the IP sources were constant bit-rate sources), caused by calculations and memory accesses: $(\lceil \log_2 nc \rceil + 1) \cdot clk^{-1} + (\lceil \log_2 nc \rceil + 2 \cdot \lceil \log_2 (nc-1) \rceil + 3) \cdot lat_{ip}$. Again, accesses may be divided into small address field accesses and large information field accesses. Then, the formula can be refined to $(\lceil \log_2 nc \rceil + 1) \cdot clk^{-1} + (\lceil \log_2 nc \rceil + 1) \cdot lat_{ip,add} + (2 \cdot \lceil \log_2 (nc-1) \rceil + 2) \cdot lat_{ip}$.
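The heap organized binary tree used by the enqueue and dequeue operations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the names `Node`, `TreeHeap`, `push` and `pop` are hypothetical, a plain comparable value stands in for the value field (scheduling tag plus packet information), and memory allocation is left to the Python runtime. The sketch follows the three ideas of the description: nodes carry two child pointers and a value field, the binary coding of the node count selects the path to the last position, and exchanges copy value fields instead of relinking nodes.

```python
class Node:
    """Tree node: two child pointers and a value field."""
    __slots__ = ("left", "right", "value")

    def __init__(self, value):
        self.left = None
        self.right = None
        self.value = value


class TreeHeap:
    """Min-heap stored as a pointer-based binary tree (illustrative sketch)."""

    def __init__(self):
        self.root = None   # global pointer to the root node
        self.count = 0     # global number of stored elements

    def push(self, value):
        self.count += 1
        new = Node(value)
        if self.count == 1:
            self.root = new
            return
        # Bits of the node count after the leading 1 give the path:
        # '0' selects the left child, '1' the right child.
        bits = bin(self.count)[3:]
        path = [self.root]                      # nodes saved along the path
        for bit in bits[:-1]:
            node = path[-1]
            path.append(node.left if bit == "0" else node.right)
        parent = path[-1]
        if bits[-1] == "0":
            parent.left = new
        else:
            parent.right = new
        # Move the new value upwards by copying value fields.
        child = new
        for parent in reversed(path):
            if parent.value <= child.value:
                break
            parent.value, child.value = child.value, parent.value
            child = parent

    def pop(self):
        if self.count == 0:
            raise IndexError("pop from empty heap")
        top = self.root.value
        if self.count == 1:
            self.root = None
            self.count = 0
            return top
        # Locate and unlink the node at the last inserted position.
        bits = bin(self.count)[3:]
        parent = self.root
        for bit in bits[:-1]:
            parent = parent.left if bit == "0" else parent.right
        if bits[-1] == "0":
            last, parent.left = parent.left, None
        else:
            last, parent.right = parent.right, None
        self.count -= 1
        # Re-insert its value at the root and restore the heap order.
        self.root.value = last.value
        node = self.root
        while True:
            smallest = node
            for child in (node.left, node.right):
                if child is not None and child.value < smallest.value:
                    smallest = child
            if smallest is node:
                break
            node.value, smallest.value = smallest.value, node.value
            node = smallest
        return top
```

Pushing a sequence of tags and popping until the heap is empty returns them in ascending order, so the root always holds the smallest scheduling tag. The path length, and therefore the number of comparisons and value copies per operation, grows with the logarithm of the node count, which is the dependency the throughput and delay formulas express.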

delay caused by the scheduling algorithm characteristics: $\frac{ipps_i}{r_i} + \frac{ipps_{max}}{r}$. The term $\frac{ipps_i}{r_i}$ results from the guaranteed transport delay. The second term considers the situation where a maximum-sized packet has been chosen for transmission before the arrival of the current packet.

Context information update

Maintaining and updating the global virtual time $V$ of the corresponding emulated fluid fair queuing system basically needs four registers per scheduler (i.e. active destination node): a register for storing $V$, a register for storing the sum of the reserved rates ($\sum_{i \in B(t)} r_i$) of all backlogged connections, a register for storing the system time of the last virtual clock update, and a register for storing the time at which $V$ must be updated next because a packet has been served in the fluid system. An update of $V$ is performed as follows: $V(t + dt) := V(t) + dt \cdot \frac{r}{\sum_{i \in B(t)} r_i}$. $B(t)$ is the set of backlogged connections in the fluid system at time $t$. The time for the next update of $V$ if no more packets arrive is given by $next(t) = t + (V_{i,min} - V(t)) \cdot \frac{\sum_{i \in B(t)} r_i}{r}$. Note that the scheduler needs its own priority queue for the emulation of the fluid system since the contents of the fluid system priority queue and the WFQ priority queue may differ. Moreover, only average bounds are determined since they are calculated on a per-packet basis. In the worst case, all virtually backlogged connections in the fluid system may be released at the same point of time and generate considerable bursts of events.
space:
  counter lookup: $nc \cdot cp$. Lookup in a vector of counters which states how many packets are still in the corresponding packet queue of the connection.
  priority queue for determining $V_{i,min}$ in the fluid system (heap organized binary tree): $\sum_{d_i} (2 \cdot add_{ss} + ts) \cdot pn_{d_i}$
throughput:
  counters: $4 \cdot cp$ (enqueue and dequeue). It is assumed that at least one byte must be read or written.
  p-queue enqueue and dequeue operations: $2 \cdot ((\lceil \log_2 (nc+1) \rceil + 1) \cdot add_{ss} + \lceil \log_2 (nc+1) \rceil \cdot 2 \cdot ts + ts)$
calculations: after each calculation of $next(t)$ a timer is set with the difference of $next(t)$ and the current system time.
  enqueue (packet arrival): in the worst case, a (virtual) connection queue becomes backlogged which was empty before. In order to determine the backlog state, an address offset calculation is needed. Moreover, the backlog counter must be incremented. Then, the sum register might be updated by the amount of the reserved rate of that connection if the queue was empty before. An addition operation is needed for that. Moreover, the ratio register must be recalculated with a division. $V$ can be updated by another addition. Last, the calculation of $next(t)$ needs a subtraction, a multiplication, and an addition.
  dequeue (in the emulated fluid system): after the dequeue operation, a (virtual) connection queue may no longer be backlogged. The backlog state must be adjusted accordingly. The backlog counter must be updated and therefore an address offset calculation and a decrement operation are required. The registers must be updated accordingly, that is, two subtractions, a division, a multiplication, and two additions are needed in the worst case.
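The register updates performed on enqueue and dequeue events of the emulated fluid system can be sketched as follows. The sketch assumes that the update rule reads $V(t+dt) = V(t) + dt \cdot r / \sum_{i \in B(t)} r_i$ and that $next(t) = t + (V_{i,min} - V(t)) \cdot \sum_{i \in B(t)} r_i / r$, with $r$ the line rate; the class name `FluidClock` and its method names are hypothetical, and keeping $V$ unchanged while no connection is backlogged is a simplifying assumption of this sketch.

```python
class FluidClock:
    """Registers of one scheduler for tracking the global virtual time V."""

    def __init__(self, line_rate):
        self.line_rate = line_rate  # r, assumed to be the line rate
        self.v = 0.0                # global virtual time V
        self.rate_sum = 0.0         # sum of reserved rates of backlogged connections
        self.last_update = 0.0      # system time of the last update of V
        self.backlog = {}           # backlog counter per connection

    def advance(self, now):
        """Bring V up to date for the system time `now`."""
        if self.rate_sum > 0:
            elapsed = now - self.last_update
            self.v += elapsed * self.line_rate / self.rate_sum
        # Assumption: V stays unchanged while nothing is backlogged.
        self.last_update = now

    def enqueue(self, now, conn, rate):
        """Packet arrival for connection `conn` with reserved rate `rate`."""
        self.advance(now)
        count = self.backlog.get(conn, 0)
        if count == 0:              # queue was empty: it becomes backlogged
            self.rate_sum += rate
        self.backlog[conn] = count + 1

    def dequeue(self, now, conn, rate):
        """A packet of `conn` has been served in the fluid system."""
        self.advance(now)
        self.backlog[conn] -= 1
        if self.backlog[conn] == 0:  # queue is no longer backlogged
            self.rate_sum -= rate

    def next_update(self, now, v_min):
        """System time at which V reaches the smallest pending tag `v_min`."""
        return now + (v_min - self.v) * self.rate_sum / self.line_rate


if __name__ == "__main__":
    clock = FluidClock(line_rate=100.0)   # illustrative numbers only
    clock.enqueue(0.0, "a", 25.0)
    clock.enqueue(1.0, "b", 25.0)
    print(clock.v, clock.next_update(1.0, v_min=6.0))
```

In a real node the value returned by `next_update` would be used to set the timer mentioned above; the sketch only models the arithmetic on the registers and the backlog counters.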

  p-queue operations (calculations): $2 \cdot (\lceil \log_2 (nc+1) \rceil + 1)$ (comparisons of $ts$ values and increment/decrement of the node counter)
delay per packet: $(\ldots (\lceil \log_2 (nc+1) \rceil + 1)) \cdot clk^{-1} + (4 + 2 \cdot (3 \cdot \lceil \log_2 (nc+1) \rceil + 2)) \cdot lat_{ss}$

Example

See Tab. 5. The additional delay (besides the transport delay) introduced by the characteristics of the scheduling algorithm ($\frac{ipps_{max}}{r}$) is about 80 µs for a line rate of 155 Mbit/s and has not been considered in the column del. Again, two access types have been distinguished. Short accesses for traversing the tree structure of the heap only require a $lat_{ip,add}$ delay. Long transfers which access the value fields of the tree nodes need a $lat_{ip}$ delay. Two packets per connection are queued in the system on average.

Fixed model parameters
  $ipps_{min}$   40 byte      minimal packet size
  $vis$          28 bit       VPI/VCI field size
  $con$          14 bit       flow context information
  $add_{pl}$     22 bit       payload addresses
  $add_{ip}$     22 bit       packet header addresses
  $clk$          200 MHz      CPU clock
  $r_{min}$      128 Kbit/s   minimal supported rate of a flow
  $cp$           16 bit       backlog counter precision
  $ts$           32 bit       scheduling tag size

Model parameters | WFQ, average case
  $r$ [Mbit/s] | $N$ | $nc$ | $lat_{ss}$ [ns] | $lat_{ip,add} + lat_{ip}$ [ns] | $sp_{ip}$ [Kbit] | $sp_{ss}$ [Kbit] | $th_{ip}$ [Mbit/s] | $th_{ss}$ [Mbit/s] | $ops_{ip}$ [$10^6$] | $ops_{ss}$ [$10^6$] | $del_{ip}$ [µs] | $del_{ss}$ [µs]

Table 5: WFQ scheduler example.

6.3 WFQ approximation

Connections are grouped according to their service intervals, as suggested in [29], which are defined by $\Phi_i := \frac{ipps_i}{r_i}$. $ipps_i$ is the size of the packet at the head of the FIFO queue of connection $i$. In order to restrict the number of groups, the set of all possible service intervals must be reduced to a bounded set, accepting a certain penalty in terms of delay and fairness. The range of possible service intervals is uniformly exponentially spanned by a set of $G$ groups with a granularity, denoted $\gamma$ here: $\Phi_{max} = \gamma^{G-1} \cdot \Phi_{min}$. The corresponding group $g$ of a packet is therefore determined by $g = \lfloor \log_\gamma \frac{\Phi_i}{\Phi_{min}} \rfloor + 1$. Thus, when a packet of a connection has been served, the connection might belong to another service interval group. However, using this grouping scheme and accepting a certain delay penalty, sorting the heads of the connection queues within a group according to scheduling tags can be easily achieved by a simple linked list and two additional pointers, see Fig. 4. Moreover, the scheduler only needs to look at the scheduling tags of one packet of each group (the gray elements in Fig. 4), not at the tags of all packets in the system. The scheduling tags within a service interval group can span a range determined by $ipps_{max}$ and $\Phi_g$, which also is the maximum delay penalty introduced by this grouping scheme. This value can be decreased by further subdividing the range of scheduling tags within a group linearly into $x$ subgroups. The corresponding subgroup can be determined by another division operation. A priority queue per group keeps track of the head elements of the subgroups.

Figure 4: Sorting data structure (service intervals, flow queues).

Scheduling tag calculation

The described data structure can be applied for a variety of WFQ variants.

Enqueue operation

Using the connection identifier (VPI/VCI pair), a lookup is performed which determines the position of the end of the connection queue. If the connection queue is empty, the packet is added at the end of the linked list of the corresponding service interval group/subgroup to be the head element in its connection queue. The lookup vector must be adjusted to point to the new end of the corresponding connection queue, as must the pointer which directs to the end of the linked list of a service interval group. Moreover, the packet may be enqueued in the group's priority queue and the scheduler's priority queue in the worst case. These transfers have been considered in the dequeue operation.
space:
  lookup vector (destination node): $(N-1) \cdot add_{ip}$ for determining the responsible scheduler
  lookup vector (end of flow queue): $nc \cdot add_{ip}$
  service interval class lookup: $2 \cdot add_{ip} \cdot nsic \cdot (N-1)$
  list and queue elements: $\sum_{d_i} (2 \cdot add_{ip} + (add_{pl} + vis + con + ts)) \cdot \max(pn_{d_i} - nsic - nsic \cdot x, 0)$
  priority queue elements (scheduler), heap organized array [18]: $(N-1) \cdot nsic \cdot (ts + add_{ip} + vis + con + \lceil \log_2 nsic \rceil)$
  subgroup priority queues ($x$ entries per group): $(N-1) \cdot nsic \cdot x \cdot (ts + add_{ip} + vis + con + \lceil \log_2 x \rceil)$
throughput:
  determining the scheduler: $add_{ip}$
  queuing: $(2 \cdot add_{ip} + (add_{pl} + vis + con + ts) + 2 \cdot add_{ip})$ (two lookups, writing of the value field, pointer updates)
calculations for every packet: an address offset calculation for determining the responsible scheduler, an address offset calculation for determining the state of the corresponding connection queue, calculation of the matching service interval class (a division, a log operation (counted as 10 calculations), a floor operation and an increment) and subgroup (a division), an address offset calculation for determining the end of the linked list if the connection queue is empty, allocate memory for the new entry, update pointers in the class lookup vector and at the end of the list.
delay per packet: $18 \cdot clk^{-1} + 6 \cdot lat_{ip}$ with $g = \max(add_{ip}, add_{pl} + vis + con + ts)$. Accesses may be divided into small address field accesses ($g = add_{ip}$) and large information field accesses. Then, the formula can be refined to $18 \cdot clk^{-1} + 5 \cdot lat_{ip,add} + lat_{ip}$.
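The calculation of the matching service interval class can be illustrated with a short sketch. It assumes that the class index is $g = \lfloor \log_\gamma (\Phi_i / \Phi_{min}) \rfloor + 1$ with $\Phi_i = ipps_i / r_i$, where the granularity symbol $\gamma$, the function name and its parameter names are hypothetical choices made for this illustration. Instead of a division, a logarithm and a floor operation, the sketch compares $\Phi_i$ against successive group boundaries, which gives the same index without floating-point logarithm rounding; indices are clamped to the range 1 to $G$.

```python
def service_interval_group(packet_bits, rate, phi_min, gamma, groups):
    """Return the service interval group index g in 1..groups.

    Equivalent to floor(log_gamma(phi / phi_min)) + 1, clamped to the
    available groups, with phi = packet_bits / rate.
    """
    phi = packet_bits / rate          # service interval of the head packet
    g = 1
    upper = phi_min * gamma           # upper boundary of group 1
    while phi >= upper and g < groups:
        upper *= gamma                # boundaries grow exponentially
        g += 1
    return g


if __name__ == "__main__":
    # Illustrative numbers only: phi_min = 1, gamma = 2, G = 8.
    for bits in (1, 2, 7.9, 8, 1000):
        print(bits, service_interval_group(bits, 1, 1.0, 2, 8))
```

Because the group depends on the size of the packet at the head of the connection queue, the function has to be evaluated again whenever a new packet becomes the head element, which is why a connection can move between groups after a packet has been served.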
