Packet Scheduling in a Low-Latency Optical Interconnect with Electronic Buffers


Lin Liu, Dept. of Electrical & Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
Zhenghao Zhang, Computer Science Department, Florida State University, Tallahassee, FL 30306, USA
Yuanyuan Yang, Dept. of Electrical & Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA

ABSTRACT

Optical interconnect architectures with electronic buffers have been proposed as a promising candidate for future high speed interconnections. Among these architectures, the OpCut switch [1] achieves low latency and minimizes optical-electronic-optical (O/E/O) conversions by allowing packets to cut through the switch whenever possible. In an OpCut switch, a packet is converted and sent to the recirculating electronic buffers only if it cannot be directly routed to the switch output. In this paper, we study packet scheduling in the OpCut switch, aiming to achieve overall low packet latency while maintaining packet order. We first decompose the scheduling problem into three modules and present a basic scheduler with satisfactory performance. To relax the time constraint on computing a schedule and improve system throughput, we further propose a mechanism to pipeline packet scheduling in the OpCut switch by distributing the scheduling task to multiple sub-schedulers. An adaptive pipelining scheme is also proposed to minimize the extra delay introduced by pipelining. Our simulation results show that the OpCut switch with the proposed scheduling algorithms achieves performance close to that of the ideal output-queued (OQ) switch in terms of packet latency, and that the pipelined mechanism is effective in reducing scheduler complexity and improving throughput.

Index Terms: Optical interconnect, packet scheduling, pipelined algorithm.

I. INTRODUCTION

In recent years, interconnects have drawn increasingly more attention because they tend to become a bottleneck at all levels: intra-chip, chip-to-chip, board level, and computer networks. There are many requirements posed on an interconnect, such as low latency, high throughput, low error rate, and low power consumption, as well as scalability.
Finding a solution that can satisfy all these needs is a non-trivial task. Optical fibers, featuring high bandwidth and low error rate, are widely recognized as the ideal medium for interconnects. Some optical interconnect prototypes have been built and exhibited, for example, the recent PERCS (Productive, Easy-to-use, Reliable Computing System) project [4] and the OSMOSIS (Optical Shared Memory Supercomputer Interconnect System) project [3][4][5] at IBM. It has gradually become a consensus that future high speed interconnects should exploit as many of the advantages optics can provide as possible.

(© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.)

One of the current major hurdles in optical interconnect design is the absence of a component equivalent to the electronic random access memory, i.e., an optical RAM. Currently the most common approach to buffering in the optical domain is letting the signal go through an extra segment of fiber, namely, a fiber delay line (FDL). An FDL generates a fixed buffering delay for any optical signal, which is in fact the propagation delay for the signal to traverse the FDL. To provide flexible delays, FDLs have to be combined with switches. Extensive research has been devoted to the realization of large, powerful all-optical buffers [9][10][11], but random accessibility is still absent. Alternatively, there are emerging techniques that aim at slowing light down, for example, [7][8]. While these works present interesting results towards implementing optical buffers with continuous delay, it is still unclear whether slow light can provide sufficiently large bandwidth and buffering capacity to be used in practical systems. Therefore, at present the electronic buffer seems to be the only feasible option for providing practical buffering capacities.
Another core component of an interconnect is the packet scheduler. The scheduling process introduces extra delay because the scheduling algorithm may need time to converge. In addition, as the scheduler is usually located at the center of the switch, if an input port must receive a grant from the scheduler before sending packets, the round-trip delay between the input port and the scheduler becomes significant in low latency applications. For example, it was estimated in [2], [5] that the round-trip delay can be as long as 64 time slots, where a time slot is about 50 ns, the time needed to transmit a packet.

To address these challenges, a low latency switching architecture that combines optical switching with a recirculating electronic buffer was recently proposed in [1]. In the following, we simply refer to it as the OpCut (Optical Cut-through) switch. Fig. 1(a) shows a high-level view of the switch. The key feature of the OpCut switch is that arriving optical packets are routed to the output directly, i.e., cut through the switch, whenever possible. Only those that cannot cut through are sent to the receivers, converted to electronic signals and buffered; they can be sent to the output ports later by the optical transmitters. By allowing packets to cut through, the OpCut switch significantly reduces both packet latency and O/E/O conversions. Also, the OpCut switch does not hold any packet at the input ports. It always sends packets to the switching fabric, because with the number of receivers equal to the number of input ports, a receiver can always be found to pick up a packet that needs to be buffered. Hence, the OpCut architecture eliminates the round-trip delay between the input ports and the scheduler.

Fig. 1. (a) A high-level view of the OpCut switch. (b) A possible implementation of the OpCut switch.

For these reasons, the OpCut switch architecture holds the potential of achieving very low latency. In addition, some recent research, for example [6], also shows that such hybrid optical packet switching with a shared electronic buffer could lead to substantial savings in power and cost over all-optical or all-electrical technologies.

The feasibility and performance of the OpCut architecture were studied in [1] under a simple random packet scheduling algorithm. While the random algorithm is convenient for analytical modeling, it is impractical because it is difficult to implement random functions at high speed [6]. In addition, the random scheduling algorithm cannot maintain packet order, which is generally desired in switches [2], [3].

There has been a lot of research on scheduling algorithms for packet switches. One of the most extensively studied topics is packet scheduling for the input-queued (IQ) switch. The IQ switch is usually assumed to work in a time-slotted fashion. Packets are buffered at each input port according to their destined output port, i.e., in virtual output queues (VOQs). The scheduling problem for a time slot is formalized as a bipartite matching problem between the input ports and the output ports of the switch. Existing scheduling algorithms for IQ switches can be divided into two categories: optimal algorithms based on maximum weighted matching, and fast algorithms based on maximal sized matching. The first category includes algorithms such as Longest Queue First (LQF) and Oldest Cell First (OCF) [2]. These algorithms have impractically high computing complexity, but are of theoretical importance as they deliver 100% throughput under virtually any admissible traffic. The second category includes, for example, Round Robin Greedy Scheduling (RRGS) [7], Parallel Iterative Matching (PIM) [5], and iSLIP [6].
These algorithms only look for a maximal matching in each time slot and hence have practical time complexity. They are therefore preferred in real systems, although they can only give sub-optimal schedules.

However, in a high speed or ultra high speed system, even these maximal matching scheduling algorithms may become a potential problem. As the length of a time slot shrinks with the increase in line card rate, it may become too stringent to calculate a schedule within each single time slot. In such a case, pipelined scheduling is needed to relax the time constraint. In [7] a pipelined version of the RRGS algorithm was reported, in which each input port is assigned a scheduler. The scheduler of an input port selects an idle output port and passes the result to the next input port. The process goes on until all input ports have been visited. However, this approach introduces an extra delay equal to the switch size. In [8] the pipelined maximal-sized matching algorithm (PMM) was proposed, which employs multiple identical schedulers. Each of these schedulers independently works towards a schedule for a future time slot. As pointed out in [9], PMM is more a parallelization than a pipeline, since there is no information exchange between the schedulers. [9] further proposed to pipeline the iterations of iterative matching algorithms such as PIM and iSLIP by adopting multiple sub-schedulers, each of which takes care of one single iteration and passes the intermediate result to the next sub-scheduler in the pipeline. One problem with this approach is that it may generate grants for transmission from an empty VOQ, since at any time a sub-scheduler has no idea about the progress at the other sub-schedulers and may try to schedule a packet that has already been scheduled by another sub-scheduler. As a result, the service a VOQ receives may exceed its actual needs and is wasted.

In this paper, we study the packet scheduling problem in the OpCut switch. The rest of this paper is organized as follows. Section II introduces the OpCut switch architecture. Section III presents the basic packet scheduler for the OpCut switch. Section IV proposes a mechanism to pipeline the scheduling procedure, including an adaptive pipelining scheme. Section V presents the simulation results. Section VI compares the OpCut switch with some recently proposed architectures. Finally, Section VII concludes the paper.
II. THE OPCUT SWITCH

In this section we briefly describe the architecture of the OpCut switch. The OpCut architecture may adopt any switching fabric that provides non-blocking connections from the inputs to the outputs. One possible switching fabric is shown in Fig. 1(b). It has N input fibers and N output fibers, numbered from 1 to N, and there is one wavelength on each fiber. In the following, we use input (output) fiber, input (output) port, and input (output) interchangeably. Each input fiber is fed to an amplifier. The amplified signal is broadcast to the N output fibers under the control of SOA gates. In addition, each signal is also broadcast to the N receivers. An output fiber receives the signal and routes it to the processor or the next stage switch. A receiver converts an optical packet into electronic form and stores it in its buffer. There is one transmitter per receiver buffer. A transmitter can read the packets and broadcast a selected packet to the output fibers, also under the control of SOA gates, such that the packets in the buffer may be sent out.

In this paper, we follow the same assumptions as other optical switch designs: the switch works in time slots, and packets have fixed length and fit in exactly one time slot. The length of a time slot is about 50 ns, similar to that in the OSMOSIS switch [2], [5]. Before receiving the packets, the switch is informed about the destinations of the packets. This can be achieved, for example, by letting a group of processors share an electronic connection to send the packet headers to the switch. The headers are sent to the switch before the packets, to allow the switch to make routing decisions and configure the connections. Note that the cost associated with this electronic connection is likely to be small, because the link is shared by multiple processors and the link speed can be much lower than that of the data link. At the beginning of each time slot, up to one packet may arrive in optics at each input port. We define a flow as the stream of packets from the same input port and destined for the same output port.

Unlike other optical switches with electronic buffers, in which every packet goes through O/E/O conversions, in the OpCut switch packets are converted between optics and electronics only when necessary. Whenever possible, arriving optical packets are directly sent to their desired output port, i.e., they cut through the switch. A packet that cannot cut through is picked up by one of the receivers and sent to the electronic buffer. Later, when its destined output port is not busy, the packet can be fetched from the electronic buffer, converted back into optics, and scheduled to the switch output. In each time slot, a receiver picks up at most one packet, and a transmitter sends out at most one packet. In other words, there is no speedup requirement.

The cost of the OpCut switch is mainly determined by the number of transmitters, the number of receivers, and the size of the switching fabric. It can be seen that the OpCut switch needs N transmitters and N receivers. The switch fabric is N × 2N + N × N, where the N × 2N part connects the inputs to the outputs and receivers, while the N × N part connects the transmitters to the outputs. To connect more processors, multiple OpCut switches can be connected according to a certain topology, similar to the InfiniBand switch.
III. THE BASIC PACKET SCHEDULER FOR THE OPCUT SWITCH

The key challenge in achieving low latency in a switch is the design of the packet scheduling algorithm. Due to its feedback buffer structure, existing scheduling algorithms cannot be directly applied to the OpCut switch. Keeping packet order also becomes more challenging in the OpCut switch. For example, in input-queued switches, packets belonging to the same flow are stored in the same virtual output queue (VOQ), thus packet order is preserved as long as each VOQ works as a FIFO. In the OpCut switch, however, packets from the same flow may be picked up by different receivers. The scheduling algorithm for an OpCut switch should answer the following three questions:

Question 1: For the newly arrived packets, whether they may go to the output port directly or should be buffered.

Question 2: For a packet that should be buffered, which receiver should pick it up.

Question 3: For the output ports that are not receiving new packets, which of the buffered packets may be sent to them.

This section is organized around how these three questions can be answered. We start with the notations and the basics of the scheduler.

A. Notations and Basics of the Scheduler

In an OpCut switch, input i is denoted as I_i, output j is denoted as O_j, and receiver r is denoted as R_r. Flow_ij is defined as the stream of packets arriving at I_i destined to O_j. We also refer to the time slot in which a packet arrives at the switch input as the timestamp of that packet. Among all packets of a flow currently at the switch input or in the buffer, the one with the oldest timestamp is referred to as the head-of-flow packet. Maintaining packet order means that a packet must be a head-of-flow packet at the instant it is transmitted to the switch output.

The OpCut scheduler adopts round-robin scheduling whenever it has multiple choices and has to choose one. Similar to [6], the round-robin scheduler takes a binary vector as input and maintains a pointer to make the decisions.
Let [r_1, r_2, ..., r_N] be the input binary vector and let g be the current round-robin pointer, where 1 ≤ g ≤ N. The scheduler picks the highest priority element, defined as the first one encountered when searching the elements of the vector starting from r_g in ascending order of the indices, wrapping around back to r_1 after reaching r_N. Incrementing the round-robin pointer g by one beyond x means that g ← x + 1 if x < N, and g ← 1 if x = N.

B. Queueing Management

For each output port, the scheduler of the OpCut switch keeps the information of the packets that are in the buffer and destined to that output port in a virtual input queue (VIQ) style. Basically, for output O_j, the scheduler maintains N queues denoted as F_ij for 1 ≤ i ≤ N. For each packet that arrived at I_i, is destined for O_j, and is currently buffered, F_ij maintains its timestamp as well as the index of the buffer the packet is in. Note that F_ij does not hold the actual packets; packets are stored at the receiver buffers.

It would make the scheduling much easier if each receiver maintained a dedicated queue for each flow. However, this would result in N^3 queues over all receivers and would lead to much higher cost, which is unlikely to be practical when N is large. Instead, no queue is maintained in any receiver buffer, and an auxiliary array is adopted in each buffer to facilitate locating a specific packet. The auxiliary array is indexed by (partial) timestamps and stores the locations of packets in the buffer. Since in each time slot a receiver picks up at most one packet, it is able to locate a packet in constant time given the timestamp of the packet and the auxiliary array. Note that some elements of the array may be empty, but the packets can always be stored contiguously in the buffer. As an implementation detail, the auxiliary array can be used in a wrap-around fashion and thus does not need a very large capacity. For instance, if the index of the array is 8 bits long, then the array stores the locations of up to 256 packets. Consequently, only the lower 8 bits of a timestamp are needed to locate a packet in the buffer.
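As an illustration, the wrap-around auxiliary array can be sketched as follows (a minimal Python sketch under the 8-bit-index assumption; the class and method names are illustrative, and packet removal/compaction is omitted):

```python
class ReceiverBuffer:
    """Sketch of a receiver buffer with a timestamp-indexed auxiliary
    array. With an 8-bit index, up to 256 packets can be tracked."""
    IDX_BITS = 8
    SIZE = 1 << IDX_BITS                 # 256 slots

    def __init__(self):
        self.packets = []                # packets stored contiguously
        self.aux = [None] * self.SIZE    # aux[ts & 0xFF] -> position in self.packets

    def store(self, timestamp, packet):
        key = timestamp & (self.SIZE - 1)
        if self.aux[key] is not None:
            # A packet picked up at least 256 time slots earlier is still
            # buffered: this indicates heavy congestion, and one of the
            # two packets would be discarded.
            raise OverflowError("auxiliary-array index collision")
        self.aux[key] = len(self.packets)
        self.packets.append(packet)

    def locate(self, timestamp):
        """Constant-time lookup using only the low 8 bits of the timestamp."""
        pos = self.aux[timestamp & (self.SIZE - 1)]
        return None if pos is None else self.packets[pos]
```

Since a receiver picks up at most one packet per time slot, two buffered packets can collide in the array only if their timestamps are at least 256 slots apart, which matches the congestion case discussed next.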

A conflict occurs only if a packet is routed to a receiver while another packet picked up by the same receiver at least 256 time slots earlier is still in the buffer. When this is the case, it usually indicates heavy congestion, hence it is reasonable to discard one of the packets.

C. The Basic Scheduling Algorithm

Next we describe a basic scheduling algorithm for the OpCut switch. To maintain packet order, the basic algorithm adopts a simple strategy: it allows a packet to be sent to an output only if the packet is a head-of-flow packet. The basic algorithm consists of three parts, each answering one of the three questions.

C.1 Part I: Answering Question 1

For Question 1, the basic scheduling algorithm consists of two steps.

Step 1: Request. If a packet arrives at I_i destined to O_j, the scheduler checks F_ij. If it is empty, I_i sends a request to O_j; otherwise, I_i does not send any request.

Step 2: Grant. If O_j receives any requests, it chooses one to grant in a round-robin manner. That is, it receives a binary vector representing the requests sent by the inputs, picks the highest priority element based on its round-robin pointer, and grants the corresponding input. Then it increments its round-robin pointer by one beyond the granted input.

In each time slot, since at most one packet arrives at each input, an input needs to send a request to at most one output and will receive no more than one grant. Therefore, the input will send the packet (i.e., let the packet cut through) as long as it receives a grant. The entire cut-through operation can be done by a single iteration of any iterative matching algorithm.

C.2 Part II: Answering Question 2

For Question 2, the basic scheduling algorithm simply connects the inputs to the receivers according to the following schedule: at time slot t, the packet from I_i is sent to R_r, where r = [(i + t) mod N] + 1. Note that instead of a fixed one-to-one connection, the inputs are connected to the receivers in a round-robin fashion for better load balancing. As an example, according to our simulations, in an 8 × 8 OpCut switch with one fully loaded input port, the maximum overall throughput is around 0.85 if the inputs are connected to the receivers in the above way, versus 0.70 with a fixed connection.
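The receiver assignment above is a pure function of the input index and the current time slot; a minimal sketch (the function name is illustrative):

```python
def receiver_for_input(i, t, n):
    """Receiver assignment of the basic scheduler: at time slot t the
    packet from input I_i (1-based, 1 <= i <= n) goes to receiver R_r,
    r = ((i + t) mod n) + 1."""
    return ((i + t) % n) + 1
```

Two properties follow directly: within any single time slot the n inputs are assigned n distinct receivers (a one-to-one connection), and over any n consecutive time slots each input visits every receiver exactly once, which is the load-balancing effect mentioned above.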
C.3 Part III: Answering Question 3

For Question 3, the scheduler requires one decision-making unit for each output and one decision-making unit for each buffer. It then runs the well-known iSLIP algorithm [6] between the receivers and the outputs. Each iteration of the algorithm consists of three steps:

Step 1: Request. Each unmatched output sends a request to every buffer that stores a head-of-flow packet destined to this output.

Step 2: Grant. If a buffer receives any requests, it chooses one to grant in a round-robin manner. That is, it receives a binary vector representing the requests sent by the outputs, picks the highest priority element based on its round-robin pointer, and grants the corresponding output. The pointer is incremented to one location beyond the granted output if and only if the grant is accepted in Step 3.

Step 3: Accept. If an output receives any grants, it chooses one to accept in a round-robin manner. That is, it receives a binary vector representing the grants sent by the buffers, picks the highest priority element based on its round-robin pointer, and accepts the grant from the corresponding buffer. Then it increments its round-robin pointer by one beyond the granted buffer.

At the end of the algorithm, the scheduler informs each buffer which packet to transmit. It does so by sending the portion of the packet's timestamp that is needed for the buffer to locate the packet. With that information, the targeted packet can be found in constant time and sent through the transmitter. The switch is configured accordingly to route the packets to their destined output ports.
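A single request-grant-accept iteration of this kind can be sketched as follows (an illustrative Python sketch, not the paper's implementation; sets of 1-based indices stand in for the binary vectors, and the pointer-update rule on acceptance follows the description above):

```python
def rr_pick(candidates, pointer, n):
    """Round-robin arbiter: return the first member of `candidates`
    (1-based indices) found when scanning from `pointer`, wrapping
    around after n; None if `candidates` is empty."""
    for k in range(n):
        idx = ((pointer - 1 + k) % n) + 1
        if idx in candidates:
            return idx
    return None

def islip_iteration(requests, grant_ptr, accept_ptr, n):
    """One request-grant-accept iteration between unmatched outputs and
    buffers. `requests[j]` is the set of buffers holding a head-of-flow
    packet for output j. Returns the partial matching {output: buffer}."""
    # Step 1: requests, regrouped per buffer.
    per_buffer = {}
    for j, bufs in requests.items():
        for b in bufs:
            per_buffer.setdefault(b, set()).add(j)
    # Step 2: each requested buffer grants one output, round-robin.
    grants = {}                      # output -> set of granting buffers
    for b, outs in per_buffer.items():
        j = rr_pick(outs, grant_ptr[b], n)
        grants.setdefault(j, set()).add(b)
    # Step 3: each granted output accepts one buffer, round-robin.
    matches = {}
    for j, bufs in grants.items():
        b = rr_pick(bufs, accept_ptr[j], n)
        matches[j] = b
        # Pointers advance one beyond the match only upon acceptance.
        accept_ptr[j] = b % n + 1
        grant_ptr[b] = j % n + 1
    return matches
```

Outputs left unmatched after an iteration would simply re-issue requests in the next iteration; deferring the grant-pointer update until acceptance is what desynchronizes the round-robin pointers across iterations.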
IV. PIPELINING PACKET SCHEDULING

Our simulation results show that the basic scheduling algorithm introduced above can achieve satisfactory average packet delay. However, in a high speed or ultra high speed environment, it may become difficult for the scheduler to compute a schedule within each single time slot. In such a case, we can pipeline the packet scheduling to relax the time constraint. Furthermore, by pipelining multiple low-complexity schedulers, we may achieve performance comparable to a scheduler with much higher complexity. In this section we present such a pipeline mechanism.

A. Background and Basic Idea

With pipelining, the computation of a schedule is distributed over multiple sub-schedulers and the computations of multiple schedules can overlap. Thus, the computation of a single schedule can span more than one time slot and the time constraint is relaxed. Another consideration here is related to fairness. By adopting the iSLIP algorithm in the third step (i.e., determining the matching between electronic buffers and switch outputs), the basic scheduling algorithm ensures that no connection between buffers and outputs is starved. However, there is no such guarantee at the flow level. In addition, as mentioned earlier, a packet that resides in the switch for too long may lead to packet dropping. To address this problem and achieve better fairness, it is generally a good idea to give certain priority to older packets during scheduling.

Combining the above two aspects, the basic idea of our pipelining mechanism can be described as follows. We label each flow based on the oldness of its head-of-flow packet. Among all flows destined to the same output, a flow whose head-of-flow packet has the oldest timestamp is called an oldest flow of that output. Note that there may be more than one oldest flow for an output. Similarly, the flows with the i-th oldest head-of-flow packets are called the i-th oldest flows. Instead of taking all flows into consideration, we consider only up to the k-th oldest flows for each output when scheduling packets from the electronic buffer to the switch output.

Fig. 2. Timeline of calculating the schedule for time slot t. (During time slot t − 2, new packets attempt to cut through and the outputs announce the buffer states; ss_1 and then ss_2 calculate the schedule S_t, which is executed in time slot t; an FDL provides the matching delay.)

This may sound a little surprising, but later we will see that the system can achieve good performance even when k is as small as 2. The procedure of determining a schedule is then decomposed into k steps, with step i handling the scheduling of the i-th oldest flows. By employing k sub-schedulers, the k steps can be pipelined. Like the basic scheduling algorithm, the pipelined algorithm maintains packet order, since only head-of-flow packets are qualified to be scheduled.

Next we present the pipeline mechanism in more detail. Basically, as in prioritized iSLIP [6], the flows are classified into different priorities; in our case the prioritization criterion is the oldness of a flow. By pipelining at the priority level, each sub-scheduler deals with only one priority level and does not have to be aware of the prioritization. Furthermore, since each sub-scheduler only works on a subset of all the scheduling requests, on average it converges faster than a single central scheduler. To explain how the mechanism works, we start with the simple case of k = 2, that is, using only the oldest flows and second oldest flows when scheduling. We will also show that with k = 2, a common problem in pipelined scheduling, called duplicate scheduling, can be eliminated in our mechanism. Later we extend the mechanism to allow an arbitrary k, and discuss potential challenges and solutions.

B. Case of k = 2

With k = 2, two sub-schedulers, denoted as ss_1 and ss_2, are needed to pipeline the packet scheduling. ss_1 tries to match buffers holding the oldest flows to the output ports, while ss_2 deals with buffers holding the second oldest flows. The timeline of calculating the schedule to be executed in time slot t, denoted as S_t, is shown in Fig. 2. The calculation takes two time slots to finish, from the beginning of time slot t − 2 to the end of time slot t − 1.
When time slot t starts, S_t is ready and will be physically executed during this time slot. In time slot t − 2, the cut-through operation for S_t is performed and the result is sent to the sub-schedulers, so that the sub-schedulers know in advance which output ports will not be occupied by cut-through packets at time t. To provide the delay necessary to realize pipelining, a fiber delay line with a fixed delay of two time slots is appended to each input port. As a result, newly arrived packets attempt to cut through at the beginning of time slot t − 2, but they do not physically cut through and occupy the corresponding output ports until time slot t. Later, in Section IV-D, we will discuss how this extra delay introduced by pipelining may be minimized. As mentioned in Section III-C, since the calculation of cut-through is very simple and can be done by iSLIP with one iteration, or SLIP, there is no need to pipeline this step.

At the same time as the cut-through operation, each output port checks the buffered packets from all flows and finds its oldest and second oldest flows, as well as the buffers in which these flows are stored. The outputs then announce to each buffer its state. The state of a buffer consists of two bits and has the following possible values: 0 if the buffer contains neither an oldest nor a second oldest flow for the output; 2 if the buffer contains a second oldest flow but no oldest flow; and 1 otherwise. A buffer is said to contain an i-th oldest flow of an output if it contains the head-of-flow packet of that flow. Note that state 1 actually covers two cases, i.e., the buffer has an oldest flow only, or has both an oldest and a second oldest flow. The point here is that we do not need to distinguish between these two cases, due to the fact that in a time slot at most one packet can be transmitted from a buffer to the switch output. If a buffer has an oldest flow for an output and a packet is scheduled from this buffer to the output, no more packets from other flows can be scheduled in the same time slot; on the other hand, if no packet from the oldest flow is scheduled to the output, no packet from the second oldest flows can be scheduled either, since otherwise a packet from the oldest flow should have been scheduled instead.
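The two-bit state announcement can be sketched as follows (names illustrative; `oldest` and `second_oldest` are assumed to be the precomputed sets of buffers holding a head-of-flow packet of an oldest / second oldest flow for this output):

```python
def announce_states(oldest, second_oldest, n_buffers):
    """Compute the 2-bit state one output announces to each buffer:
    1 if the buffer holds an oldest flow (possibly also a second oldest),
    2 if it holds a second oldest flow but no oldest flow, 0 otherwise."""
    states = {}
    for b in range(1, n_buffers + 1):
        if b in oldest:
            states[b] = 1      # oldest flow present; 2nd-oldest irrelevant
        elif b in second_oldest:
            states[b] = 2      # second oldest flow only
        else:
            states[b] = 0      # neither
    return states
```

With the Fig. 3 example below (buffers 1 and 2 hold oldest flows, buffer 3 a second oldest flow only), the tagged output would announce 1, 1, 2 to buffers 1, 2, 3 respectively.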
he oldes flow should have been scheduled insead Thus as long as a buffer conains an oldes flow for an oupu, we do no need o know wheher i conains a second oldes flow for ha oupu or no Fig 3 provides a simple example wih N = 3 ha shows how he announcing of oldes and second oldes flows works In his example, we focus on one agged oupu and hree flows associaed wih i As shown in he figure, packes p and p 2 arrive in he same ime slo bu from differen flows p 3 arrives following p 2 A few ime slos laer, p 4 belonging o flow 3 arrives We assume ha some ime laer p, p 2 and p 4 become he head-of-flow packe for he hree flows, respecively I can be seen ha flows and 2 are he oldes flows, and flow 3 is he second oldes flow As shown in he figure, assume ha p and p 2 are sored in buffers and 2, respecively, and boh p 3 and p 4 are in buffer 3 Then he agged oupu will make he announcemen as o buffers and 2, and 2 o buffer 3, which informs he sub-schedulers ha buffers and 2 have an oldes flow for his oupu, and buffer 3 has a second oldes flow bu no oldes flow for his oupu Afer receiving he resul of cuing-hrough operaion, and he announcemens from he oupus, sub-scheduler ss is now se o work Noe ha while he sub-schedulers work direcly wih buffers, hey essenially work wih flows, in paricular, head-of-flow packes, since hey are he only packes eligible for ransmission for he sake of mainaining packe order Denoe he se of oupu pors ha will no be occupied by cu-hrough packes a ime slo as O Wha ss does is o mach he oupu pors in O o he buffers conaining an oldes flow of hese oupu pors Theoreically, his process can be done by any biparie maching algorihm For simpliciy, he ilip algorihm
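As an illustration, the announcement logic can be sketched as follows (a minimal sketch; the per-output bookkeeping of head-of-flow timestamps and buffer locations is a hypothetical representation, not a structure prescribed by the text):

```python
def announce_states(head_of_flow, num_buffers):
    """Compute the 2-bit state an output announces to each buffer.

    head_of_flow: one (timestamp, buffer_id) pair per flow destined to
    this output, giving the arrival time of the flow's head-of-flow
    packet and the buffer holding it (hypothetical representation).
    Returns {buffer_id: state}, where 1 = holds an oldest flow,
    2 = holds a second oldest flow but no oldest flow, 0 = neither.
    """
    states = {b: 0 for b in range(num_buffers)}
    if not head_of_flow:
        return states
    stamps = sorted({ts for ts, _ in head_of_flow})
    oldest = stamps[0]
    second = stamps[1] if len(stamps) > 1 else None
    oldest_bufs = {b for ts, b in head_of_flow if ts == oldest}
    for b in oldest_bufs:
        states[b] = 1                  # state 1 subsumes "has both"
    for ts, b in head_of_flow:
        if ts == second and b not in oldest_bufs:
            states[b] = 2
    return states
```

With the Fig. 3 example (flows 1 and 2 arriving at time 0 into buffers 1 and 2, flow 3 arriving at time 5 into buffer 3, 0-indexed here), the output would announce 1, 1, 2.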

Fig. 3. An example of how an output makes the announcement. The information of all packets that are in the buffer and destined for the output port is maintained for each output port. Based on that information, an output can find the oldest and second oldest flows, and where the head-of-flow packets are buffered. Then it can make the announcement accordingly.

In each iteration of the iSLIP algorithm, if there is more than one buffer requesting the same output port, ss_1 decides which of them the output should grant. Then, in case a buffer is granted by multiple output ports, ss_1 determines which grant the buffer should accept. The decisions are made based on the round-robin pointers maintained for each output port and buffer. The number of iterations to be executed depends on many factors, such as the performance requirement, switch size, traffic intensity, etc. Nevertheless, as mentioned earlier, it can be expected that the result will converge faster than that of a single central scheduler, since a sub-scheduler handles only a subset of all the scheduling requests.

ss_1 has one time slot to finish its job. At the beginning of time slot t-1, ss_1 sends its result to the output ports so that the output ports can update the VIQs and announce the latest buffer states. Meanwhile, ss_1 relays the result to ss_2. The functionality of ss_2 is exactly the same as that of ss_1, i.e., matching output ports to buffers according to some pre-chosen algorithm. The difference is that ss_2 only works on output ports that are in O_t and are not matched by ss_1, and on buffers that are announced with state 2 by at least one of these output ports. When ss_2 finishes the job at the end of time slot t-1, the matching based on which the switch will be configured in time slot t is ready. Meanwhile, the packets that arrived at the beginning of time slot t-2 have gone through the two-time-slot-delay FDLs and reached the switch input. In time slot t, the buffers are notified which packet to send, and the switch is configured accordingly. Packets are then transmitted to the switch output, either directly from the switch input or from the electronic buffer.

Fig. 4. The pipelined scheduling procedure for k = 2.

The complete picture of the pipelined packet scheduling for k = 2 is shown in Fig. 4. As mentioned earlier, S_t is the schedule executed in time slot t; S_t^i denotes the part of S_t that is computed by sub-scheduler ss_i during time slot t-k+i-1.

A potential problem with pipelined scheduling algorithms is that it is possible for a packet to be included in multiple schedules, i.e., to be scheduled more than once. This is called duplicate scheduling. It could occur under two different conditions: 1) in the same time slot, different schedulers may try to include the same packet in their respective schedules, since a scheduler is not aware of the progress at other schedulers in the same time slot; 2) with pipelining, there is usually a delay between a packet being included in a schedule and the schedule being physically executed. During such an interval the packet may be accessed by another scheduler that works on the schedule for a different time slot. In other words, a scheduler may try to schedule a packet that was already scheduled by another scheduler but has not been physically transmitted yet.

Duplicate scheduling of a packet leads to a waste of bandwidth resources, which consequently causes underutilization of bandwidth and limits throughput. In an input-queued switch, when a packet p is granted for transmission more than once by different sub-schedulers, the extra grants may be used to transmit the packets behind p in the same VOQ if the VOQ is backlogged. On the other hand, if the VOQ is empty, all but one grant are wasted. With the OpCut switch architecture, the consequence of duplicate scheduling is even more serious, in that extra grants for a packet cannot be used to transmit packets behind it in the same buffer. This is due to the fact that in an OpCut switch packets from the same flow may be distributed to different buffers, and a buffer may contain packets from different flows. Duplicate scheduling is apparently undesirable but is usually difficult to avoid in pipelined algorithms. For example, the algorithms in [7],
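The timing of the pipeline can be summarized by a small helper (our illustrative sketch of the Fig. 4 timetable, not code from the paper): since ss_i computes S_t^i during time slot t - k + i - 1, in a given slot tau, ss_i is contributing to the schedule for slot tau + k - i + 1.

```python
def pipeline_targets(tau, k=2):
    """For each sub-scheduler ss_i (i = 1..k), return the time slot t of
    the schedule part S_t^i it computes during slot tau, following the
    pipeline of Fig. 4: ss_i works on S_t^i in slot t - k + i - 1."""
    return {i: tau + k - i + 1 for i in range(1, k + 1)}
```

For k = 2 at slot tau = 5, ss_1 works on S_7^1 and ss_2 on S_6^2; ss_k always finishes S_t^k at the end of slot t - 1, so S_t is ready when slot t starts.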
[8], [9] all suffer from this problem, even with only two-step pipelining. It was proposed in [9] to use pre-filter and post-filter functions to reduce duplicate scheduling. However, on one hand, these functions are quite complex, and on the other hand, the problem cannot be eliminated even with them. The difficulty is rooted in the nature of pipelining: schedulers may have to work with dated information, and the progress at one scheduler is not transparent to the other schedulers. Fortunately, as will be seen next, when k = 2 our mechanism manages to overcome this difficulty and completely eliminates duplicate scheduling.

First of all, it is worth noting that the oldness of a flow is solely determined by the timestamp of its head-of-flow packet. Thus we have the following simple but important lemma.

Lemma 1: Unless its head-of-flow packet departs, a flow cannot become younger.

Next we deal with the first condition that may lead to duplicate scheduling. That is, we show that in any time slot the two sub-schedulers will not include the same packet in their respective schedules. In fact we have an even stronger result here, as shown by the following theorem:

Theorem 1: During any time slot, sub-schedulers ss_1 and ss_2 will not consider the same flow when computing their schedules. In other words, if we denote F_t^i as the set of flows that ss_i takes into consideration in time slot t, then F_t^1 ∩ F_t^2 = ∅ for any t ≥ 0.

Proof: First note that for t = 0 there is no second oldest flow, so F_0^2 = ∅ and the theorem holds. Now assume that for some t > 0 the theorem held up to time slot t-1 but not in time slot t. In other words, there exists a flow f such that f ∈ F_t^1 and f ∈ F_t^2. Note that f ∈ F_t^2 indicates that f was not an oldest flow at time t-1. Thus at t-1 there existed at least one flow that was older than

f and destined to the same output as f. Denote such a flow as f'; then f' ∈ F_{t-1}^1, since it was an oldest flow at that time. Besides, it can be derived that no packet from f' was scheduled by ss_1 in time slot t-1. Otherwise, the corresponding output port would be matched, and at time t, ss_2 would not consider any flow associated with that output, including f. Furthermore, since f' ∈ F_{t-1}^1, it follows that f' ∉ F_{t-1}^2, given that the theorem held in time slot t-1. Then neither ss_1 nor ss_2 could schedule any packet belonging to f' in time slot t-1. According to Lemma 1, f' is still older than f at time slot t. Consequently, f is not an oldest flow at t, and f ∈ F_t^1 cannot hold, which contradicts the assumption. This implies that the theorem must hold for time slot t if it held for time slot t-1, which proves the theorem for all t ≥ 0.

Next we consider condition 2). It is possible for condition 2) to occur between S_t^2 and S_{t+1}^2 due to the existence of a time glitch: the buffer states based on which S_{t+1}^2 is calculated are announced at the beginning of time slot t-1, and at that time S_t^2 has not been calculated yet. Thus it is possible that a packet is included in both S_t^2 and S_{t+1}^2. In contrast, S_t^2 and S_{t+1}^1 can never overlap, since the latter is calculated based on information announced after being updated with S_t^2. For the same reason, sub-schedules S_t^i and S_{t+x}^j would never include the same packet for any t ≥ 0 and i, j ∈ {1, 2}, as long as x > 1. Thus the task of eliminating condition 2) reduces to making sure that S_t^2 and S_{t+1}^2 do not overlap, which can be achieved as follows.

When an output makes its announcement, instead of the three possible states introduced earlier in this section, each buffer may also be in a fourth state denoted by value 3 (this is doable since the state of a buffer is 2 bits long), which means that this buffer contains a third oldest flow and no oldest or second oldest flow for this output. Furthermore, we call a flow a solo flow if it is the only i-th oldest flow, and a buffer a solo buffer for an output port if it contains a solo flow of that output port. Now suppose ss_2 matched an output port op to a buffer bf in S_t^2 based on the announcements in time slot t-2. Then when S_{t+1}^2 is being computed, bf is excluded from S_{t+1}^2 if op again announced bf as a state-2 buffer. On one hand, if there exists at least one buffer other than bf that was announced with state 2 by op in time slot t-1, ss_2 will work with these buffers. On the other hand, if bf was the solo buffer for op based on the announcement at time slot t-1, ss_2 will work on state-3 buffers instead. Consequently, we have the following theorem.

Theorem 2: The method introduced above ensures that S_t^2 and S_{t+1}^2 will not introduce duplicate scheduling of a packet.

Proof: First, S_t^2 and S_{t+1}^2 may include the same packet only if ss_2 matches a buffer to the same output port in both S_t^2 and S_{t+1}^2. Hence it is assumed that buffer bf is matched to output port op in both time slots by ss_2. (As a reminder, S_t^2 is calculated in time slot t-1 based on output announcements made in time slot t-2.) For this to occur, the state of bf announced at time slot t-1 can only be 3 according to the above method. Besides, bf cannot be a state-1 buffer of op for time slots t-2 and t-1, since otherwise bf would not be considered by ss_2. Then the states of bf announced by op at time slots t-2 and t-1, based on which S_t^2 and S_{t+1}^2 are calculated respectively, have only two possible combinations: 2 at time slot t-2 and 3 at time slot t-1 ({2, 3}), or 3 at time slot t-2 and 3 at time slot t-1 ({3, 3}). We will show that under neither of the combinations could duplicate scheduling occur.

{2, 3}: In this case, by matching bf to op, S_t^2 actually schedules to op the head-of-flow packet of some second oldest flow announced by op at time slot t-2; that head-of-flow packet is buffered in bf. Similarly, S_{t+1}^2 schedules to op the head-of-flow packet of a third oldest flow announced at time slot t-1. Denote the two head-of-flow packets as p_a and p_b, and the two flows as f_a and f_b. On one hand, if flows f_a and f_b are different, packets p_a and p_b must be different. On the other hand, if f_a and f_b are the same flow, p_a and p_b are still different according to Lemma 1, since the flow becomes younger (second oldest at time slot t-2 and third oldest at time slot t-1).

{3, 3}: Given that the state of bf is announced as 3 at time slot t-1 but ss_2 takes it into consideration when computing S_{t+1}^2, it must be true that in S_t^2, ss_2 grants a buffer
with a second oldest flow of op announced at time slot t-2, and that buffer is a solo buffer of op, which cannot be bf, whose state announced at time slot t-2 is 3. This contradicts the assumption that bf is matched to output port op in both time slots by ss_2.

Combining the two cases, the theorem is proved.

By now, duplicate scheduling has been completely ruled out in our mechanism. It is worth pointing out that the scheduler will not omit any packet in the buffer. First, the scheduler always maintains packet order within a flow; therefore, a packet will not get skipped within a flow and will eventually become the head-of-flow packet. Second, a flow will either have its head-of-flow packet scheduled or, as Lemma 1 indicates, will eventually become an oldest flow. Since all oldest flows are serviced in a round-robin manner, it is guaranteed that no flow will get starved if the switch is not overloaded, and any head-of-flow packet will be scheduled within one round-robin cycle.

C. Case of k > 2

We now extend our result for k = 2 to the case where k is an arbitrary integer between 3 and N. The system performance can be improved at the cost of extra sub-schedulers. While the basic idea remains the same as for k = 2, there are a few implementation details that need to be addressed when k becomes large. Duplicate scheduling can no longer be eliminated for an arbitrary k due to the increased scheduling complexity; nevertheless, we will propose several approaches to reducing it.

The basic pipelined scheduling procedure is given in Fig. 5. An FDL of length k is attached to each input port to provide the necessary delay for computing the schedules. k identical sub-schedulers, ss_1, ss_2, ..., ss_k, are employed, with ss_i dealing with buffers that contain an i-th oldest flow of some output port. Intermediate results are passed between adjacent sub-schedulers and used to update the VIQ status. The computing of the schedule to be executed in time slot t spans k time slots, from the beginning of time slot t-k to the end of time slot t-1. The announcement of buffer states from an output port to the sub-schedulers can be done exactly the same way as for k = 2, except that the state of a buffer for an output is now of length ⌈log(k + 1)⌉ bits.
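The generalized announcement can be sketched in the same style as the k = 2 case (again with a hypothetical data layout of our own): the state of a buffer is the smallest rank i such that the buffer holds an i-th oldest flow of the output, or 0, which indeed fits in ⌈log(k + 1)⌉ bits.

```python
import math

def announce_states_k(head_of_flow, num_buffers, k):
    """State of each buffer for one output with k sub-schedulers: the
    smallest rank i (1 = oldest, 2 = second oldest, ...) among the
    flows the buffer holds, or 0 if it holds none of the k oldest.
    head_of_flow: (timestamp, buffer_id) per flow (hypothetical layout)."""
    states = {b: 0 for b in range(num_buffers)}
    rank = {ts: r for r, ts in enumerate(sorted({ts for ts, _ in head_of_flow}), 1)}
    for ts, b in head_of_flow:
        r = rank[ts]
        if r <= k and (states[b] == 0 or r < states[b]):
            states[b] = r
    return states

def state_bits(k):
    """Number of bits needed to encode the states 0..k."""
    return math.ceil(math.log2(k + 1))
```

For k = 2 this gives the two-bit encoding used earlier (with value 3 left over for the fourth state).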

Fig. 5. Pipelined scheduling procedure for an arbitrary k.

Fig. 6. A possible implementation of an FDL that can provide flexible delays to fit the needs of pipelines with different numbers of sub-schedulers. There are ⌈log K⌉ + 1 stages; the i-th stage is able to provide either zero delay or a delay of 2^(i-1) time slots.

We have addressed the solo buffer problem for k = 2 to eliminate duplicate scheduling. Namely, if sub-scheduler ss_2 matched a buffer bf to an output port op in S_t^2, it will not consider bf as a state-2 buffer for op when computing S_{t+1}^2 even if it was announced so. In case bf is the solo buffer of op, i.e., the buffer announced by op to contain its only second oldest flow, ss_2 will work on state-3 buffers for op, trying to keep the schedule work conserving. For an arbitrary k, the first part of the rule is still kept: if ss_i matched a buffer bf to an output port op in S_t^i, it will not consider bf as a state-i buffer for op when computing S_{t+1}^i. However, if bf is the solo buffer of op, ss_i will not turn to buffers with state i + 1. The reason is that, while this method involves only ss_2 when k = 2, it may cause a chain effect when k > 2: if ss_i sets to work on buffers with state i + 1 at some time, then ss_{i+1} needs to work on buffers with state i + 2 for the same schedule. In case there is only a solo buffer with state i + 1 and it is matched by ss_i again, then in the next time slot ss_i may have to work on buffers with state i + 2 and ss_{i+1} on buffers with state i + 3. The process could go on and become too complicated to implement. Therefore, if an output announced the same buffer as the solo buffer in two consecutive time slots, and ss_i matched this buffer in the earlier schedule, it will not try to match the output at all in the later one; in other words, we let ss_i be idle for that output in the corresponding time slot. By allowing a sub-scheduler to be idle for some output port in a certain time slot, we prevent the possibility that the sub-scheduler schedules a packet that was already scheduled and blocks the other sub-schedulers behind it in the pipeline from scheduling a packet to that output port.
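The idle rule can be sketched as a per-output filter (illustrative code of ours; the names are not from the paper): ss_i drops the buffer it matched for the previous schedule from the announced state-i buffers, and if nothing is left, that buffer was the solo buffer, so ss_i simply idles for this output instead of cascading to state-(i+1) buffers.

```python
def ssi_candidates(announced_state_i, matched_prev):
    """Buffers ss_i may still consider for one output: the buffers
    announced with state i, minus the one ss_i matched for the
    previous schedule (which may hold an already-scheduled packet)."""
    return [b for b in announced_state_i if b != matched_prev]

def ssi_step(announced_state_i, matched_prev):
    """Return ('idle', []) when the excluded buffer was the solo
    state-i buffer, avoiding the chain effect; otherwise the list of
    buffers ss_i can try to match for this output."""
    candidates = ssi_candidates(announced_state_i, matched_prev)
    return ("idle", []) if not candidates else ("match", candidates)
```

Going idle rather than cascading trades a little work conservation for a pipeline that never grows deeper than k states.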
Unfortunately, the cost is that Theorem 1 does not hold for k > 2. To see this, first note that F_t^i is essentially the set of the i-th oldest flows of every output port at the beginning of time slot t + 1 - i. For instance, F_t^1 is the set of the oldest flows at time slot t, and F_t^3 is the set of the third oldest flows at time slot t - 2. If there is a flow f such that f ∈ F_t^i, then it is one of the i-th oldest flows of some output port at time slot t + 1 - i. During the time interval, denoted as T, from time slot t + 1 - i to time slot t + 1 - j for some j < i, at most i - j flows for that output can be scheduled. Therefore, at time slot t + 1 - j, f is at least the (i - (i - j)) = j-th oldest flow. If f is indeed the j-th oldest flow, which can occur if and only if i - j flows that are older than f have been scheduled during T and all of them are solo flows, then f ∈ F_t^j holds. In that case, F_t^i ∩ F_t^j ≠ ∅, and ss_i and ss_j may schedule the same packet during time slot t. Nevertheless, as can be seen, the possibility that F_t^i overlaps with F_t^j is rather small and should not significantly affect the overall system performance. In fact, if we let P_r denote the probability that an output port op announces a buffer bf as the buffer which contains the solo second oldest flow and bf is later matched to op by ss_2 based on the announcement, then according to our simulations for k = 4, when the traffic intensity is as high as 0.9, P_r is less than 2%. The probability for the case of multiple solo flows is roughly exponential in P_r and thus is even smaller.

D. Adaptive Pipelining

We have discussed the mechanism to pipeline packet scheduling in the OpCut switch for any fixed k. In the following we enhance it by adding adaptivity. The motivation is that, in our mechanism, the extra delay introduced by pipelining is equal to the number of active sub-schedulers, k. When traffic is light, a small number of sub-schedulers may be sufficient to achieve satisfactory performance, or pipelining may not be necessary at all. In this case, it is desirable to keep k as small as possible to minimize the extra delay. On the other hand, when the traffic becomes heavy, more sub-schedulers are activated.
Although the delay of pipelining increases, more packets can now be scheduled to the switch output, since more packets are taken into consideration for scheduling thanks to the additional sub-schedulers.

The first step towards making the pipelined mechanism adaptive is to introduce flexibility into the FDLs attached to the switch input ports. Since k sub-schedulers working in pipeline require a k-time-slot delay of the newly arrived packets, the FDL needs to be able to provide integral delays between 0 and K time slots, where K is the maximum number of sub-schedulers that can be activated. Clearly, K ≤ N. A possible implementation of such an FDL is shown in Fig. 6. The implementation adopts the logarithmic FDL structure [22] and consists of ⌈log K⌉ + 1 stages. A packet encounters either no delay or a delay of 2^(i-1) time slots in stage i, depending on the input port at which it arrives at the switch of stage i and the state of that switch. Through different configurations of the switches, any integral delay between 0 and K can be provided.

The number of packet arrivals in each time slot is recorded, and the average over the most recent W time slots is calculated and serves as the estimator of the current traffic intensity. This average can be efficiently calculated in a sliding-window fashion: let A_i denote the number of packet arrivals in time slot i; then at the end of time slot t, the running average Ā is updated according to Ā_t = Ā_{t-1} + (A_t - A_{t-W})/W. An arbitrator decides whether a sub-scheduler needs to be turned on or off based on Ā. If during a certain number of consecutive time slots Ā remains larger than a preset threshold for the current value of k, an additional sub-scheduler is put into use. Similarly, if Ā drops below some threshold and does not bounce back within a certain time interval, an active sub-scheduler can be turned off.
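The two ingredients of the adaptive mechanism, the sliding-window intensity estimator and the binary stage selection of the logarithmic FDL, can be sketched as follows (our code; the all-idle initial window is an illustrative assumption):

```python
from collections import deque

class IntensityEstimator:
    """Sliding-window average of per-slot arrivals over W time slots,
    maintained incrementally as in the text:
    Abar_t = Abar_{t-1} + (A_t - A_{t-W}) / W."""
    def __init__(self, W):
        self.W = W
        self.window = deque([0] * W, maxlen=W)   # assume an all-idle history
        self.avg = 0.0

    def update(self, arrivals):
        oldest = self.window[0]                  # A_{t-W}, about to fall out
        self.window.append(arrivals)
        self.avg += (arrivals - oldest) / self.W
        return self.avg

def fdl_config(delay):
    """Stage settings for the logarithmic FDL of Fig. 6: stage i adds
    2**(i-1) slots of delay when enabled, so the configuration for an
    integral delay is simply its binary expansion (first entry = stage 1)."""
    stages = []
    i = 0
    while (1 << i) <= delay or not stages:
        stages.append(bool((delay >> i) & 1))
        i += 1
    return stages
```

The incremental update keeps the estimator O(1) per slot regardless of W, which is what makes a large, jitter-resistant window cheap.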

Fig. 7. An example of sub-schedulers being turned on and off.

The value of W can be adjusted to trade off sensitivity against reliability: if W is large, the averaging of traffic intensity is over a relatively long time period, and it is less likely that a small jitter will trigger the activation of an additional sub-scheduler; however, more time is then needed to detect a substantial increase in traffic intensity, and vice versa.

An example of adaptive pipelining is given in Fig. 7. The basic idea is the same for any k value, thus we only show the process from two sub-schedulers to three sub-schedulers and then back to two, to keep it neat. One marker in the figure indicates that a sub-scheduler is off, and another means the sub-scheduler is on but will be idle in that time slot. The arrows in the figure illustrate how the intermediate results are relayed among sub-schedulers at the transition points when a sub-scheduler is being turned on or off.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of the switch under two widely used traffic models: uniform Bernoulli traffic and non-uniform bursty traffic. Both models assume that the arrivals at an input port are independent of the other input ports. The uniform Bernoulli traffic model assumes that the packet arrivals at an input port form a Bernoulli process and that the destination of an arrived packet is uniformly distributed over all output ports. The non-uniform bursty traffic model assumes that an input port alternates between the on state and the off state, with the length of a state following a geometric distribution. If and only if it is in an on state, a packet arrives at the input port at the beginning of every time slot. Therefore the traffic intensity is given by l_on/(l_on + l_off), where l_on and l_off denote the average lengths of the on and off states in terms of time slots, respectively. All packets arriving during the same on period are destined for the same output port and form a burst. Same as in [2], l_on (or, equivalently, the average burst length) is set to 10 in our simulations. A packet arriving at I_i is destined to O_i with probability µ + (1 - µ)/N and is destined to O_j with probability (1 - µ)/N for j ≠ i, where µ is the unbalance factor and is set to 0.5, the value that results in the worst performance according to [20].

We have evaluated OpCut switches of different sizes with both non-pipelined and pipelined schedulers. Each simulation was run for 10^6 time slots. We implemented two instances of the proposed pipelining mechanism, denoted as p-k2-2SLIP and p-k4-2SLIP, respectively. Both of them are built on sub-schedulers executing two steps of iSLIP in each time slot: p-k2-2SLIP runs two such sub-schedulers and covers up to the second oldest flows of each output port, while p-k4-2SLIP runs four sub-schedulers and covers up to the fourth oldest flows. For comparison purposes, we implemented the basic non-pipelined scheduler running iSLIP as well, denoted as np-iSLIP. Also included in the simulations is the straightforwardly pipelined iSLIP scheduler (denoted as p-iSLIP), with sub-schedulers each of which executes one iteration of iSLIP in a time slot. Unlike the proposed pipelined schedulers, the straightforward approach is not aware of the duplicate scheduling problem.

A. Cut-Through Ratio

First we investigate the packet cut-through ratio, which indicates what portion of packets can cut through the switch without experiencing electronic buffering. Apparently, if only a tiny portion of packets could cut through, or packets could cut through only when the traffic intensity is light, the OpCut switch would not be very promising. From Fig. 8, we can see that when the load is light, the cut-through ratio is high with all schedulers under both traffic models and switch sizes. However, for the p-iSLIP schedulers, the ratio drops sharply with the increase in traffic intensity. For all the other simulated schedulers, the ratio decreases much more slowly, and stays above 60% under uniform Bernoulli traffic and 30% under bursty non-uniform traffic even when the load rises to 0.9.
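For concreteness, the bursty non-uniform model can be sketched as a generator (our code; the parameter names and RNG seed are ours). On/off period lengths are geometric with means l_on and l_off, every slot of an on period carries one packet, all packets of a burst share one destination, and the destination is O_i with probability mu + (1 - mu)/N, any other port with probability (1 - mu)/N:

```python
import random

def bursty_arrivals(i, N, slots, l_on=10.0, load=0.9, mu=0.5, seed=1):
    """Return (time_slot, destination) pairs for input port I_i under
    the non-uniform bursty model described above.  l_off is derived
    from load = l_on / (l_on + l_off)."""
    rng = random.Random(seed)
    l_off = l_on * (1 - load) / load
    arrivals, t = [], 0

    def geometric(mean):
        n = 1
        while rng.random() > 1.0 / mean:    # continue w.p. 1 - 1/mean
            n += 1
        return n

    while t < slots:
        # picking the port's own output with prob. mu, else uniformly,
        # yields P(O_i) = mu + (1 - mu)/N and P(O_j) = (1 - mu)/N, j != i
        dest = i if rng.random() < mu else rng.randrange(N)
        for _ in range(geometric(l_on)):    # one packet per on-slot
            if t >= slots:
                break
            arrivals.append((t, dest))
            t += 1
        t += geometric(l_off)               # silent off period
    return arrivals
```

Feeding one such generator per input port reproduces the per-port independence the two models assume.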
We notice that under uniform Bernoulli traffic there is a sharp drop in the cut-through ratio for both pipelined schedulers. For the 16 × 16 switch, the drop occurs at 0.93 load for p-k2-2SLIP and 0.95 load for p-k4-2SLIP; for the 64 × 64 switch it occurs at slightly higher loads. As will be confirmed shortly by the average packet delay, these are the points at which the OpCut switch saturates with the respective pipelined scheduler. However, it is worth pointing out that a higher cut-through ratio does not necessarily imply better overall performance. In particular, throughput is not directly related to the cut-through ratio, as packets can always be transmitted from the buffers to the switch output. Thus, while the non-pipelined scheduler np-2SLIP results in a higher cut-through ratio than p-k2-2SLIP and p-k4-2SLIP under uniform Bernoulli traffic with load higher than 0.93, it can be seen from Fig. 9 that the pipelined schedulers actually achieve lower delay and higher throughput than np-2SLIP under that traffic model.

An interesting observation is that for bursty non-uniform traffic such sharp drops in the cut-through ratio do not exist. This is likely due to the nature of non-uniform traffic: except for the hotspot flows, most flows contain many fewer packets. For those packets, cutting through becomes easier compared to the uniform-traffic scenario, since whether a packet can cut through is independent of packets from other flows.

B. Average Packet Delay

Fig. 9 shows the average packet delay of the OpCut switch under different schedulers, traffic models and switch sizes. The ideal output-queued (OQ) switch is implemented to provide a lower bound on the average packet delay. It can be seen that the straightforward pipelining approach, p-iSLIP, performs very poorly due to underutilization of bandwidth caused by the du-