Scalable, Optimal Flow Routing in Datacenters via Local Link Balancing


Siddhartha Sen, David Shue, Sunghwan Ihm, and Michael J. Freedman
Princeton University
(Current affiliations: Microsoft Research; Google, Inc.)

ABSTRACT

Datacenter networks should support high network utilization. Yet today's routing is typically load agnostic, so large flows can starve other flows if routed through overutilized links. Even recent proposals like centralized scheduling or end-host multi-pathing give suboptimal throughput, and they suffer from poor scalability and other limitations. We present a simple, switch-local algorithm called LocalFlow that is optimal (under standard assumptions), scalable, and practical. Although LocalFlow may split an individual flow (this is necessary for optimality), it does so infrequently by considering the aggregate flow per destination and allowing slack in distributing this flow. We use an optimization decomposition to prove LocalFlow's optimality when combined with unmodified end hosts' TCP. Splitting flows presents several new technical challenges that must be overcome in order to interact efficiently with TCP and work on emerging standards for programmable, commodity switches. Since LocalFlow acts independently on each switch, it is highly scalable, adapts quickly to dynamic workloads, and admits flexible deployment strategies. We present detailed packet-level simulations comparing LocalFlow to a variety of alternative schemes, on real datacenter workloads.

Categories and Subject Descriptors: C.2.2 [Computer-Communication Networks]: Network Protocols - Routing protocols

Keywords: Flow routing; Datacenter networks; Local algorithms; Optimization decomposition

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s). CoNEXT'13, December 9-12, 2013, Santa Barbara, California, USA. ACM /13/12.

1. INTRODUCTION

The growth of popular Internet services and cloud-based platforms has spurred the construction of large-scale datacenters containing (hundreds of) thousands of servers, leading to a rash of research proposals for new datacenter networking architectures. Many such architectures (e.g., [1, 20]) are based on Clos topologies [13]; they primarily focus on increasing bisection bandwidth, or the communication capacity between any bisection of the end hosts. Unfortunately, even with full bisection bandwidth, the utilization of the core network suffers when large flows are routed poorly, as collisions with other flows can limit their throughput even while other, less utilized paths are available (see Figure 2).

The problem of simultaneously routing flows through a capacitated network is the multi-commodity flow (MCF) problem. This problem has been studied extensively by both the theoretical and networking systems communities. Solutions deployed in datacenters today are typically load agnostic, however, such as Equal-Cost Multi-Path (ECMP) [23] and Valiant Load Balancing (VLB) [42]. More recently, the networking community has proposed a series of load-aware solutions including both centralized solutions (e.g., [2, 7]), where routing decisions are made by a global scheduler, and distributed solutions, where routing decisions are made by end hosts (e.g., [26, 43]) or switches (e.g., [27]). As we discuss in §2, these approaches have limitations.
Centralized solutions like Hedera [2] face serious scalability challenges with today's datacenter workloads [6, 28]. End-host solutions like MPTCP [43] offer greater parallelism but cannot predict downstream collisions, forcing them to continuously react to congestion. Switch-local solutions like FLARE [27] scale well but are ill-suited to datacenters. Most of the existing solutions do not split flows across multiple paths, making them necessarily (and significantly) suboptimal [18], as our evaluation confirms. But splitting flows is problematic in practice because it causes packet reordering, which in the case of TCP may lead to throughput collapse [3]. Solutions that do split flows are either load agnostic [11, 15], suboptimal and complicated [19, 43], or rely on specific traffic patterns [27].

We argue that switch-local solutions hold the best promise for handling today's high-scale and dynamic datacenter traffic patterns optimally. We present LocalFlow, the first practical switch-local algorithm that routes flows optimally in symmetric networks, a property we define later. Most proposed datacenter architectures (e.g., fat-trees [1, 20, 31]) and real deployments satisfy the symmetry property. Our optimality proof decomposes the MCF problem into two components, one of which is essentially solved by end hosts' TCP, while the other component is solved locally at each switch by LocalFlow. In fact, a naïve scheme called PacketScatter [11, 15], which essentially round-robins packets over a switch's outgoing links, also solves the latter component. However, PacketScatter is load agnostic: it splits every flow, which causes packet reordering and increases flow completion times, and it does not handle network failures well.

LocalFlow overcomes these limitations with the following insights. By considering the aggregate flow to each destination, rather than individual transport-level flows, we split at most L - 1 flows, where L is the number of candidate outgoing links (typically <12). By further allowing splitting to be approximate, using a slack parameter δ ∈ [0,1], we split even fewer flows (or possibly none!). Figure 1 illustrates these ideas. In the limit, setting δ = 1 yields a variant of LocalFlow that schedules flows as indivisible units; we call this variant LocalFlow-NS ("no split"). Like PacketScatter, LocalFlow proactively avoids congestion, allowing it to automatically cope with traffic unpredictability. However, by using flexible, load-aware splitting, LocalFlow splits far fewer flows and can even tolerate failures and asymmetry in the network.

The benefits of a switch-local algorithm are deep. Because it requires no coordination between switches, LocalFlow can operate at very small scheduling intervals at an unprecedented scale. This allows it to adapt to highly dynamic traffic patterns. At the same time, LocalFlow admits a wide variety of deployment options for its control-plane logic, from running locally on each switch's CPU, to running on a single server for the entire network. In all cases, the scheduling performed for each switch is independent of the others.

Splitting flows introduces several technical challenges in order to achieve high accuracy, use modest forwarding table state, and interact properly with TCP. (Although we focus on TCP, we also discuss how to use LocalFlow with UDP traffic.) Besides splitting infrequently, LocalFlow employs two novel techniques to split flows efficiently. First, it splits individual flows spatially for higher accuracy, by installing carefully crafted rules into switches' forwarding tables that partition a monotonically increasing sequence number. Second, it supports splitting at multiple resolutions to control forwarding table expansion, so rules can represent groups of flows, single flows, or subflows. Our mechanisms for implementing multi-resolution splitting use existing (for LocalFlow-NS) or forthcoming (for LocalFlow) features of OpenFlow-enabled switches [34, 37].

Given the forthcoming nature of one of these features, and our desire to evaluate LocalFlow at large scale, our evaluation focuses on simulations. We use a full packet-level network simulator [43] as well as real datacenter traces [6, 20]. Our evaluation shows that LocalFlow achieves near-optimal throughput, outperforming ECMP by up to 171%, MPTCP by up to 19%, and Hedera by up to 23%. Compared to PacketScatter, which splits all flows, it split less than 4.3% of flows on a real switch packet trace and achieved 11% lower flow completion times. By modestly increasing the duplicate-ack threshold of end hosts' TCP, LocalFlow avoids the adverse effects of packet reordering. Interestingly, the high accuracy of its spatial splitting is crucial, as even slight load imbalances significantly degrade throughput (e.g., by 17%). Our evaluation also uncovered several other interesting findings, such as the high throughput of LocalFlow-NS on VL2 topologies [20].

We next compare LocalFlow to the landscape of existing solutions. We define our network architecture as well as the symmetry property in §3. We present the LocalFlow algorithm in §4 and our multi-resolution splitting technique in §5. We conduct a theoretical analysis of LocalFlow in §6 and evaluate it in §7. We address some deployment concerns in §8 and then conclude.

Figure 1: A set of flows to the same destination arrives at switch S. PacketScatter (left) splits every flow, whereas LocalFlow (right) distributes the aggregate flow, and only splits an individual flow if the load imbalance exceeds δ.

2. EXISTING APPROACHES

We discuss a broad sample of existing flow routing solutions along two important axes, scalability and optimality, while comparing them to LocalFlow. Scalability encompasses a variety of metrics, including forwarding table state at switches, network communication, and scheduling frequency. Optimality refers to the maximum flow rates achieved relative to optimal routing.

Centralized solutions typically run a sequential algorithm at a single server [2, 7, 9]. These solutions lack scalability, because the cost of collecting flow information, computing flow paths, and deploying these paths makes it impractical to respond to dynamic workloads. Indeed, coordinating decisions in the face of traffic burstiness and unpredictability is a serious problem [7, 20]. The rise of switches with externally-managed forwarding tables, such as OpenFlow [34, 37], has enabled solutions that operate at faster timescales. For example, Hedera's scheduler runs every 5 seconds, with the potential to run at subsecond intervals [2], and MicroTE's scheduler runs each second [7]. But recent studies [6, 28] have concluded that the size and workloads of today's datacenters require parallel route setup on the order of milliseconds, making a centralized OpenFlow solution infeasible even in small datacenters [14]. This infeasibility motivated our pursuit of a parallel solution.

End-host solutions employ more parallelism, and most give provable guarantees. TeXCP [26] and TRUMP [22] dynamically load-balance traffic over multiple paths between pairs of ingress-egress routers (e.g., MPLS tunnels) established by an underlying routing architecture. DARD [44] is a similar solution for datacenter networks that controls paths from end hosts. (We discuss MPTCP further below.) These solutions explicitly model paths in their formulation, though they limit the number of paths per source-destination pair to avoid exponential representation and convergence issues. Since end-host solutions lack information about other flows in the network, they must continuously react to congestion on paths and rebalance load.

Switch-local solutions have more visibility of active flows, especially at aggregation and core switches, but still lack a global view of the network. They achieve high scalability and do not need to model or rely on per-path statistics. For example, REPLEX [16] gathers (aggregate) path information using measurements on adjacent links and by exchanging messages between neighbors. None of the above solutions split individual flows, however, and hence cannot produce optimal results, since the unsplittable MCF problem is NP-hard and admits no constant-factor approximation [18].

MPTCP [39, 43] is an end-host solution that splits a flow into subflows and balances load across the subflows via linked congestion control. It uses two levels of sequence numbers and buffering to handle reordering across subflows. DeTail [45] modifies switches to do per-packet adaptive load balancing based on queue occupancy, in order to reduce the tail of flow completion times. It relies on layer-2 backpressure and modifications to end hosts' TCP to avoid congestion and handle reordering. Geoffray and Hoefler [19] propose an adaptive source-routing scheme that uses layer-2 backpressure and probes to evaluate alternative paths.

Figure 2: A FatTree network with 4-port switches. VL2 is a variation on this topology. End hosts A, G, and I simultaneously transmit to E, F, and H and collide at switches Y and X, but there is sufficient capacity to route all flows at full rate. (The detail shows a typical switch: control plane and data plane, with forwarding tables, switch backplane, and output queues.)

Like DeTail, their scheme requires modifications to both end hosts and switches. FLARE [27] is a technique for splitting flows in wide-area networks that can be combined with systems like TeXCP. It exploits delays between packet bursts to route each burst along a different path while avoiding reordering.

It is instructive to compare our solution, LocalFlow, to the above schemes. Like all of them, LocalFlow splits individual flows, but whereas the above schemes tend to split every flow, LocalFlow splits very few flows. This is largely due to the fact that LocalFlow balances load proactively, giving it full control over how flows are scheduled and split, instead of reacting to congestion after the fact. Unlike most of the above schemes, LocalFlow is purely switch-local and does not modify end hosts. In this respect it is similar to FLARE, but unlike FLARE it splits flows spatially (e.g., based on TCP sequence numbers) and does not make timing assumptions. Our simulations show that spatial splitting significantly outperforms temporal splitting. Finally, LocalFlow achieves near-optimal routing in practice, which the other solutions have not demonstrated, despite being considerably simpler than they are. The only flow-splitting scheme that is simpler than LocalFlow is PacketScatter [11, 15], which we discussed earlier. LocalFlow can be viewed as a load-aware, efficient version of PacketScatter. We compare the schemes in detail while deriving LocalFlow's design.

There is a long history of theoretical algorithms for the MCF problem in each of the settings above: from centralized (e.g., [33]), to end-host (e.g., [4]), to switch-local (e.g., [3]). Although these algorithms give provably exact or approximate solutions, they are disconnected from practice for reasons we have previously outlined. For example, they assume flows can be split arbitrarily (and instantaneously) at any node in the network, whereas this is difficult to do in practice. LocalFlow, in contrast, achieves near-optimal performance in both theory and practice.

3. ARCHITECTURE

LocalFlow is a flow routing algorithm designed for datacenter networks. In this section, we describe the components of this network (§3.1), outline the main scheduling loop for each switch (§3.2), and define the networks on which LocalFlow is efficient (§3.3).

3.1 Components and deployment

Figure 2 shows a typical datacenter network of end hosts and switches on which LocalFlow might be deployed. The techniques we use are compatible with existing or forthcoming features of extensible switches, such as OpenFlow [34, 37], Juniper NOS [25], or Cisco AXP [12]. The architecture and capabilities of hardware switches differ significantly from those of end hosts, making even simple tasks challenging to implement efficiently. The detail in Figure 2 shows a typical switch architecture, which consists of a control plane and a data plane. The data plane hardware has multiple physical (e.g., Ethernet) ports, each with its own line card, which (when simplified) consists of an incoming lookup/forwarding table in fast memory (SRAM or TCAM) and an outgoing queue. The control plane performs general-purpose processing and can install or modify rules in the data plane's forwarding tables, as well as obtain statistics such as the number of bytes that matched a rule.
To handle high data rates, the vast majority of traffic must be processed by the data-plane hardware, bypassing the control plane entirely. The hardware capabilities, resources, and programmability of switches are continually increasing [35, 38]. Being a local algorithm, LocalFlow's demands on these resources are limited to its local view of network traffic, which is orders of magnitude smaller than that of a centralized scheduler [6, 28].

Since LocalFlow runs independently for each switch, it supports a wide variety of deployment options for its control plane logic. For example, it can run locally on each switch, using a rack-local scheduler with OpenFlow switches or a separate blade in the switch chassis of Juniper NOS or Cisco AXP switches. Alternatively, the number of these schedulers can be reduced and their locations changed to suit the network's scalability requirements. In the limit, a single centralized scheduler may be used. Regardless of the deployment strategy, LocalFlow's independence allows each switch to be scheduled by separate threads, processes, cores, or devices.

3.2 Main LocalFlow scheduling loop

LocalFlow runs a continuous scheduling loop for each switch. At the beginning of every interval, it:

1) Measures the rate of each active flow. This is done by querying the byte counter of each forwarding rule from the previous interval and dividing by the interval length.
2) Runs the scheduling algorithm using the flow rates from step 1 as input.
3) Updates the rules in the forwarding table based on the outcome of step 2, and resets all byte counters.

Steps 2 and 3 are described in §4 and §5, respectively. Note that step 1 relies on measurements from the previous interval to inform scheduling decisions in the current interval. Although traffic patterns may change between intervals, LocalFlow's design copes well with this unpredictability, as we shall see.
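To make the loop concrete, the following is a minimal Python sketch of one switch's scheduling loop. The switch handle and its methods (read_byte_counters, outgoing_links, install_rules, reset_byte_counters) are hypothetical stand-ins for the corresponding OpenFlow statistics and flow-table operations, and the interval length is a tunable assumption, not a value mandated by the design.

    import time

    INTERVAL = 0.005  # scheduling interval in seconds (5ms, as used in Section 7.2)

    def scheduling_loop(switch, schedule):
        """Continuous per-switch LocalFlow control loop (sketch).

        `schedule` is the algorithm of Section 4: it maps measured flow
        rates and candidate outgoing links to a new rule assignment.
        """
        while True:
            start = time.monotonic()
            # Step 1: derive each rule's rate from the bytes it matched
            # during the previous interval.
            rates = {rule: matched_bytes / INTERVAL
                     for rule, matched_bytes in switch.read_byte_counters().items()}
            # Step 2: run the scheduling algorithm on the measured rates.
            new_rules = schedule(rates, switch.outgoing_links())
            # Step 3: install the updated rules and reset all byte counters.
            switch.install_rules(new_rules)
            switch.reset_byte_counters()
            # Sleep out the remainder of the interval before rescheduling.
            time.sleep(max(0.0, INTERVAL - (time.monotonic() - start)))
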

3.3 Symmetric networks

Although LocalFlow can be run on any network, it only achieves optimal throughput on networks that satisfy a certain symmetry property. This property is defined as follows:

DEFINITION 1. A network is symmetric if for all source-destination pairs (s, d), all switches on the shortest paths between s and d that are the same distance from s have identical outgoing capacity to d.

In other words, any of these switches are equally good intermediate candidates for routing a flow between s and d. Using the example of Figure 2, switches Y and Z are both on a shortest path between (A, E), and both have one link of outgoing capacity to E.

Real deployments and most proposed datacenter architectures satisfy the symmetry property. For example, it is satisfied by fat-tree-like networks (e.g., [1, 20, 31]), which are Clos topologies [13] arranged as multi-rooted trees. FatTree [1] is a three-stage fat-tree network built using identical k-port switches arranged into three levels (edge, aggregation, and core) that supports full bisection bandwidth between $k^3/4$ end hosts. Figure 2 shows a 16-host FatTree network (k = 4). F10 [31] is a recent variant of FatTree that skews the connections between switch levels to achieve better fault tolerance, but is still symmetric by our definition. VL2 [20] modifies FatTree by using higher-capacity links between Top-of-Rack (ToR, i.e., edge), aggregation, and intermediate (i.e., core) switches. Unlike FatTree, the aggregation and intermediate switches form a complete bipartite graph in VL2. All of these networks can be oversubscribed by connecting more hosts to each edge/ToR switch, which preserves their symmetry property.

4. ALGORITHM LOCALFLOW

This section presents LocalFlow, our switch-local algorithm for routing flows in symmetric datacenter networks. It is invoked in step 2 of the main scheduling loop (§3.2). At a high level, LocalFlow attempts to find the optimal flow routing for the following maximum MCF problem:

\[
\begin{aligned}
\text{maximize:}\quad & \sum_i U_i(x_i) && (1)\\
\text{subject to:}\quad & \sum_{u:(u,v)\in E} f^{s,d}_{u,v} \;=\; \sum_{w:(v,w)\in E} f^{s,d}_{v,w} && \forall v \ne s,d,\ \forall (s,d)\\
& \sum_{u:(s,u)\in E} f^{s,d}_{s,u} \;=\; x_i && \forall i : s_i \to d_i\\
& \sum_{s,d} f^{s,d}_{u,v} \;\le\; c_{u,v} && \forall (u,v)\in E,\ \text{link capacity } c_{u,v}
\end{aligned}
\]

This formulation reflects the complementary roles LocalFlow and TCP play. LocalFlow balances the flow rates $f^{s,d}_{u,v}$ across links (u,v) between adjacent switches, for a fixed set of commodity send rates $x_i$. This technique is similar to, but more aggressive than, the original link-balancing technique of Awerbuch and Leighton [5]. The intuition is that if we split a flow evenly over equal-cost links along its path to a destination, then even if it collides with other flows midway, the colliding subflows will be small enough to still route using the available capacity. In a symmetric network with a fixed set of send rates, this is equivalent to minimizing the maximum link utilization:

\[
\min \; \max_{(u,v)\in E} \; \frac{\sum_{s,d} f^{s,d}_{u,v}}{c_{u,v}}.
\]

Link balancing on its own does not guarantee an optimal solution to the maximum MCF objective (1), which depends on the per-commodity (concave) utility functions $U_i$. Fortunately, LocalFlow can rely on the end hosts' TCP congestion control for this purpose. Using an idealized fluid model, it can be shown [32] that, assuming backlogged senders (i.e., senders have more data to send) and given a fixed routing matrix, TCP, in its various forms, maximizes the total network utility. By balancing the per-link flow rates, LocalFlow adjusts the flow routing in response to TCP's optimized send rates, while TCP in turn adapts to the new routing. We show how this interaction achieves the MCF optimum in §6.

We first describe a basic load-agnostic solution for link balancing called PacketScatter (§4.1). We then improve this solution to yield LocalFlow (§4.2). Finally, we discuss a simple extension to LocalFlow that copes with network failures and asymmetry (§4.3).
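Before turning to the algorithms, here is a toy illustration of the min-max objective above (our own example, not from the paper): three flows to one destination, with rates given as fractions of link capacity, cross a switch with two equal-cost uplinks. No unsplittable assignment beats a maximum utilization of 0.8, but splitting the large flow achieves the balanced optimum of 0.7.

    def max_link_utilization(assignment):
        # assignment: per-link lists of flow rates, as fractions of link capacity
        return max(sum(rates) for rates in assignment)

    flows = [0.8, 0.3, 0.3]  # total = 1.4, so the balanced optimum over 2 links is 0.7

    unsplit_best = [[0.8], [0.3, 0.3]]     # best routing without splitting
    with_split = [[0.7], [0.1, 0.3, 0.3]]  # the 0.8 flow split into 0.7 + 0.1

    print(max_link_utilization(unsplit_best))  # 0.8
    print(max_link_utilization(with_split))    # 0.7
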
4.1 Basic solution: PacketScatter

The simplest solution for link balancing is to split every flow over every equal-cost outgoing link of a switch ($f^{s,d}_{u,v} = f^{s,d}_{u,w}$). We call this scheme PacketScatter. PacketScatter round-robins packets to a given destination over the switch's outgoing links; it has been supported by Cisco switches for over a decade now [11]. Recent work by Dixit et al. [15] studies variants of the scheme that select a random outgoing link for each packet to reduce state. However, this approach is problematic because even slight load imbalances due to randomness can significantly degrade throughput, as our evaluation confirms (§7.6).

Although PacketScatter routes flows optimally, because it unconditionally splits every flow at individual-packet boundaries, it can cause excessive reordering at end hosts. These out-of-order packets can inadvertently trigger TCP fast-retransmit, disrupting throughput, or delay the completion of short-lived flows, increasing latency. On the upside, because the splitting is load agnostic, it is highly accurate and oblivious to traffic bursts and unpredictability. However, by the same token, it cannot adapt to partial network failures, since it will continue sending the same flow to under-capacitated subtrees.

4.2 LocalFlow

We obtain LocalFlow by applying three ideas to PacketScatter that remove its weaknesses while retaining its strengths. The pseudocode is given in Algorithm 1.

First, instead of unconditionally splitting every flow, we group the flows by destination d and distribute their aggregate flow rate evenly over the $L_d$ outgoing links (lines 2-6 of Algorithm 1). This corresponds to a simple variable substitution $f^d_{u,v} = \sum_s f^{s,d}_{u,v}$ in (1). This means that LocalFlow splits at most $|L_d| - 1$ times per destination. Function BINPACK does the actual splitting. It sorts the flows according to some policy (e.g., increasing rate) and successively places them into $|L_d|$ equal-sized bins (lines 17-25). If a flow does not fit into the current least loaded bin, BINPACK splits the flow (possibly unevenly) into two subflows, one which fills the bin and the other which rejoins the sorted list of flows (lines 20-21). When the function returns, the total flow to the destination has been evenly distributed.

Our second idea is to allow some slack in the splitting. Namely, we allow the $L_d$ bins to differ by at most a fraction δ ∈ (0,1] of the link bandwidth (line 19). (For simplicity, we overload δ to mean either this fraction or the actual flow rate it corresponds to, depending on context.) This not only reduces the number of flows that are split, but it also ensures that small flows of rate ≤ δ are never split. Note that small flows are still bin-packed (and hence scheduled) by the algorithm, and only the last such flow entering a bin may give rise to an imbalance. After BINPACK returns, LOCALFLOW ensures that larger bins are placed onto less loaded links (lines 7-10). This ensures that the links stay balanced to within δ even after all destinations have been processed, as proved in Lemma 6.2. Figure 1 illustrates the above two ideas. In the example shown, no flows are actually split by LocalFlow because they are accommodated by the δ slack, whereas PacketScatter splits every flow.

Since LocalFlow may split a flow over a subset of the outgoing links, possibly unevenly, we cannot use the (load-agnostic) round-robin scheme of PacketScatter to implement SPLIT.

 1  function LOCALFLOW(flows F, links L)
 2      dests D = { f.dest : f ∈ F }
 3      foreach d ∈ D do
 4          flows F_d = { f ∈ F : f.dest = d }
 5          links L_d = { l ∈ L : l is on a path to d }
 6          bins B_d = BINPACK(F_d, L_d)
 7          Sort B_d by increasing total rate
 8          Sort L_d by decreasing total rate
 9          foreach b ∈ B_d, l ∈ L_d do
10              Insert all flows in b into l
11      end

12  bins function BINPACK(flows F_d, links L_d)
13      δ = ...; policy = ...
14      bincap = (Σ_{f ∈ F_d} f.rate) / |L_d|
15      bins B_d = { |L_d| bins of capacity bincap }
16      Sort F_d by policy
17      foreach f ∈ F_d do
18          b = argmax_{b ∈ B_d} b.residual
19          if f.rate > b.residual + δ then
20              { f_1, f_2 } = SPLIT(f, b.residual, f.rate − b.residual)
21              Insert f_1 into b; Add f_2 to F_d by policy
22          else
23              Insert f into b
24          end
25      end
26      return B_d

Algorithm 1: Our switch-local algorithm for routing flows on fat-tree-like networks.

Instead, we introduce a new, load-aware scheme called multi-resolution splitting that splits traffic in a flexible manner, by installing carefully crafted rules into the forwarding tables of switches. These rules, along with their current rates (as measured in step 1 of the main scheduling loop), comprise the input set F to function LOCALFLOW. Multi-resolution splitting is discussed in §5. Even though LocalFlow's splitting strategy is load aware, it still uses local measurements to balance load proactively, which allows it to cope with traffic burstiness and unpredictability.

4.3 Handling failures and asymmetry

Perhaps surprisingly, many failures in a symmetric network can be handled seamlessly, because they do not violate the symmetry property. In particular, complete node failures (that is, failed end hosts or switches) remove all shortest paths between a source-destination pair that pass through the failed node. For example, if switch X in Figure 2 fails, all edge switches in the pod now have only one option for outgoing traffic: switch W. The network is still symmetric, so LocalFlow's optimality still holds. Indeed, even PacketScatter can cope with such failures.

Partial failures (that is, individual link or port failures, including slow, down-rated links) are more difficult to handle, because they violate the symmetry property. For example, consider when switch X in Figure 2 loses one of its uplinks. PacketScatter at the edge switches would continue to distribute traffic equally between switches W and X, even though X has half the outgoing capacity of W. Also, since PacketScatter splits every flow, more flows are likely to be affected by a single link failure. This results in suboptimal throughput. However, with a simple modification, LocalFlow is able to cope with this type of failure. When switch X experiences the partial failure, other LocalFlow schedulers can learn about it from the underlying link-state protocol (which automatically disseminates this connectivity information). The upstream schedulers determine the fraction of capacity lost and use this information to rebalance traffic sent to W and X, by simply modifying the bin sizes (bincap, line 14) used in Algorithm 1. In this case, LocalFlow sends twice as much traffic to W as to X. Note that this rebalancing may not be optimal. In general, determining the optimal rebalancing strategy requires non-local knowledge because the network is now asymmetric. A scheme similar to the above can be used to run LocalFlow in an asymmetric network.
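For readers who prefer running code, the following Python sketch transcribes Algorithm 1 directly. The Flow class and the split helper are illustrative stand-ins (real splitting installs the forwarding rules of §5 rather than creating objects), the sort policy is fixed to increasing rate as one example, and every link is assumed to lie on a path to every destination.

    from dataclasses import dataclass

    @dataclass
    class Flow:
        dest: str
        rate: float  # measured in step 1 of the scheduling loop (Section 3.2)

    def split(f, r1, r2):
        # Illustrative stand-in for SPLIT: in practice this installs the
        # subflow rules of Section 5 rather than creating Flow objects.
        return Flow(f.dest, r1), Flow(f.dest, r2)

    def bin_pack(flows_d, num_links, delta):
        """BINPACK: pack one destination's flows into num_links equal bins,
        splitting a flow only if it overflows the least-loaded bin by more
        than delta (delta expressed here as an absolute rate)."""
        bincap = sum(f.rate for f in flows_d) / num_links
        bins, load = [[] for _ in range(num_links)], [0.0] * num_links
        pending = sorted(flows_d, key=lambda f: f.rate)  # policy: increasing rate
        while pending:
            f = pending.pop(0)
            b = min(range(num_links), key=load.__getitem__)  # most residual capacity
            residual = bincap - load[b]
            if f.rate > residual + delta:
                f1, f2 = split(f, residual, f.rate - residual)
                bins[b].append(f1); load[b] += f1.rate
                pending.append(f2); pending.sort(key=lambda x: x.rate)  # f2 rejoins
            else:
                bins[b].append(f); load[b] += f.rate
        return bins, load

    def local_flow(flows, links, delta=0.0):
        """LOCALFLOW: per destination, bin-pack the aggregate flow, then map
        larger bins onto less-loaded links (keeping links within delta of
        each other, per Lemma 6.2)."""
        link_rate = {l: 0.0 for l in links}
        link_flows = {l: [] for l in links}
        for d in {f.dest for f in flows}:
            flows_d = [f for f in flows if f.dest == d]
            links_d = list(links)  # sketch: assume every link lies on a path to d
            bins, load = bin_pack(flows_d, len(links_d), delta)
            by_rate = sorted(range(len(bins)), key=load.__getitem__)            # incr.
            targets = sorted(links_d, key=link_rate.__getitem__, reverse=True)  # decr.
            for i, l in zip(by_rate, targets):
                link_flows[l].extend(bins[i])
                link_rate[l] += load[i]
        return link_flows

With delta set to a link's full capacity, the if-branch never fires and no flow is ever split, which corresponds to the LocalFlow-NS variant; the partial-failure extension of §4.3 would amount to using unequal per-link bin capacities in bin_pack.
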
5. MULTI-RESOLUTION SPLITTING

Multi-resolution splitting is our spatial splitting technique for implementing the SPLIT function in Algorithm 1. It splits traffic at different granularities by installing carefully crafted rules into the forwarding tables of emerging programmable switches [37]. Figure 3 illustrates each type of rule. These rules represent single flows and subflows (§5.1), but they can also represent metaflows (§5.2), i.e., groups of flows to the same destination. Since metaflow and subflow rules use partial wildcard matching, they must appear in the TCAM of a switch, which is scarcer and less power-efficient than SRAM. However, our simulations show that LocalFlow splits very few flows in practice, so only a few TCAM rules are needed; single-flow rules can be placed in SRAM.

5.1 Flows and subflows

To represent a single flow, we install a forwarding rule that exactly specifies all fields of the flow's 5-tuple. This uniquely identifies the flow and thus matches all of its packets. To split a single flow into two or more subflows, we use one of two techniques. Although these techniques may not be supported by current OpenFlow switches, the latest specifications [37] suggest that the functionality will appear soon.

The first technique extends a single-flow rule to additionally match bits in the packet header that change during the flow's lifetime, e.g., the TCP sequence number. To facilitate finer splitting at later switches, we group packets into contiguous blocks of at least t bytes, called flowlets, and split only along flowlet boundaries. Our notion of flowlets is spatial and thus different from that of FLARE [27], which crucially relies on timing properties. By carefully choosing which bits to match and the number of rules to insert, we can split flows with different ratios and flowlet sizes. For example, to split a flow evenly over L links with flowlet size t, we add L forwarding rules, one for each possible lg L-bit string whose least significant bit starts after bit lg t in the TCP sequence number.¹ Since TCP sequence numbers increase consistently and monotonically, this causes the flow to match a different rule every t bytes. Also, since initial sequence numbers are randomized, the flowlets of different flows are also desynchronized. Uneven splitting can be achieved in a similar way. For example, the subflow rules in Figure 3 split a single flow over three links with ratios (1/4, 1/4, 1/2) and t = 1024 bytes. By using more rules, we can support uneven splits of increasing precision. Since later switches along a path may need to further split subflows from earlier switches, they should use a smaller flowlet size than the earlier switches. For example, edge switches in Figure 2 may use t = 2 maximum segment sizes (MSS) while aggregation switches use t = 1 MSS. In general, smaller flowlet sizes lead to more accurate load balancing.

An alternative technique that avoids the need for flowlets is to associate a counter of bytes with each flow that is split, and increment it whenever a packet from that flow is sent. Such counters are common in OpenFlow switches [37]. The value of the counter is used in place of the TCP sequence number in the subflow rules of Figure 3. Since each switch uses its own counter to measure each flow, we no longer rely on contiguous bytes (flowlets) for downstream switches to make accurate splitting decisions. The counter method is also appropriate for UDP flows, which do not have a sequence number in their packet headers.

¹ Since TCP sequence numbers represent a byte offset, the bit string should actually start after bit lg(t · MSS), where MSS is the maximum size of a TCP segment.
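As an illustration of the spatial-splitting arithmetic, this hedged Python snippet generates wildcard match patterns over a 32-bit TCP sequence number for an even split over L links with flowlet size t, assuming both are powers of two; the string encoding of a rule is schematic, not a real OpenFlow message.

    def subflow_patterns(num_links: int, flowlet_bytes: int, seq_bits: int = 32):
        """Generate one wildcard pattern per link (sketch).

        Matches lg(num_links) bits of the TCP sequence number at bit
        positions [lg t, lg t + lg L), i.e., just above the bits that vary
        within a flowlet, so the flow changes links every `flowlet_bytes`
        bytes. Assumes both arguments are powers of two.
        """
        lg_l = num_links.bit_length() - 1      # lg L
        lg_t = flowlet_bytes.bit_length() - 1  # lg t
        patterns = []
        for link in range(num_links):
            bits = format(link, f'0{lg_l}b')   # selector bits, MSB first
            pat = ['*'] * seq_bits             # '*' = wildcard bit
            for i, b in enumerate(bits):
                # bit position (from LSB) of this selector bit:
                pos = lg_t + lg_l - 1 - i
                pat[seq_bits - 1 - pos] = b
            patterns.append(''.join(pat))
        return patterns

    # Example: split evenly over 4 links with 1024-byte flowlets.
    # Each pattern pins the two sequence-number bits at positions 10-11.
    print(subflow_patterns(4, 1024))
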

Figure 3: Multi-resolution splitting rules (M = metaflow, F = flow, S = subflow).

Type | Src IP | Src Port | Dst IP | Dst Port | TCP seq/counter | Link
M    | *      | *11      | E      | *        | *               | 1
M    | *      | *01      | E      | *        | *               | 2
M    | *      | *        | E      | *        | *               | 3
F    | A      | u        | F      | v        | *               | 1
S    | A      | x        | G      | y        | *0***********   | 1
S    | A      | x        | G      | y        | *10**********   | 2
S    | A      | x        | G      | y        | *11**********   | 3

Compared to the above techniques, temporal splitting techniques like FLARE are inherently less precise, because they rely on unpredictable timing characteristics such as delays between packet bursts inside a flow. For example, during a bulk data shuffle between MapReduce stages, there may be few if any intra-flow delays. This lack of precision leads to load imbalances that significantly degrade throughput, as shown in §7.6.

5.2 Metaflows

To represent a metaflow, we install a rule that specifies the destination IP field but uses wildcards to match all other fields. This matches all flows to the same target, saving forwarding table space, as illustrated by the third metaflow rule in Figure 3. To split a metaflow, we additionally specify some least significant bits (LSBs) in the source port field. In the example, the metaflow rules distribute all flows to target E over three links with ratios (1/4, 1/4, 1/2). The "all" rule is placed on the bottom to illustrate its lower priority (it captures the remaining 1/2), although in OpenFlow priorities are explicit. If there is a diversity of flows and source ports, this scheme splits the total flow rate by the desired ratios (approximately). But it may not do so if the distribution of source ports or flow sizes is unfavorable: for example, if there is a single large flow and many small flows. In such situations, metaflow rules can be combined with subflow rules for better accuracy. For example, a metaflow rule can be split using the subflow splitting technique. Note that this simultaneously splits all flows that match the rule. We did not use metaflow rules in our evaluation since we found LocalFlow's space utilization to be modest in practice (§7.4).

6. ANALYSIS

We begin by analyzing the local (per-switch) complexity of LocalFlow, and then prove its optimality. During each round, LocalFlow executes $O(|F| \log |F| + \sum_d |F_d| \log |F_d|) = O(|F| \log |F|)$ sequential steps if δ = 0, since it need not sort the bins and links in lines 7-8, where $|F_d|$ is the number of flows to destination d. If δ > 0, $O(|F| \log |F| + |F|\,|L| \log |L|)$ steps are executed, where $|L|$ is the number of outgoing links. Relative to the number of active flows, $|L|$ can be viewed as a constant. In terms of space complexity, LocalFlow maintains at least one rule per flow; this can be reduced to one rule per destination by using metaflows. Both of these numbers increase when flow rules are split. We measure LocalFlow's space overhead on a real datacenter workload in §7.4.

We now show that, in conjunction with TCP, LocalFlow maximizes the total network utility (1) in an idealized fluid model [10]. In the remainder of this section, we refer to this idealized Network Utility Maximization (NUM) model when discussing the properties of TCP with respect to the MCF problem. Since LocalFlow and TCP alternately optimize their respective variables, we first show that the master LocalFlow optimization adapts link flow rates $f^d_{u,v}$ to minimize the maximum link cost $\sum_d f^d_{u,v} / c_{u,v}$, for the commodity send rates $x_i$ determined by the slave TCP sub-problem.
Then we examine the optimality conditions for TCP and show how the link-balanced flow rates determined by LocalFlow lead to an optimal solution to our original MCF objective.

LEMMA 6.1. If δ = 0, algorithm LocalFlow routes the minimum cost MCF with fixed commodity send rates.

PROOF. The symmetry property from §3.3 implies that all outgoing links to a destination lead to equally-capacitated paths. Thus, the maximum load on any link is minimized by splitting a flow equally over all outgoing links; this is achieved by lines 6-10 of LocalFlow. No paths longer than the shortest paths are used, as they would intersect with a shortest path and thus add to its load. Since we can view multiple flows with the same destination as a single flow originating at the current switch, grouping does not affect the distribution of flow. Repeating this argument for each destination independently yields the minimum-cost flow.

When δ > 0, LocalFlow splits the total rate to a destination d over the L outgoing links, such that no link receives more than δ more flow rate than another. This process is repeated for all d ∈ D using the same set of links L. Then,

LEMMA 6.2. At the end of LocalFlow, the total rates per link are within an additive δ of each other.

PROOF. The lemma trivially holds when |L| = 1 because no splitting occurs. Otherwise, the proof is an induction over the destinations in D. Initially there are no flows assigned to links, so the lemma holds. Suppose it holds prior to processing a new destination. Let the total rates on the bins returned by BINPACK be $y_1, y_2, \ldots, y_L$ in increasing order; let the total rates on the links be $x_1, x_2, \ldots, x_L$ in decreasing order. After line 10, the total rates on the links are $x_1 + y_1, x_2 + y_2, \ldots, x_L + y_L$. If $1 \le i < j \le L$ are the links with maximum and minimum rate, respectively, then we have $(x_i + y_i) - (x_j + y_j) = (x_i - x_j) + (y_i - y_j) \le \delta$, since $y_i \le y_j$ and $x_i - x_j \le \delta$ by the inductive hypothesis. The case when $j < i$ is similar.

THEOREM 6.3. LocalFlow, in conjunction with end-host TCP, achieves the maximum MCF optimum.

PROOF. To show that LocalFlow's link-balanced flow rates enable TCP to maximize the maximum MCF objective (1), we turn to the node-centric NUM formulation [10] of the TCP sub-problem, adapted for the multi-path setting to allow flow splitting:

\[
\begin{aligned}
\text{maximize:}\quad & \sum_i U_i(x_i)\\
\text{subject to:}\quad & \sum_i x_i \sum_{p \in P_i : (u,v) \in p} \pi_p \;\le\; c_{u,v} && \forall (u,v) \in E\\
& \sum_{p \in P_i} \pi_p = 1 && \forall i
\end{aligned}
\]

Here, LocalFlow has already computed the set of flow variables $f^d_{u,v}$, which have been absorbed into the path probabilities $\pi_p$.

Each $\pi_p$ determines the proportion of commodity $x_i$'s traffic that traverses path p, where p is a set of links connecting source s to destination d. Since these variables are derived from the link flow rates, they implicitly satisfy the original MCF flow and send rate constraints (1). To examine the effect of LocalFlow on the MCF objective, we focus on the optimality conditions for the TCP sub-problem, which is solved using dual decomposition [10]. In this approach, we first form the Lagrangian $L(x, \lambda)$ by introducing dual variables $\lambda_{u,v}$, one for each link constraint:

\[
L(x,\lambda) \;=\; \sum_i \int f_i(x_i)\,dx_i \;-\; \sum_{u,v} \lambda_{u,v} \Big( \sum_i x_i \sum_{p:(u,v)\in p} \pi_p \;-\; c_{u,v} \Big)
\]

For generality, we define the TCP utility to be a concave function where $U_i(x_i) = \int f_i(x_i)\,dx_i$, as in [32], and $f_i$ is differentiable and invertible. Most TCP utilities (e.g., log) fall in this category [32]. Next, we construct the Lagrange dual function $Q(\lambda)$ maximized with respect to $x_i$:

\[
x_i^* = f_i^{-1}(\beta_i) \ \text{ when } \ \frac{\partial L}{\partial x_i} = 0, \qquad \beta_i = \sum_p \pi_p \sum_{(u,v)\in p} \lambda_{u,v} \qquad (2)
\]

\[
Q(\lambda) \;=\; \sum_i \Big( \int f_i(x_i^*)\,dx_i \;-\; f_i^{-1}(\beta_i)\,\beta_i \Big) \;+\; \sum_{u,v} \lambda_{u,v}\, c_{u,v}
\]

Minimizing Q with respect to λ gives both the optimal dual and primal variables, since the original objective is concave:

\[
\sum_i f_i^{-1}(\beta_i) \sum_{p:(u,v)\in p} \pi_p \;=\; c_{u,v} \ \text{ when } \ \frac{\partial Q}{\partial \lambda_{u,v}} = 0, \quad \forall (u,v)\in E \qquad (3)
\]

When (3) is satisfied, the system has reached the maximum network utility (1). TCP computes this solution in a distributed fashion using gradient ascent. End hosts adjust their local send rates $x_i$ according to implicit measurements of path congestion $\sum_{(u,v)\in p} \lambda_{u,v}$, and switches update their per-link congestion prices $\lambda_{u,v}$ (queuing delay) according to the degree of backlog.

According to the symmetry property, all nodes at the same distance from source s along the shortest paths must have links of equal capacity to nodes in the next level of the path tree. Thus, for all links from a node u to nodes (v, w, etc.) in the next level of a path tree, for any source-destination pair, we have:

\[
\sum_i f_i^{-1}(\beta_i) \sum_{p:(u,v)\in p} \pi_p \;=\; \sum_i f_i^{-1}(\beta_i) \sum_{p:(u,w)\in p} \pi_p \qquad (4)
\]

We know that the set of commodities that traverse these links are the same, since they are at the same level in the path tree. Thus, we can satisfy (3) by ensuring that the per-commodity values of (4) are equal. PacketScatter satisfies this trivially by splitting every commodity evenly across the equal-cost links ($f^{s,d}_{u,v} = f^{s,d}_{u,w}$), resulting in equal link probabilities. LocalFlow, on the other hand, groups commodities by destination when balancing the flow rate across links and only splits individual commodities when necessary. However, by the same argument for commodities, we know that the set of destinations reachable via the links are the same as well. Thus, if we group the commodities in (3) by destination d, then the condition is satisfied when:

\[
\sum_{i:\, s_i \to d} f_i^{-1}(\beta_i) \sum_{p:(u,v)\in p} \pi_p \;=\; \sum_{i:\, s_i \to d} f_i^{-1}(\beta_i) \sum_{p:(u,w)\in p} \pi_p \quad \forall d
\]

Since LocalFlow distributes the per-destination flow ($\sum_i x_i$) evenly across equal-cost links, i.e., $f^d_{u,v} = f^d_{u,w}\ \forall d$, we have:

\[
\sum_{i:\, s_i \to d} x_i \sum_{p:(u,v)\in p} \pi_p \;=\; \sum_{i:\, s_i \to d} x_i \sum_{p:(u,w)\in p} \pi_p \qquad (5)
\]

By substituting in (2), we arrive at the per-destination optimality condition for (3). Note that LocalFlow will continue to adjust flow rates to achieve (5) in response to TCP's optimized send rates (and vice versa). On each iteration, LocalFlow minimizes the maximum link utilization by balancing per-destination link flow rates, opening up additional headroom on each link for the current commodity send rates $x_i$ to grow. Between LocalFlow iterations, the TCP sub-problem maximizes its send rate objective to consume the additional capacity. Given a proper timescale separation between the master and slave problems, the distributed convex optimization process converges [10] to an optimal network utility. In practice, a proper timescale separation is on the order of a few RTTs, which in total is still <1ms for a typical datacenter. Thus, LocalFlow can safely use a scheduling interval of 1ms or greater. In general, the speed of convergence to the optimum depends on the variant of TCP being used.
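To see the master-slave interaction concretely, here is a toy numerical simulation under stated assumptions: log utilities (so $f_i^{-1}(\beta) = w_i/\beta$), two equal-capacity links to one destination, and alternating updates in which "TCP" performs gradient steps on send rates and queue prices while "LocalFlow" periodically rebalances each commodity's link split. It is our own illustration of the decomposition with untuned step sizes, not the paper's implementation.

    import numpy as np

    # Toy setting: 3 commodities to one destination share 2 equal links.
    C = np.array([1.0, 1.0])          # link capacities
    w = np.array([1.0, 1.0, 1.0])     # log-utility weights: U_i = w_i * log(x_i)
    x = np.full(3, 0.1)               # send rates (primal variables)
    lam = np.full(2, 1.0)             # per-link prices (dual variables)
    pi = np.full((3, 2), 0.5)         # per-commodity link split (LocalFlow's role)

    for step in range(2000):
        # Slave (TCP): x_i tracks f_i^{-1}(beta_i) = w_i / beta_i, where
        # beta_i is the congestion price of commodity i's split path.
        beta = pi @ lam
        x += 0.1 * (w / beta - x)
        # Links: prices rise with backlog (gradient step on the dual).
        link_load = x @ pi
        lam = np.maximum(1e-6, lam + 0.1 * (link_load - C))
        if step % 10 == 0:
            # Master (LocalFlow): on its slower timescale, nudge each
            # commodity's split away from the more loaded link.
            imbalance = link_load - link_load.mean()
            pi = np.clip(pi - 0.05 * np.outer(x / x.sum(), imbalance), 1e-3, 1.0)
            pi /= pi.sum(axis=1, keepdims=True)

    # Loads should end up near capacity and balanced across both links,
    # with rates roughly equal (about C.sum()/3 each) for equal weights.
    print("rates:", x.round(3), "link loads:", (x @ pi).round(3))
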
Note that the individual send rate utility functions $U_i$ can differ by source. This corresponds to end hosts using different TCP variants (e.g., Cubic, NewReno) or even their own application-level congestion control over UDP. As long as the utility functions are concave and the send rates are elastic and not unbounded or static, i.e., they can adjust to link congestion back pressure, then the optimal MCF conditions under LocalFlow hold. If fairness between send rates is an issue, then the provider must ensure that the end hosts employ some form of fairness-inducing congestion control (e.g., $U_i = \log$) [29].

7. EVALUATION

In this section, we evaluate LocalFlow to demonstrate its practicality and to justify our theoretical claims. Specifically, we answer the following questions:

- Does LocalFlow achieve optimal throughput? How does it compare to Hedera, MPTCP, and other schemes? (§7.2)
- Does LocalFlow tolerate network failures? (§7.3)
- Given the potential for larger rule sets, how much forwarding table space does LocalFlow use? (§7.4)
- Do smaller scheduling intervals give LocalFlow an advantage over centralized solutions (e.g., Hedera)? (§7.5)
- Is spatial flow splitting better than temporal splitting (e.g., as used by FLARE)? (§7.6)
- How well does LocalFlow manage packet reordering compared to PacketScatter, and what is its effect on flow completion time? (§7.7)

We use different techniques to evaluate LocalFlow's performance, including analysis (§6) and simulations on real datacenter traffic (§7.4), but the bulk of our evaluation (§7.2-7.7) uses a packet-level network simulator. Packet-level simulations allow us to isolate the causes of potentially complex behavior between LocalFlow and TCP (e.g., due to flow splitting), to test scenarios at a scale larger than any testbed we could construct, and to facilitate comparison with prior work. In fact, we used the same simulator codebase as MPTCP [39, 43], allowing direct comparisons.

7.1 Experimental setup

Simulations. We developed two simulators for LocalFlow. The first is a stand-alone simulator that runs LocalFlow on pcap packet traces. We used the packet traces collected by Benson et al. [6] from a university datacenter switch. To stress the algorithm, we simulated the effect of larger flows by constraining the link bandwidth of the switch.

Figure 4: Individual flow throughputs for a random permutation on a 1024-host FatTree network. (Axes: throughput (%) vs. rank of flow; legend, by decreasing average throughput: PacketScatter, LocalFlow (5ms), MPTCP (8), Hedera (5ms), ECMP, LocalFlow-NS (5ms).)

Figure 5: Individual flow throughputs for a stride permutation on a 1024-host FatTree network. (Legend: LocalFlow-NS (5ms), PacketScatter, LocalFlow (5ms), Hedera (5ms), MPTCP (8), ECMP.)

Figure 6: Individual flow throughputs for a random permutation on a 1000-host VL2 network. (Legend: LocalFlow-NS (5ms), PacketScatter, LocalFlow (5ms), MPTCP (8), Hedera (5ms), VLB.)

Our second simulator is based on htsim, a full packet-level network simulator written by Raiciu et al. [39, 43]. The simulator models TCP and MPTCP in similar detail to ns2, but is optimized for larger scale and high speed. It includes an implementation of Hedera's First Fit heuristic [2]. We modified and extended htsim to implement the LocalFlow algorithm. Notably, we added a switch abstraction that groups queues together and maintains a forwarding table based on the multi-resolution splitting rules defined in §5. We allowed the duplicate-ack (dup-ack) threshold of end-host TCP to be modified (the default is 3), but otherwise left end hosts unchanged. Changing the threshold is easy in practice (e.g., by writing to /proc/sys/ on Linux).

Topologies. We ran our experiments on the different fat-tree-like topologies described in §3, including:

- FatTree topology built from k-port switches [1]. We used 1024 hosts (k = 16) when larger simulations were feasible, and 128 hosts (k = 8) for finer analyses.
- VL2 topology [20]. We used 1000 hosts with 50 ToR, 20 aggregation, and 50 intermediate switches. Inter-switch links have 10 times the bandwidth of host-to-ToR links.
- Oversubscribed topologies, created by adding more hosts to edge/ToR switches in the above topologies. We used a 512-host, 4:1 oversubscribed FatTree network (k = 8).

All our networks were as large as or larger than those used by Raiciu et al. [39] for their packet-level simulations. Unless otherwise specified, we used 1000-byte packets, 1Gbps links (10Gbps inter-switch links for VL2), queues of 100 packets, and 1µs delays between queues.

TCP NewReno variants. We noticed in our simulation experiments that flows between nearby hosts of a topology sometimes suffered abnormally low throughput, even though they did not noticeably affect the average. We traced this problem to the NewReno variant used by htsim, called Slow-but-Steady [17], which causes flows to remain in fast recovery for a very long time when network round-trip times are low, as in datacenters and especially between nearby hosts. RFC 2582 [17] suggests an alternative variant of NewReno for such scenarios, called Impatient. After switching to this variant, the low-throughput outliers disappeared.

7.2 LocalFlow achieves optimal throughput

MapReduce-style workloads. We ran LocalFlow on a 1024-host FatTree network using a random permutation traffic matrix of long flows, i.e., each host sends a flow to one other host chosen at random without replacement. Given its topology, a FatTree network can run this workload at full bisection bandwidth. We used a scheduling interval of 5ms and increased the dup-ack threshold to accommodate reordering; these parameters are discussed later. We also ran PacketScatter, ECMP, Hedera with a 5ms scheduling interval, and MPTCP with 4 and 8 subflows per flow. Note that 5ms is an extremely optimistic interval for Hedera's centralized scheduler, being one to two orders of magnitude smaller than what it can actually handle [2, 39].
Figure 4 shows the throughput of individual flows in increasing order, with the legend sorted by decreasing average throughput. As expected, LocalFlow achieves near-optimal throughput for all flows, matching the performance of PacketScatter to within 1.4%. LocalFlow's main benefit over PacketScatter is that it splits fewer flows when there are multiple flows per destination, as we show later. Although LocalFlow-NS attempts to distribute flows locally, it does not split flows and so cannot avoid downstream collisions. It is also particularly unlucky in this case, performing worse than ECMP (typically their performance is similar).

MPTCP with 8 subflows achieves an average throughput that is 8.3% less than that of LocalFlow, and its slowest flow has 45% less throughput than that of LocalFlow. MPTCP with 4 subflows (not shown) performs substantially worse, achieving an average throughput that is 21% lower than LocalFlow's. This is because there are fewer total subflows in the network; effectively, it throws fewer balls into the same number of bins. ECMP has the same problem but much worse, because it throws N balls into N bins; this induces collisions with high probability, resulting in an average throughput that is 44% less than the optimal. For the remainder of our analysis, we use 8 subflows for MPTCP, the recommended number for datacenter settings [39]. Hedera's average throughput lies between that of MPTCP with 4 subflows and with 8 subflows, but exhibits much higher variance. Although not shown, Hedera's variance was ±28%, compared to ±14% for MPTCP with 4 subflows.

In general, Hedera does not cope well with a random permutation workload, which sends flows along different path lengths (most reach the core, some only reach aggregation, and a few only reach edge switches). If instead we guarantee that all flows travel to the core before descending to their destinations, Hedera performs much better. Figure 5 shows the results of a stride(N/2) permutation workload, where host i sends a flow to host i + N/2. All algorithms achieve higher throughput, and Hedera comes close to LocalFlow's performance, though its slowest flow has 49% less throughput than that of LocalFlow. Further, forcing all traffic to traverse the core incurs higher latency for potentially local communication, and yields worse performance in more oversubscribed settings. In fact, significant rack- or cluster-local communication is common in datacenter settings [6, 28], suggesting larger benefits for LocalFlow.

It may seem surprising that LocalFlow-NS has the highest average throughput in Figure 5, but this is due to the uniformity of the workload. LocalFlow-NS distributes the flows from each pod evenly over the core switches; since these flows target the same destination pod, the distribution is perfect. A similar effect arises when running a random permutation workload on the 1000-host VL2 topology, per Figure 6. In a VL2 network, aggregation and intermediate switches form a complete bipartite graph, thus it is only necessary to distribute the number of flows evenly over intermediate switches, which LocalFlow-NS does. In fact, LocalFlow-NS achieves optimal throughput for any permutation workload.

Dynamic, heterogeneous workloads. Real datacenters are typically oversubscribed, with hosts sending variable-sized flows to multiple destinations simultaneously. Using a 512-host, 4:1 oversubscribed FatTree network, we tested a realistic workload by having each host select a number of simultaneous flows to send from the VL2 dataset [20, Fig. 2]², with flow sizes also selected from this dataset. The flows ran in a closed loop, i.e., they restarted after finishing (with a new flow size). We ran LocalFlow with a 1ms scheduling interval and also allowed approximate splitting (δ > 0). We used a 1ms scheduling interval for Hedera as well, which again is extremely optimistic. Figure 7 shows results for the total throughput (total number of bytes transferred) and average flow completion times (which we discuss later in §7.7).

Figure 7: Total throughput and average flow completion time relative to ECMP, for a heterogeneous VL2 workload on a 512-host, 4:1 oversubscribed FatTree.

Scheme         | Total throughput | Avg. flow completion time
ECMP           | 0.0%             | 0.0%
Hedera         | -7.2%            | -17.0%
MPTCP          | +6.0%            | +28.7%
PS             | +12.7%           | +10.4%
LF (δ = 0)     | 0.0%             | -0.2%
LF (δ = 0.1)   | +10.9%           | -2.2%
LF (δ = 0.5)   | +7.2%            | -1.0%
LF-NS (δ = 1)  | +6.6%            | -1.9%

Using the VL2 distributions, there are over 12,000 simultaneous flows in the network. With this many flows, even ECMP's load-agnostic hashing should perform well due to averaging, and we expect all algorithms to deliver similar throughput; Figure 7 confirms this. Nevertheless, there are some interesting points to note. First, LocalFlow-NS outperforms ECMP because it intelligently distributes flows, albeit locally. In fact, its performance is almost as good as LocalFlow's, due to the large number of flows. LocalFlow does not appear to gain much from exact splitting. We believe this is because over 86% of flows in the VL2 distribution are smaller than 125KB; such flows are small enough to complete within a 1ms interval, so it may be counterproductive to move or split them midstream. On the other hand, splitting too approximately (δ = 0.5) also hurts LocalFlow's performance, because of the slight load imbalances incurred. δ = 0.1 strikes the right balance for this workload, achieving close to PacketScatter's performance. All LocalFlow variants outperform MPTCP.

Hedera achieves 7.17% less throughput than ECMP. This is likely due to the small flows mentioned above, which are large enough to be scheduled by Hedera, but better left untouched. In addition, as Raiciu et al. [39] observed, Hedera optimistically reserves bandwidth along a flow's path assuming the flow can fill it, but this bandwidth may go to waste in the current scheduling interval if the flow is unable to do so.

² We obtained the VL2 distributions by extracting plot data from the paper's PDF file.

Figure 8: Individual flow throughputs for a random permutation on a 128-host FatTree network with failed links. (Legend: LocalFlow (5ms), MPTCP (8), Hedera (5ms), ECMP, PacketScatter.)

7.3 LocalFlow handles failures gracefully

Although PacketScatter's throughput is competitive with LocalFlow's above, this is not the case when network failures occur. As discussed in §4.3, if an entire switch fails, PacketScatter is competitive with LocalFlow, but if failures are skewed (as one would expect in practice), PacketScatter's performance suffers drastically. Figure 8 shows the results of a random permutation on a 128-host FatTree network, when one aggregation switch (out of four) in each pod loses 3 of its 4 uplinks to the core. Upon learning of the failure, LocalFlow at the edge switches rebalances most outgoing traffic to the three other aggregation switches. From Figure 8, we see that LocalFlow and MPTCP deliver near-optimal throughput, whereas PacketScatter performs even worse than ECMP, achieving only 48% of the average throughput of LocalFlow.

7.4 LocalFlow uses little forwarding table space

LocalFlow distributes the aggregate flow to each destination, so if several flows share the same destination, the number of subflows (splits) per flow is small. With approximate splitting, even fewer flows are split due to the added slack. This is important because splitting flows increases the size of a switch's forwarding tables. To evaluate how much splitting LocalFlow does in practice, we ran our stand-alone simulator on a 3914-second TCP packet trace that saw 259,293 unique flows, collected from a 500-server university datacenter switch [6]. We used a scheduling interval of 5ms and different numbers of outgoing links, while varying δ.

Figure 9 (top) shows these results as a function of δ. Although LocalFlow splits up to 78% of flows when δ = 0 (using 8 links), this number drops to 21% when δ = 0.01 and to 4.3% when δ = 0.05. Thus, a slack of just 5% results in 95.7% of flows remaining unsplit! This is a big savings, because such flows do not require wildcard matching rules, and can thus be placed in an exact-match table in the more abundant and power-efficient SRAM of a switch. The average number of subflows per flow similarly drops from 3.54 when δ = 0 to 1.09 when δ = 0.05 (note the minimum is 1 subflow per flow). This number more accurately predicts how much forwarding table space LocalFlow will use, since it counts the total number of rules required. Thus, using 8 links and δ = 0.05, LocalFlow uses about 9% more forwarding table space than LocalFlow-NS, which only needs one rule per flow. Although PacketScatter creates almost 8 times as many subflows, it only needs to store a small amount of state per destination, of which there are at most 500 in this dataset. Later, we will see that PacketScatter pays for its excessive splitting in the form of longer flow completion times.


More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Internet Traffic Managers

Internet Traffic Managers Internet Traffc Managers Ibrahm Matta matta@cs.bu.edu www.cs.bu.edu/faculty/matta Computer Scence Department Boston Unversty Boston, MA 225 Jont work wth members of the WING group: Azer Bestavros, John

More information

A New Token Allocation Algorithm for TCP Traffic in Diffserv Network

A New Token Allocation Algorithm for TCP Traffic in Diffserv Network A New Token Allocaton Algorthm for TCP Traffc n Dffserv Network A New Token Allocaton Algorthm for TCP Traffc n Dffserv Network S. Sudha and N. Ammasagounden Natonal Insttute of Technology, Truchrappall,

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

State of the Art in Differentiated

State of the Art in Differentiated Outlne Dfferentated Servces on the Internet Explct Allocaton of Best Effort Packet Delvery Servce, D. Clark and W. Fang A Two bt Dfferentated Servces Archtecture for the Internet, K. Nchols, V. Jacobson,

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

CS 268: Lecture 8 Router Support for Congestion Control

CS 268: Lecture 8 Router Support for Congestion Control CS 268: Lecture 8 Router Support for Congeston Control Ion Stoca Computer Scence Dvson Department of Electrcal Engneerng and Computer Scences Unversty of Calforna, Berkeley Berkeley, CA 9472-1776 Router

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Advanced Computer Networks

Advanced Computer Networks Char of Network Archtectures and Servces Department of Informatcs Techncal Unversty of Munch Note: Durng the attendance check a stcker contanng a unque QR code wll be put on ths exam. Ths QR code contans

More information

ATYPICAL SDN consists of a logical controller in the

ATYPICAL SDN consists of a logical controller in the IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 25, NO. 6, DECEMBER 2017 3587 Mnmzng Flow Statstcs Collecton Cost Usng Wldcard-Based Requests n SDNs Hongl Xu, Member, IEEE, Zhuolong Yu, Chen Qan, Member, IEEE,

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Scheduling and queue management. DigiComm II

Scheduling and queue management. DigiComm II Schedulng and queue management Tradtonal queung behavour n routers Data transfer: datagrams: ndvdual packets no recognton of flows connectonless: no sgnallng Forwardng: based on per-datagram forwardng

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Avoiding congestion through dynamic load control

Avoiding congestion through dynamic load control Avodng congeston through dynamc load control Vasl Hnatyshn, Adarshpal S. Seth Department of Computer and Informaton Scences, Unversty of Delaware, Newark, DE 976 ABSTRACT The current best effort approach

More information

Can Congestion Control and Traffic Engineering Be at Odds?

Can Congestion Control and Traffic Engineering Be at Odds? Can Congeston Control and Traffc Engneerng Be at Odds? Jayue He Dept. of EE, Prnceton Unversty Emal: jhe@prnceton.edu Mung Chang Dept. of EE, Prnceton Unversty Emal: changm@prnceton.edu Jennfer Rexford

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics

A Hybrid Genetic Algorithm for Routing Optimization in IP Networks Utilizing Bandwidth and Delay Metrics A Hybrd Genetc Algorthm for Routng Optmzaton n IP Networks Utlzng Bandwdth and Delay Metrcs Anton Redl Insttute of Communcaton Networks, Munch Unversty of Technology, Arcsstr. 21, 80290 Munch, Germany

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

A fair buffer allocation scheme

A fair buffer allocation scheme A far buffer allocaton scheme Juha Henanen and Kalev Klkk Telecom Fnland P.O. Box 228, SF-330 Tampere, Fnland E-mal: juha.henanen@tele.f Abstract An approprate servce for data traffc n ATM networks requres

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

On the Exact Analysis of Bluetooth Scheduling Algorithms

On the Exact Analysis of Bluetooth Scheduling Algorithms On the Exact Analyss of Bluetooth Schedulng Algorth Gl Zussman Dept. of Electrcal Engneerng Technon IIT Hafa 3000, Israel glz@tx.technon.ac.l Ur Yechal Dept. of Statstcs and Operatons Research School of

More information

Pricing Network Resources for Adaptive Applications in a Differentiated Services Network

Pricing Network Resources for Adaptive Applications in a Differentiated Services Network IEEE INFOCOM Prcng Network Resources for Adaptve Applcatons n a Dfferentated Servces Network Xn Wang and Hennng Schulzrnne Columba Unversty Emal: {xnwang, schulzrnne}@cs.columba.edu Abstract The Dfferentated

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems:

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems: Speed/RAP/CODA Presented by Octav Chpara Real-tme Systems Many wreless sensor network applcatons requre real-tme support Survellance and trackng Border patrol Fre fghtng Real-tme systems: Hard real-tme:

More information

Reclaiming the Brain: Useful OpenFlow Functions in the Data Plane

Reclaiming the Brain: Useful OpenFlow Functions in the Data Plane Reclamng the Bran: Useful OpenFlow Functons n the Data Plane Lron chff (Tel Avv Un, Israel) Mchael Borokhovch (UT Austn, Unted tates) tefan chmd (TU Berln & T-Labs, Germany) My Talk n One lde eparaton

More information

A Saturation Binary Neural Network for Crossbar Switching Problem

A Saturation Binary Neural Network for Crossbar Switching Problem A Saturaton Bnary Neural Network for Crossbar Swtchng Problem Cu Zhang 1, L-Qng Zhao 2, and Rong-Long Wang 2 1 Department of Autocontrol, Laonng Insttute of Scence and Technology, Benx, Chna bxlkyzhangcu@163.com

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

Report on On-line Graph Coloring

Report on On-line Graph Coloring 2003 Fall Semester Comp 670K Onlne Algorthm Report on LO Yuet Me (00086365) cndylo@ust.hk Abstract Onlne algorthm deals wth data that has no future nformaton. Lots of examples demonstrate that onlne algorthm

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Intro. Iterators. 1. Access

Intro. Iterators. 1. Access Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION

Overview. Basic Setup [9] Motivation and Tasks. Modularization 2008/2/20 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Overvew 2 IMPROVED COVERAGE CONTROL USING ONLY LOCAL INFORMATION Introducton Mult- Smulator MASIM Theoretcal Work and Smulaton Results Concluson Jay Wagenpfel, Adran Trachte Motvaton and Tasks Basc Setup

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Routing in Degree-constrained FSO Mesh Networks

Routing in Degree-constrained FSO Mesh Networks Internatonal Journal of Hybrd Informaton Technology Vol., No., Aprl, 009 Routng n Degree-constraned FSO Mesh Networks Zpng Hu, Pramode Verma, and James Sluss Jr. School of Electrcal & Computer Engneerng

More information

Topology Design using LS-TaSC Version 2 and LS-DYNA

Topology Design using LS-TaSC Version 2 and LS-DYNA Topology Desgn usng LS-TaSC Verson 2 and LS-DYNA Wllem Roux Lvermore Software Technology Corporaton, Lvermore, CA, USA Abstract Ths paper gves an overvew of LS-TaSC verson 2, a topology optmzaton tool

More information

Game Based Virtual Bandwidth Allocation for Virtual Networks in Data Centers

Game Based Virtual Bandwidth Allocation for Virtual Networks in Data Centers Avaable onlne at www.scencedrect.com Proceda Engneerng 23 (20) 780 785 Power Electroncs and Engneerng Applcaton, 20 Game Based Vrtual Bandwdth Allocaton for Vrtual Networks n Data Centers Cu-rong Wang,

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan, Praveen Yalagandula, Mke Dahln, Yn Zhang Mcrosoft Research HP Labs The Unversty of Texas at Austn Abstract We present, a self-tunng,

More information

High Performance DiffServ Mechanism for Routers and Switches: Packet Arrival Rate Based Queue Management for Class Based Scheduling

High Performance DiffServ Mechanism for Routers and Switches: Packet Arrival Rate Based Queue Management for Class Based Scheduling Hgh Performance DffServ Mechansm for Routers and Swtches: Packet Arrval Rate Based Queue Management for Class Based Schedulng Bartek Wydrowsk and Moshe Zukerman ARC Specal Research Centre for Ultra-Broadband

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources

On the Fairness-Efficiency Tradeoff for Packet Processing with Multiple Resources On the Farness-Effcency Tradeoff for Packet Processng wth Multple Resources We Wang, Chen Feng, Baochun L, and Ben Lang Department of Electrcal and Computer Engneerng, Unversty of Toronto {wewang, cfeng,

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Improving Low Density Parity Check Codes Over the Erasure Channel. The Nelder Mead Downhill Simplex Method. Scott Stransky

Improving Low Density Parity Check Codes Over the Erasure Channel. The Nelder Mead Downhill Simplex Method. Scott Stransky Improvng Low Densty Party Check Codes Over the Erasure Channel The Nelder Mead Downhll Smplex Method Scott Stransky Programmng n conjuncton wth: Bors Cukalovc 18.413 Fnal Project Sprng 2004 Page 1 Abstract

More information

Gateway Algorithm for Fair Bandwidth Sharing

Gateway Algorithm for Fair Bandwidth Sharing Algorm for Far Bandwd Sharng We Y, Rupnder Makkar, Ioanns Lambadars Department of System and Computer Engneerng Carleton Unversty 5 Colonel By Dr., Ottawa, ON KS 5B6, Canada {wy, rup, oanns}@sce.carleton.ca

More information

THere are increasing interests and use of mobile ad hoc

THere are increasing interests and use of mobile ad hoc 1 Adaptve Schedulng n MIMO-based Heterogeneous Ad hoc Networks Shan Chu, Xn Wang Member, IEEE, and Yuanyuan Yang Fellow, IEEE. Abstract The demands for data rate and transmsson relablty constantly ncrease

More information

Comparisons of Packet Scheduling Algorithms for Fair Service among Connections on the Internet

Comparisons of Packet Scheduling Algorithms for Fair Service among Connections on the Internet Comparsons of Packet Schedulng Algorthms for Far Servce among Connectons on the Internet Go Hasegawa, Takahro Matsuo, Masayuk Murata and Hdeo Myahara Department of Infomatcs and Mathematcal Scence Graduate

More information

Quantifying Responsiveness of TCP Aggregates by Using Direct Sequence Spread Spectrum CDMA and Its Application in Congestion Control

Quantifying Responsiveness of TCP Aggregates by Using Direct Sequence Spread Spectrum CDMA and Its Application in Congestion Control Quantfyng Responsveness of TCP Aggregates by Usng Drect Sequence Spread Spectrum CDMA and Its Applcaton n Congeston Control Mehd Kalantar Department of Electrcal and Computer Engneerng Unversty of Maryland,

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan #, Praveen Yalagandula, Mke Dahln #, Yn Zhang # # Unversty of Texas at Austn HP Labs Abstract We present, a self-tunng, bandwdth-aware

More information

Repeater Insertion for Two-Terminal Nets in Three-Dimensional Integrated Circuits

Repeater Insertion for Two-Terminal Nets in Three-Dimensional Integrated Circuits Repeater Inserton for Two-Termnal Nets n Three-Dmensonal Integrated Crcuts Hu Xu, Vasls F. Pavlds, and Govann De Mchel LSI - EPFL, CH-5, Swtzerland, {hu.xu,vasleos.pavlds,govann.demchel}@epfl.ch Abstract.

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

WIRELESS communication technology has gained widespread

WIRELESS communication technology has gained widespread 616 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 4, NO. 6, NOVEMBER/DECEMBER 2005 Dstrbuted Far Schedulng n a Wreless LAN Ntn Vadya, Senor Member, IEEE, Anurag Dugar, Seema Gupta, and Paramvr Bahl, Senor

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

Analysis of Collaborative Distributed Admission Control in x Networks

Analysis of Collaborative Distributed Admission Control in x Networks 1 Analyss of Collaboratve Dstrbuted Admsson Control n 82.11x Networks Thnh Nguyen, Member, IEEE, Ken Nguyen, Member, IEEE, Lnha He, Member, IEEE, Abstract Wth the recent surge of wreless home networks,

More information

Priority-Based Scheduling Algorithm for Downlink Traffics in IEEE Networks

Priority-Based Scheduling Algorithm for Downlink Traffics in IEEE Networks Prorty-Based Schedulng Algorthm for Downlnk Traffcs n IEEE 80.6 Networks Ja-Mng Lang, Jen-Jee Chen, You-Chun Wang, Yu-Chee Tseng, and Bao-Shuh P. Ln Department of Computer Scence Natonal Chao-Tung Unversty,

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

MobileGrid: Capacity-aware Topology Control in Mobile Ad Hoc Networks

MobileGrid: Capacity-aware Topology Control in Mobile Ad Hoc Networks MobleGrd: Capacty-aware Topology Control n Moble Ad Hoc Networks Jle Lu, Baochun L Department of Electrcal and Computer Engneerng Unversty of Toronto {jenne,bl}@eecg.toronto.edu Abstract Snce wreless moble

More information

Adaptive Load Shedding for Windowed Stream Joins

Adaptive Load Shedding for Windowed Stream Joins Adaptve Load Sheddng for Wndowed Stream Jons Bu gra Gedk College of Computng, GaTech bgedk@cc.gatech.edu Kun-Lung Wu, Phlp Yu T.J. Watson Research, IBM {klwu,psyu}@us.bm.com Lng Lu College of Computng,

More information

QoS-aware routing for heterogeneous layered unicast transmissions in wireless mesh networks with cooperative network coding

QoS-aware routing for heterogeneous layered unicast transmissions in wireless mesh networks with cooperative network coding Tarno et al. EURASIP Journal on Wreless Communcatons and Networkng 214, 214:81 http://wcn.euraspournals.com/content/214/1/81 RESEARCH Open Access QoS-aware routng for heterogeneous layered uncast transmssons

More information