Joint Server Selection and Routing for Geo-Replicated Services

Srinivas Narayana, Wenjie Jiang, Jennifer Rexford, Mung Chiang
Princeton University

Abstract

The performance and costs of geo-replicated online services depend on which data centers handle user requests, and which wide-area paths carry traffic. To provide good performance at reasonable cost, service providers adapt the mapping of user requests to data centers (e.g., through DNS), and the routing of responses back to users (i.e., through multi-homed route control). Mapping and routing are typically managed independently, with mapping having limited visibility into routing decisions, response path latencies, and bandwidth costs. However, poor visibility and uncoordinated decision-making can lead to worse performance and higher costs when compared to a joint decision. In this paper, we argue that mapping and routing should continue to operate modularly, but cooperate towards service-wide performance and cost goals. Our main contribution is a distributed algorithm to steer cooperating, yet functionally separate, mapping and routing provably towards a globally optimal operating point. Trace-based evaluations on an operational CDN show that the algorithm converges to within 1% of optimum in 3-6 iterations.

I. INTRODUCTION

Online service providers (OSPs) carefully optimize the performance their customers experience, since even small increases in user-perceived latency can have significant impact on revenue [1]. To improve performance and reliability, OSPs run services out of multiple geographically distributed service locations (e.g., data centers), each typically peering with multiple ISPs. However, good performance must be balanced against operational costs resulting from the electricity and bandwidth needed to serve large amounts of data on a daily basis [2], [3].

Client traffic loads, path latencies, and electricity and bandwidth costs vary over space and time. To provide good performance at reasonable costs, OSPs adapt the wide-area paths taken by traffic in two key ways (Fig. 1): mapping requests to specific locations (e.g., DNS redirection [4], [5]), and routing responses back to clients through peer selection at the egress to the Internet (e.g., multi-homed routing [2], [6]). These knobs allow an OSP to exploit path and cost diversity to adapt to time-varying demands, path performance, and costs.

[Fig. 1. An online service with multiple data centers across the Internet. Figure labels: data centers, multi-homed Internet connectivity, Internet, clients (IP prefixes), mapping nodes (DNS resolvers).]

To the best of our knowledge, mapping and routing are managed independently: mapping decisions are made without full visibility into the path metrics, loads and decisions of the (multi-homed) routing system. However, as we show in Section II, misaligned objectives, lack of information visibility and uncoordinated decision-making can lead to bad performance and high costs. For example, the mapping may direct user requests to locations with limited upstream bandwidth, poor routing performance [4], [7], or expensive ISP connectivity [2]. As a Google CDN study [4] shows, clients can experience latencies inflated by several tens of milliseconds even when served by the geographically closest node, in part due to routing inefficiencies. In such cases, better visibility into routing decisions and path performance can allow poorly performing clients to be mapped to other locations.

A natural approach to improve the situation is to jointly compute mapping and routing decisions; after all, the OSP may have administrative control over both. However, the ability to separately compute decisions is desirable for two key reasons. First, an OSP may not be running its own infrastructure on all parts of its global traffic management system. For example, an OSP may employ an external mapping service, e.g., [5], [8]. In such cases, the OSP may only dictate high-level policies but may be unable to compute or configure the mapping decisions directly.
Second, a large OSP may already have operational mapping and routing infrastructure working independently [4], [7]. Hence, instead of building a monolithic system, we argue that mapping and routing should continue to operate modularly, but coordinate to achieve the service-wide performance and cost of a joint system.

Our goal is to construct a distributed algorithm to coordinate administratively separate mapping and routing systems, and provably achieve global performance and cost goals. Unlike prior works on OSP joint traffic engineering [2], [7], [9], this algorithm retains the modularity of mapping and routing while addressing major concerns that couple their decisions (e.g., link capacities). Our contributions are as follows.

Identifying challenges in coordination. We show why a naive distributed algorithm, where mapping and routing iteratively optimize decisions even with full information visibility, can lead to bad performance and high costs (Section II).

Distributed mapping and routing algorithm. We present a distributed algorithm that systematically addresses the causes of suboptimality, hence allowing modular mapping and routing to provably converge to a global optimum (Section IV). The solution is summarized in Fig. 2. Our optimization model captures considerations like traffic demands, path performance, bandwidth costs, mapping policies [5], [8], and link capacities (Section III).

Evaluation on CoralCDN traces. We evaluate the distributed algorithm on traces from CoralCDN [10], an operational content distribution network, and show that it converges to within 1% of the global optimum in 3-6 iterations (Section V).

[Fig. 2. Distributed mapping-routing algorithm (Section IV). Data center i solves local problem ROUTING_i to obtain multi-homed routing decisions, and mapping nodes solve MAPPING to compute global mapping decisions. They then exchange measured traffic volumes, path performance, and computed mapping and routing decisions, and iteratively reoptimize until convergence.]

II. COORDINATING MAPPING & ROUTING

Performing client-mapping and network-routing independently can miss opportunities to improve performance or reduce costs. Through an example OSP which has two data centers (DCs), each with two ISP links, and one client IP prefix, we highlight some challenges that hinder independent mapping and routing systems from arriving at good collective decisions.

Misaligned objectives can lead to suboptimal decisions. Typically, mapping and routing systems work with different objectives, e.g., mapping performs latency and load-based DC selection, while routing considers end-to-end performance and bandwidth costs to choose a peer to forward responses. In Fig. 3(a), the routing system at each DC picks the peer with the least cost per unit bandwidth, while the mapping system picks the DC with the least propagation delay given current routing choices, resulting in a situation where the service has neither globally optimal cost nor optimal latency.

Incomplete visibility can lead to suboptimal decisions. In Fig. 3(b), the mapping and routing systems both optimize latency, and users are sending traffic at the rate of 5 Gb/s. Without information on link loads and capacities, the mapping system directs too much traffic to the DC on the left, leading to large queuing delays and packet losses. If the mapping system used information about link capacities, it could easily direct most traffic to the alternate DC with ample spare capacity, and only slightly worse latency.

Coupled constraints can lead to suboptimal equilibria. Even if mapping and routing have aligned objectives (e.g., minimize latency) and complete visibility, optimizing their decisions separately (taking turns) can still lead to globally suboptimal situations. In Fig. 4(a), mapping and routing are locally optimal given each other's decisions, as traffic is served through the least latency paths respecting link capacities. However, in the globally optimal allocation, all traffic is served by the DC on the right, using both peers.

Bad routing decisions for prefixes with no traffic contribution to a DC can lead to suboptimal equilibria. Consider the scenario in Fig. 4(b), where a mapping initialization (e.g., from geo-proximity) directs all traffic to the DC on the left. No traffic from this user reaches the other DC, so all routing decisions there are equally good. Suppose this DC chooses to route traffic through the 100 ms path. Then, these mapping and routing decisions are locally optimal to each other, preventing the routing from exploring the globally optimal 50 ms path.

[Fig. 3. (a) Mapping is suboptimal because of misaligned objectives; (b) Mapping is suboptimal because routing does not share link-related information. Legend: route-selected, alternate route, map-selected.]

[Fig. 4. (a) Separate mapping and routing can be inefficient under coupled constraints; (b) Routing is suboptimal when mapping sends no traffic to a DC. Legend: route-selected, alternate route, map-selected.]

III. OSP & OPTIMIZATION MODELS

In this section, we introduce the model for (i) OSP mapping and routing decisions, (ii) performance and cost goals, and (iii) joint cost-performance optimization.

A. Online Service Provider Model

Network. The OSP network model is summarized in Fig. 1. An online service runs in a set $I$ of service locations (i.e., data centers, caches, replicas, availability zones). Every location $i$ is connected to the Internet through a set of ISP links, which we denote by $J_i$. Let $C$ be the set of clients, where each client is an aggregate of real users (in our experiments, these are IP prefixes). Clients can be directed to different locations depending on their proximity and the current load on the compute infrastructure. Such mapping of users to locations occurs through a set of mapping nodes, such as authoritative DNS servers or HTTP proxies. Once a request is serviced at location $i$, the egress router picks one or more ISP links $j \in J_i$ to send responses. The total traffic that a link $j$ can support is limited by its capacity $\mathrm{cap}_{ij}$.

Mapping Decisions (choice of service location). We denote by $\alpha_{ci}$ the proportion of traffic from client $c$ mapped to location $i$. We require that $\sum_i \alpha_{ci} = 1$ for all $c$, where $\alpha_{ci} \in [0,1]$. Different users in the same IP prefix may be directed to different locations simultaneously. The mapping decisions are realized by a flexible mapping service, e.g., [5], [8], [11]. Each user is served by its local mapping node, e.g., a local DNS server, and mapped to the designated service location.

Routing Decisions (choice of egress ISP). Each service location is connected to the Internet by several links to different ISPs. We use the index $j$ to denote a link that connects a service location to an ISP. For every location $i$, we denote by
$\beta_{cij}$ the proportion of traffic for client $c$ served by location $i$ that is sent over link $j \in J_i$, where $\sum_{j \in J_i} \beta_{cij} = 1$. The response-routing decision only controls the first hop, and not the entire path. Typical routing decisions $\beta_{cij}$ are integers in $\{0,1\}$, since one next hop is chosen to reach each client. We relax this constraint and allow OSPs to freely split client traffic across links. In practice, fractional routing can be realized by hash-based splitting or multi-homing agents [12]. Each data center is a stub autonomous system with no transit service, hence routing changes do not need BGP to reconverge.

B. Performance and Cost Goals

User-perceived latency is a key performance metric for online services, since even small increments can have significant effects on revenue [1]. User-perceived latency can depend on round-trip delays between users and data centers, processing time within the compute infrastructure, TCP dynamics, and even browser page loading time. In this paper, we focus on optimizing the round-trip latency for user requests, which impacts the completion time of short TCP flows [13] much more than metrics like bandwidth and packet loss do.

Average end-to-end latency objective. We use average end-to-end path RTT (ms/request) as our performance metric. Path latencies depend on which locations handle requests, which wide-area paths deliver traffic, and packet queuing and transmission delays along paths. A service provider can obtain statistically averaged latencies through active measurements [14] or latency prediction tools [15]. We use $\mathrm{perf}_{cij}$ to denote the latency from location $i$ to client $c$, when $i$ picks link $j \in J_i$ to deliver traffic. Then the performance objective is:

$$\mathrm{perf} = \sum_{c \in C} \Big( \mathrm{vol}_c \sum_{i \in I} \sum_{j \in J_i} \alpha_{ci}\,\beta_{cij}\,\mathrm{perf}_{cij} \Big) \Big/ \sum_{c \in C} \mathrm{vol}_c,$$

where $\mathrm{vol}_c$ is the total request volume from client $c$. We ignore the impact of OSP traffic shifts on latency as long as data center-ISP links operate safely under capacity. This is a standard simplification used in prior works [2], [6].

Average cost per request objective.
Operational costs are an important consideration for OSPs [2], [3]. In this paper we focus on ISP bandwidth costs, which can be significant on their own for large OSPs [2]. For tractability, we assume a cost function which is linear in the amount of data sent. The cost metric is average cost per request ($/request):

$$\mathrm{cost} = \sum_{c \in C} \Big( \mathrm{vol}_c \sum_{i \in I} \sum_{j \in J_i} \alpha_{ci}\,\beta_{cij}\,\mathrm{price}_{ij} \Big) \Big/ \sum_{c \in C} \mathrm{vol}_c,$$

where $\mathrm{price}_{ij}$ is the cost per request on link $j \in J_i$ of data center $i$. In practice, ISPs employ complex pricing functions (e.g., 95th percentile charging). However, optimizing a linear cost on every charging interval can reduce monthly 95th percentile costs [2].

C. Joint Cost-Performance Optimization

The goals of minimizing latency and minimizing costs can often be at odds. For example, an ISP with richer Internet peering may offer lower latencies to more clients, albeit at higher cost, than its competitors [2]. To strike a balance between cost and performance, we introduce a weight $K$ that reflects the amount of additional cost ($/request) that an OSP is willing to pay for one unit of performance improvement (ms/request). Hence, the OSP's goal is to minimize $\mathrm{cost} + K \cdot \mathrm{perf}$, which captures Pareto-optimal tradeoffs between perf and cost.

For additional flexibility, we allow OSPs to express mapping policies in terms of traffic split ratios $w_i$ (with tolerance $\varepsilon_i$) or absolute request caps $B_i$ per data center $i$. This allows an OSP to balance load between data centers in a weighted round-robin fashion [8] or limit load at certain sites [5]. We formulate the joint optimization problem as follows:

$$
\begin{aligned}
\textbf{GLOBAL:}\quad \text{minimize}\quad & \mathrm{cost} + K \cdot \mathrm{perf} && \text{(1a)}\\
\text{subject to}\quad & \sum_{c,\,i:\,j \in J_i} \mathrm{vol}_c\,\alpha_{ci}\,\beta_{cij} \le \mathrm{cap}_{ij}, \quad \forall j && \text{(1b)}\\
& \Big| \frac{\sum_{c} \mathrm{vol}_c\,\alpha_{ci}}{\sum_{c} \mathrm{vol}_c} - w_i \Big| \le \varepsilon_i, \quad \forall i && \text{(1c)}\\
& \sum_{c} \mathrm{vol}_c\,\alpha_{ci} \le B_i, \quad \forall i && \text{(1d)}\\
& \sum_{i} \alpha_{ci} = 1, \quad \forall c && \text{(1e)}\\
& \sum_{j \in J_i} \beta_{cij} = 1, \quad \forall c, i && \text{(1f)}\\
\text{variables}\quad & \alpha_{ci} \ge 0,\ \forall c, i; \qquad \beta_{cij} \ge 0,\ \forall c, i, j
\end{aligned}
$$

where (1b) ensures the aggregate link traffic does not exceed capacity [2], [5], [9], [16]. Constraints (1c) and (1d) enforce traffic split ratios and request caps respectively. Constraints (1e) and (1f) ensure that proportions sum to one.
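As a minimal sketch of how the objective (1a) can be evaluated for given mapping and routing decisions, the following Python helper computes cost, perf, their weighted sum, and the load on each link. The function name `evaluate`, the nested-dictionary layout, and every number in the toy instance are illustrative assumptions, not values or code from the paper.

```python
def evaluate(vol, alpha, beta, perf, price, K):
    """Return (cost, perf, cost + K * perf, link_load) for fixed decisions.

    vol[c]         request volume of client c
    alpha[c][i]    share of client c mapped to location i
    beta[c][i][j]  share of (c, i) traffic routed over link j
    perf[c][i][j]  latency in ms, price[i][j] cost per request
    """
    total = sum(vol.values())
    cost = latency = 0.0
    load = {}
    for c, v in vol.items():
        for i, a in alpha[c].items():
            for j, b in beta[c][i].items():
                share = v * a * b  # requests of c served at i over link j
                cost += share * price[i][j]
                latency += share * perf[c][i][j]
                load[(i, j)] = load.get((i, j), 0.0) + share
    cost /= total
    latency /= total
    return cost, latency, cost + K * latency, load


# Hypothetical instance: one client, two data centers, two links each.
vol = {"c1": 10.0}
alpha = {"c1": {"dc1": 0.5, "dc2": 0.5}}
beta = {"c1": {"dc1": {"a": 1.0, "b": 0.0}, "dc2": {"a": 0.5, "b": 0.5}}}
perf = {"c1": {"dc1": {"a": 40.0, "b": 100.0}, "dc2": {"a": 90.0, "b": 250.0}}}
price = {"dc1": {"a": 5.0, "b": 1.0}, "dc2": {"a": 10.0, "b": 12.0}}

cost, latency, objective, load = evaluate(vol, alpha, beta, perf, price, K=0.5)
```

The returned `load` dictionary can be compared against link capacities to check constraint (1b); the policy constraints (1c) and (1d) would need the per-location totals, which this sketch leaves out.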
The problem (1) is a linear program in $\alpha$ and in $\beta$ separately, but it is non-convex when both are variables. As such, it cannot be solved using standard convex programming. However, it can be converted into an equivalent LP (whose variables couple mapping and routing) and solved efficiently (Appendix A).

IV. OPTIMAL DISTRIBUTED ALGORITHM

We design effective ways to avoid suboptimal equilibria (Section IV-A) and present an optimal distributed algorithm (Section IV-B).

A. Avoiding Suboptimal Equilibria

A straightforward distributed solution to the joint problem (1) is to alternate between the mapping nodes optimizing over $\alpha$ (given routing decisions $\beta$), and the routing nodes optimizing over $\beta$ (given mapping decisions $\alpha$). However, such separate optimizations often lead to local optima (Section II).

We systematically address the causes of suboptimality from Section II. First, we align mapping and routing objectives explicitly by constructing both to solve the joint problem (1) iteratively. Further, we mandate that each system has visibility into the information needed to solve the joint problem, by requiring specific local measurements and exchange of statistics (Section IV-B).

Next, we illustrate how we mitigate the coupled constraints problem through an example network in Fig. 5(a). Let $\alpha$ and $\beta$ be the mapping and routing decisions respectively. If the client has one unit of demand, optimal latency is achieved when $\alpha = 1, \beta = 1/4$. However, by separate optimizations, we can end up in local optima, e.g., $\alpha = 1/2, \beta = 1/2$, which are valid but globally suboptimal mapping and routing profiles.
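The local optimum described above can be reproduced with a short sketch of alternating best responses. The instance below is an assumed reading of Fig. 5(a), pieced together from the text and the surviving figure labels: one data center receives a share $\alpha$ of a unit demand and has a 50 ms link of capacity 1/4 (share $\beta$) and a 60 ms link of capacity 3/4; the other data center is reached over a 75 ms path with ample capacity. The function names and the closed-form best responses are illustrative, not the paper's implementation.

```python
def latency(alpha, beta):
    """Average latency for unit demand under the assumed Fig. 5(a) instance."""
    dc1 = beta * 50 + (1 - beta) * 60
    return alpha * dc1 + (1 - alpha) * 75


def best_alpha(beta):
    """Mapping step: DC1 is always faster than 75 ms, so send as much
    traffic there as the two capacity limits allow for this beta."""
    limit = 1.0
    if beta > 0:
        limit = min(limit, 0.25 / beta)        # 50 ms link: alpha*beta <= 1/4
    if beta < 1:
        limit = min(limit, 0.75 / (1 - beta))  # 60 ms link: alpha*(1-beta) <= 3/4
    return limit


def best_beta(alpha, beta):
    """Routing step: fill the 50 ms link first, up to its capacity."""
    if alpha == 0:
        return beta  # no traffic at DC1, any split is equally good
    return min(1.0, 0.25 / alpha)


def alternate(alpha, beta, rounds=20):
    """Take turns optimizing routing and mapping until nothing changes."""
    for _ in range(rounds):
        new_beta = best_beta(alpha, beta)
        new_alpha = best_alpha(new_beta)
        if (new_alpha, new_beta) == (alpha, beta):
            break
        alpha, beta = new_alpha, new_beta
    return alpha, beta


stuck = alternate(0.5, 0.0)      # feasible starting point
stuck_latency = latency(*stuck)
best_latency = latency(1.0, 0.25)
```

Starting from $\alpha = 1/2$, the iteration stops at $\alpha = 1/2, \beta = 1/2$ because the saturated 50 ms link blocks either side from moving alone, even though $\alpha = 1, \beta = 1/4$ has lower latency under these assumed numbers.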

[Fig. 5. An example of local optima with separate optimization. Panels: (a) example network; (b) GLOBAL; (c) GLOBAL_Φ.]

Fig. 5(b) visualizes how the local optimum is reached from an initial starting point. The color-shaded region represents the feasible decision space. The curved boundaries reflect the capacity constraints of the 50 ms and 60 ms links. The color coding indicates the objective value, i.e., latency, where red is high and blue is low. All points on these boundaries except at $\alpha = 1$ are local optima, which can be very inefficient. In general, the feasible space determined by link capacity and mapping policy constraints is non-convex.

Motivated by the observation that an ill-shaped feasible region does not enable efficient distributed optimization, we aim to work around the boundaries implied by the hard capacity constraints. (We show later, in Appendix B, that these are in fact the only constraints leading to suboptimal equilibria.) We relax the link capacity constraint by introducing a penalty function. In particular, we use a piecewise-linear function $\Phi_{ij}(\cdot)$, often used in ISP traffic engineering [17], to capture costs due to high link utilization. The penalty term $\Phi$ improves the prediction of client performance by capturing the effects of link congestion on queuing delay. In the refined global optimization, namely GLOBAL$_\Phi$, we replace the objective (1a) by

$$\mathrm{obj}_g(\alpha,\beta) = \sum_{c,i,j} \alpha_{ci}\,\beta_{cij}\,\mathrm{vol}_c\,\frac{\mathrm{price}_{ij} + K \cdot \mathrm{perf}_{cij}}{\sum_{c'} \mathrm{vol}_{c'}} + \sum_{i,j} \Phi_{ij}\Big(\sum_{c} \alpha_{ci}\,\beta_{cij}\,\mathrm{vol}_c\Big) \Big/ |J|, \qquad (2)$$

which can be viewed as a joint cost-performance optimization with link congestion taken into consideration. The constraint (1b) that caused the suboptimal equilibria is now removed from the problem. Fig. 5(c) shows that the solution of this refined problem formulation converges to the global optimum by alternately fixing one dimension and optimizing the other. The whole decision space is now feasible, but violating link capacity incurs a high cost.
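To make the penalty term concrete, here is a minimal sketch of one piecewise-linear link penalty. The paper does not list breakpoints or slopes in this section, so the values below are an assumption: they are the utilization breakpoints (1/3, 2/3, 9/10, 1, 11/10) and slopes (1, 3, 10, 70, 500, 5000) commonly associated with the Fortz-Thorup traffic engineering cost. The name `phi` and the table `SEGMENTS` are hypothetical.

```python
# (utilization where the segment starts, slope on that segment).
# Assumed Fortz-Thorup style values; not taken from the paper.
SEGMENTS = [(0.0, 1), (1 / 3, 3), (2 / 3, 10), (0.9, 70), (1.0, 500), (1.1, 5000)]


def phi(load, cap):
    """Convex, increasing penalty for carrying `load` on a link of capacity `cap`."""
    u = load / cap  # link utilization
    total = 0.0
    for k, (start, slope) in enumerate(SEGMENTS):
        end = SEGMENTS[k + 1][0] if k + 1 < len(SEGMENTS) else float("inf")
        if u <= start:
            break
        # Add the part of this segment that lies below the utilization u.
        total += slope * (min(u, end) - start)
    return cap * total


half_full = phi(0.5, 1.0)
at_capacity = phi(1.0, 1.0)
overloaded = phi(1.2, 1.0)
```

Because the slope grows sharply near and beyond full utilization, exceeding capacity is allowed but expensive, which is the effect the relaxed objective relies on.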
Note that the global optima shown in Figs. 5(b) and 5(c) are different, as the congestion cost is not considered in the original formulation.

B. Distributed Mapping and Routing

We present a distributed solution that builds on the existing practice of computing mapping and routing decisions independently. The key idea is to design optimization problems for each system, i.e., mapping and routing, in which (i) they have aligned objectives, and (ii) each system computes only its own (local) decision variables, using local constraints and exchange of appropriate statistics. In particular, mapping nodes collectively solve the following mapping problem:

$$
\begin{aligned}
\textbf{MAPPING}(\beta):\quad \text{minimize}\quad & \mathrm{obj}_g && \text{(3a)}\\
\text{subject to}\quad & \Big| \frac{\sum_{c} \alpha_{ci}\,\mathrm{vol}_c}{\sum_{c} \mathrm{vol}_c} - w_i \Big| \le \varepsilon_i, \quad \forall i && \text{(3b)}\\
& \sum_{c} \alpha_{ci}\,\mathrm{vol}_c \le B_i, \quad \forall i && \text{(3c)}\\
& \sum_{i} \alpha_{ci} = 1, \quad \forall c && \text{(3d)}\\
\text{variables}\quad & \alpha_{ci} \ge 0,\ \forall c, i
\end{aligned}
$$

Further, edge routers together solve the following problem:

$$
\begin{aligned}
\textbf{ROUTING}(\alpha):\quad \text{minimize}\quad & \mathrm{obj}_g && \text{(4a)}\\
\text{subject to}\quad & \sum_{j \in J_i} \beta_{cij} = 1, \quad \forall c, i && \text{(4b)}\\
\text{variables}\quad & \beta_{cij} \ge 0,\ \forall c, i, j
\end{aligned}
$$

Note that ROUTING does not need to know about MAPPING's load balancing policies, since the policy constraints (3b) and (3c) appear only in MAPPING. Further, MAPPING and ROUTING are both convex optimizations, which can be solved efficiently.

A distributed solution, the alternate projection algorithm, proceeds as follows: given routing decisions $\beta$ as input, mapping nodes solve (3) and optimize over $\alpha$, and given mapping decisions $\alpha$ as input, edge routers solve (4) and optimize over $\beta$. The two steps are carried out iteratively until the solutions converge. We allow the mapping and routing problems to be solved at different time scales, with the only requirement that (3) does not start until (4) is fully solved, and vice versa. This is easily ensured by having each component wait to receive inputs from the other before solving its own local optimization problem.

However, as we showed in Fig. 4(b), the alternate projection algorithm may in general still lead to suboptimal equilibria when some clients send no traffic to a data center. To mitigate
this no-traffic problem, we introduce a refinement to routing in addition to the optimality of (4). (We later prove the optimality of such a refined routing strategy in Appendix B.) In practical computation of routing decisions, we introduce an approximation to (4) by adding an infinitesimally small demand to every client-server pair with zero traffic, i.e., $\alpha_{ci} \leftarrow \delta$ if $\alpha_{ci} = 0$, where $\delta$ is a small positive constant [18]. This approximation is often used to ease the computation so that standard optimization techniques can be applied. Clients do not need to send real traffic, which makes this approach practically appealing.

We show that the alternate projection algorithm with the refinements introduced above provably converges to the global optimum of GLOBAL$_\Phi$:

Theorem 1: Alternate projections of MAPPING and ROUTING converge to an optimal solution of GLOBAL$_\Phi$.

Note that ROUTING can be further decomposed into local versions at each data center, as both its objective and constraints can be decoupled, e.g., $\mathrm{obj}_g = \sum_i \mathrm{obj}_g^{(i)}$, where

$$\mathrm{obj}_g^{(i)} = \sum_{j \in J_i,\,c} \alpha_{ci}\,\beta_{cij}\,\mathrm{vol}_c\,\frac{\mathrm{price}_{ij} + K \cdot \mathrm{perf}_{cij}}{\sum_{c'} \mathrm{vol}_{c'}} + \sum_{j \in J_i} \Phi_{ij}\Big(\sum_{c} \alpha_{ci}\,\beta_{cij}\,\mathrm{vol}_c\Big) \Big/ |J|$$

is the local objective function at data center $i$. There is no need for coordination among data centers, and their decisions collectively attain the optimal solution to (4). The mapping problem (3) can also be distributed among mapping nodes; indeed, [5] studies this problem. We omit the details here.

The distributed algorithm is summarized in Fig. 2. Mapping nodes collectively measure client request rates, and solve the global mapping problem MAPPING using routing decisions and path latencies from the routing system. Each data center locally measures path latencies to clients, and solves ROUTING$_i$ using mapping decisions and traffic volumes from the mapping system. We assume that other optimization parameters (namely the configuration of data centers, peering links, link capacities and mapping policies) vary more slowly than the decision reoptimization time, and are globally up to date.

C. Optimality of Distributed Solution

Due to space limitations, we present a proof sketch and defer the formal proof to the Appendix. The proof proceeds in three steps. We first show that there exists an equilibrium point $(\alpha^*, \beta^*)$, such that $\alpha^*$ is a best response to $\beta^*$, and vice versa. We define the best response as an optimal projection (Def. 2), and the resulting equilibrium as a Nash equilibrium (Def. 3). We further prove that $(\alpha^*, \beta^*)$ is also an optimal solution to GLOBAL$_\Phi$, which is one of our main contributions and significantly generalizes previous results [18]. We finally show that the alternate projection algorithm leads to a Nash equilibrium point, which completes the proof. The rest of this section introduces these formalities to assist our proof.

First, we need a way to capture the marginal cost $f_{cij}$ of serving client $c$ from data center $i$ over link $j$:

Definition 1: Let the metric $f_{cij}$ be defined as

$$f_{cij}(\alpha_{ci}, \beta_{cij}) = \mathrm{vol}_c\,\frac{\mathrm{price}_{ij} + K \cdot \mathrm{perf}_{cij}}{\sum_{c'} \mathrm{vol}_{c'}} + \mathrm{vol}_c\,\Phi'_{ij}\Big(\sum_{c} \alpha_{ci}\,\beta_{cij}\,\mathrm{vol}_c\Big) \Big/ |J|,$$

where $\Phi'_{ij}$ denotes the slope of the penalty function $\Phi_{ij}$.

We next define the optimal projection of mapping and routing strategies, respectively:

Definition 2: (i) Given fixed routing decisions $\beta$, a set of mapping decisions $\alpha^*$ is called an optimal projection onto the mapping space if $\alpha^*$ is an optimal solution to (3). (ii) Given fixed mapping decisions $\alpha$, a set of routing decisions $\beta^*$ is called an optimal projection onto the routing space if, $\forall c, i$, we have $f_{cij}(\alpha_{ci}, \beta^*_{cij}) \le f_{cij'}(\alpha_{ci}, \beta^*_{cij'})$ for all $j, j' \in J_i$ such that $\beta^*_{cij} > 0$.

Lemma 1: If $\beta^*$ is an optimal projection given $\alpha$, then $\beta^*$ is also an optimal solution to (4).

It is easy to check that an optimal solution to (4) is not necessarily an optimal projection. We introduced an approximation in Section IV-B by adding a small traffic demand to every client-server pair. We can then calculate the optimal projection of routing by solving the refined (4) using standard optimization techniques.

We next define the notion of Nash equilibria.
Definition 3: A set of mapping and routing decisions $(\alpha^*, \beta^*)$ is called a Nash equilibrium if $\alpha^*$ is an optimal projection given $\beta^*$, and $\beta^*$ is an optimal projection given $\alpha^*$.

V. DISTRIBUTED SOLUTION EVALUATION

Experiment setup. CoralCDN [10] is a caching and content distribution platform running on PlanetLab. Our request-level trace, collected at 229 PlanetLab sites running Coral, consists of over 27 million requests and a terabyte of data, corresponding to March 31st, [...]. To emulate a multi-homed service deployment, we cluster PlanetLab sites in approximately the same metropolitan area and treat them as ISPs peering with the same service location, similar to the approach followed in [12]. Our setup has 12 service locations distributed in North America, Europe and Asia with 3-6 ISP connections each. Requests arrive from about [...] IP prefixes. For latencies to these prefixes, we use iPlane [14], a system that collects wide-area network statistics to Internet destinations from PlanetLab vantage points. To correlate the traffic and latency data, we narrow down our client set to [...] IP prefixes, which contribute 38% of the traffic (by requests) of the entire trace. The costs per unit bandwidth for ISP links are derived from a real-life cloud bandwidth pricing plan [19]. We refer the reader to Appendix D for full details.

Benefits of joint optimization. We evaluate the benefits of full information visibility for cost and perf in isolation. Conceptually, this corresponds to setting $K = 0$ and $K = \infty$ respectively in the GLOBAL optimization. For perf, we compare against a baseline mapping that optimizes for average latency assuming the per-client least-latency path is used to reach each client from each data center, and a routing that optimizes perf. From hourly optimizations on the CDN trace, we find that GLOBAL achieves 10 ms smaller perf than this baseline on average.

For cost, we use a baseline mapping that optimizes for average cost assuming the average per-link cost is incurred at each data center, and a routing that optimizes cost. GLOBAL achieves between 3% and 55% lower cost across the trace.

Optimality to within 1% in 3-6 iterations. Over 24 hourly problem instances through the trace, we record the number of iterations for convergence of mapping-routing alternate projections, and the difference (in %) of the converged objective from the optimal value of GLOBAL$_\Phi$, i.e., accuracy. As a convergence criterion, we test whether objective values from successive iterations are within a fixed percentage (say 1%) of each other. We examine two sets of initial conditions, namely cold start, i.e., mapping is initialized with uniform round-robin splitting, and adaptive, i.e., mapping starts from the converged setting of the previous hour. The results are summarized in Table I. We find that (i) convergence occurs in 3-6 iterations, (ii) adaptive mapping generally converges faster than cold start, and (iii) the converged objectives are indeed within the fixed tolerance (1%) of the GLOBAL$_\Phi$ optimum.

[TABLE I. SOLUTION OPTIMALITY AND CONVERGENCE. Columns: adaptive, cold start. Rows: # iterations, # instances, accuracy (%).]

Optimization runtime under 1 minute on a modern machine. A single iteration of the alternate projection algorithm with 12 locations, 49 links and [...] clients takes about 30 seconds on a desktop machine with a 2.40 GHz two-core CPU.

Impact of penalty function Φ on latency. We evaluate how much the value of perf increases when the distributed algorithm optimizes a latency objective with link utilization penalties ($\mathrm{obj}_g$, eqn. (2)), instead of optimizing perf directly. Note that this relaxation in general causes an increase in perf. Over hourly instances through the trace, we find a uniform increase of 10 ms in perf over optimizing perf directly using 100% of available link capacities. The gaps are 8 ms and 5 ms respectively when using only 95% and 90% of link capacities.

Picking a cost-performance tradeoff.
The GLOBAL optimization trades off performance for cost through a parameter $K$. However, it can be challenging to know what a good tradeoff looks like. To understand this better, we visualize a Pareto-optimal tradeoff curve between cost and perf for the day in Fig. 6, as $K$ varies. We see that there is a rich tradeoff space between cost and latency, with perf ranging between [...] ms (88% increase possible from optimal perf) and cost ranging between [...] $/GB (53% increase possible). A service provider could manually pick a sweet spot from this curve, for example, the value of $K$ corresponding to the knee at perf of 98 ms and cost of $0.175/GB.

[Fig. 6. Pareto-optimal tradeoff curve between latency (perf, in ms) and cost ($/GB).]

VI. RELATED WORK

There is a rich literature exploring the benefits of visibility between networks, applications and service infrastructure.

Provider networks and selfish applications. P4P [20] and IETF ALTO [21] advocate that applications should use hints from ISPs to improve both their performance and ISP control over their traffic. OSP traffic engineering differs from this setting since the OSP controls the user-replica mapping directly.

ISP and CDN collaboration. There have been many recent proposals suggesting collaboration between ISPs and caching infrastructure to improve service performance [22]-[24]. In contrast, OSP traffic engineering considers paths across ISPs connected from distributed service locations. This changes the problem because flexible content placement inside a single ISP can provide significant performance gains. However, replication across ISPs can be expensive for OSPs hosting dynamic content [7], [25]. Our optimization model more closely resembles earlier proposals to mitigate bad interaction between replica selection and ISP traffic engineering given the set of replicas [16], [18], [26]. However, these works consider routing within a single administrative domain, and do not incorporate flexible mapping policy constraints, e.g., (1c).

OSP joint traffic engineering.
Unlike Entact [2], our algorithm achieves joint system gains by coordinating modular request-mapping and response-routing systems. Our goals are similar to those of PECAN [7], but our model also considers aspects which intimately couple mapping and routing decisions (e.g., link capacities, bandwidth costs). Xu et al. [9] distribute the joint optimization for scale; however, their separation of variables does not decouple the mapping and routing systems.

VII. CONCLUSION

We address the problem of jointly optimizing request-mapping and response-routing for online services, while retaining the functional separation between them. We highlight the challenges in achieving the gains of a joint system in a distributed setting, and mitigate these challenges through a distributed algorithm that provably converges to a globally optimal point. Hence, OSPs can attain the benefits of a joint system, namely better performance at lower cost, while only slightly refining the behavior of existing infrastructure.

Acknowledgment. This work was funded by NSF grant [...], DARPA grant FA C-0254, and a gift from HP. We thank L. Vanbever, B. Balasubramanian, M. Wang and W. Lloyd for feedback and H. Madhyastha for iPlane traces.

REFERENCES

[1] J. Hamilton, "The cost of latency," /10/31/TheCostOfLatency.aspx.
[2] Z. Zhang, M. Zhang, A. Greenberg, Y. C. Hu, R. Mahajan, and B. Christian, "Optimizing cost and performance in online service provider networks," in Proc. NSDI, 2010.

[3] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, "Cutting the electric bill for Internet-scale systems," in Proc. ACM SIGCOMM.
[4] R. Krishnan, H. V. Madhyastha, S. Srinivasan, S. Jain, A. Krishnamurthy, T. Anderson, and J. Gao, "Moving beyond end-to-end path information to optimize CDN performance," in Proc. IMC.
[5] P. Wendell, J. W. Jiang, M. J. Freedman, and J. Rexford, "DONAR: Decentralized server selection for cloud services," in Proc. ACM SIGCOMM.
[6] D. K. Goldenberg, L. Qiu, H. Xie, Y. R. Yang, and Y. Zhang, "Optimizing cost and performance for multihoming," in Proc. ACM SIGCOMM.
[7] V. Valancius, B. Ravi, N. Feamster, and A. C. Snoeren, "Quantifying the benefits of joint content and network routing," in Proc. ACM SIGMETRICS.
[8] Amazon Route 53.
[9] H. Xu and B. Li, "Joint request mapping and response routing for geo-distributed cloud services," in Proc. IEEE INFOCOM.
[10] M. J. Freedman, E. Freudenthal, and D. Mazières, "Democratizing content publication with Coral," in Proc. NSDI.
[11] Y. Zhu, B. Helsley, J. Rexford, A. Siganporia, and S. Srinivasan, "LatLong: Diagnosing wide-area latency changes for CDNs," IEEE Transactions on Network and Service Management, vol. 9.
[12] A. Akella, B. Maggs, S. Seshan, A. Shaikh, and R. Sitaraman, "A measurement-based analysis of multihoming," in Proc. ACM SIGCOMM.
[13] N. Cardwell, S. Savage, and T. Anderson, "Modeling TCP latency," in Proc. IEEE INFOCOM.
[14] H. V. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, and A. Venkataramani, "iPlane: An information plane for distributed services," in Proc. OSDI, Nov.
[15] M. Szymaniak, D. Presotto, G. Pierre, and M. van Steen, "Practical large-scale latency estimation," Computer Networks, vol. 52, May.
[16] J. W. Jiang, R. Zhang-Shen, J. Rexford, and M. Chiang, "Cooperative content distribution and traffic engineering in an ISP network," in Proc. ACM SIGMETRICS.
[17] B. Fortz, J. Rexford, and M. Thorup, "Traffic engineering with traditional IP routing protocols," IEEE Communications Magazine.
[18] D. DiPalantino and R. Johari, "Traffic engineering vs. content distribution: A game theoretic perspective," in Proc. IEEE INFOCOM.
[19] Amazon EC2 Pricing.
[20] H. Xie, Y. R. Yang, A. Krishnamurthy, Y. G. Liu, and A. Silberschatz, "P4P: Provider portal for applications," in SIGCOMM.
[21] IETF ALTO.
[22] I. Poese, B. Frank, G. Smaragdakis, S. Uhlig, A. Feldmann, and B. Maggs, "Enabling content-aware traffic engineering," ACM SIGCOMM Computer Communication Review.
[23] A. Sharma, A. Venkataramani, and R. K. Sitaraman, "Distributing content simplifies ISP traffic engineering," in ACM SIGMETRICS.
[24] B. Frank, I. Poese, Y. Lin, G. Smaragdakis, A. Feldmann, B. Maggs, J. Rake, S. Uhlig, and R. Weber, "Pushing CDN-ISP collaboration to the limit," SIGCOMM CCR, July.
[25] S. Agarwal, J. Dunagan, N. Jain, S. Saroiu, A. Wolman, and H. Bhogan, "Volley: Automated data placement for geo-distributed cloud services," in Proc. NSDI.
[26] I. Poese, B. Frank, B. Ager, G. Smaragdakis, and A. Feldmann, "Improving content delivery using provider-aided distance information," in IMC.

APPENDIX A
A CENTRALIZED APPROACH TO THE JOINT PROBLEM

We present an equivalent convex formulation of the joint mapping and routing optimization problem that admits an efficient centralized solution. GLOBAL is a non-convex optimization problem on variables $(\alpha_{ci}, \beta_{cij})$, and hence cannot be solved using standard convex programming techniques. It is not clear whether local optima exist and how bad they are relative to the global optimum. However, we show that (1) can be translated into an equivalent convex optimization problem by introducing a new set of variables $X_{cij}$, which represents the fraction of client $c$'s traffic that is mapped to data center $i$ through link $j$.
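As a minimal sketch of how the joint variables relate to the per-system decisions, the snippet below splits a joint assignment $X_{cij}$ into a mapping fraction per (client, data center) pair and a routing split per link, so that their product recovers $X_{cij}$. The function name `split_joint_solution`, the dictionary layout, and the client, data center, and link labels are illustrative assumptions rather than anything specified in the paper; the uniform split used for data centers that receive no traffic from a client is one arbitrary but valid choice.

```python
from collections import defaultdict


def split_joint_solution(X, links_of):
    """Split joint fractions X[(c, i, j)] into mapping and routing decisions.

    X        : dict {(client, dc, link): fraction of the client's traffic}
    links_of : dict {dc: [links of that data center]}  (the sets J_i)
    Returns (alpha, beta) with alpha[(c, i)] and beta[(c, i, j)].
    """
    # Mapping decision: total fraction of client c sent to data center i.
    alpha = defaultdict(float)
    for (c, i, _j), x in X.items():
        alpha[(c, i)] += x

    # Routing decision: share of that traffic placed on each link of i.
    beta = {}
    clients = {c for (c, _i, _j) in X}
    for c in clients:
        for i, links in links_of.items():
            a = alpha[(c, i)]
            for j in links:
                if a > 0:
                    beta[(c, i, j)] = X.get((c, i, j), 0.0) / a
                else:
                    # No traffic mapped here: any valid split works (assumption).
                    beta[(c, i, j)] = 1.0 / len(links)
    return dict(alpha), beta


if __name__ == "__main__":
    # Hypothetical toy instance: one client, three data centers.
    links = {"dc1": ["l1", "l2"], "dc2": ["l3"], "dc3": ["l4", "l5"]}
    X = {("c1", "dc1", "l1"): 0.3,
         ("c1", "dc1", "l2"): 0.1,
         ("c1", "dc2", "l3"): 0.6}
    alpha, beta = split_joint_solution(X, links)
    for key in sorted(beta):
        print(key, round(beta[key], 3))
```

Each routing split sums to one per (client, data center) pair, which is what lets an egress router apply it without knowing the mapping decision.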
There is a direct connection between the two sets of variables:

$$X_{cij} = \alpha_{ci}\,\beta_{cij}, \quad \forall (c, i, j).$$

We can rewrite (1) as the following convex program, GLOBAL_X:

$$
\begin{aligned}
\text{minimize} \quad & \frac{\sum_{c}\sum_{i}\sum_{j \in J_i} X_{cij}\,\mathrm{vol}_c\,(\mathrm{price}_{ij} + K \cdot \mathrm{perf}_{cij})}{\sum_{c}\mathrm{vol}_c} && \text{(5a)} \\
\text{subject to} \quad & \left| \frac{\sum_{c,\, j \in J_i} \mathrm{vol}_c\, X_{cij}}{\sum_{c} \mathrm{vol}_c} - w_i \right| \le \varepsilon_i, \quad \forall i && \text{(5b)} \\
& \sum_{c,\, j \in J_i} \mathrm{vol}_c\, X_{cij} \le B_i, \quad \forall i && \text{(5c)} \\
& \sum_{c,\, i:\, j \in J_i} \mathrm{vol}_c\, X_{cij} \le \mathrm{cap}_j, \quad \forall j && \text{(5d)} \\
& \sum_{i}\sum_{j \in J_i} X_{cij} = 1, \quad \forall c && \text{(5e)} \\
\text{variables} \quad & X_{cij} \ge 0, \quad \forall (c, i, j)
\end{aligned}
$$

The size of the linear program is $O(|J| \cdot |C|)$. The feasible sets of GLOBAL and GLOBAL_X are related, and the two problems have the same optimal objectives.

Theorem 2: GLOBAL and GLOBAL_X are equivalent. They have the same optimal objective values. Further, their optimal solutions can be derived from each other.

Proof: First, the optimal value of GLOBAL_X is a lower bound on that of GLOBAL. Consider any feasible solution $\{\alpha_{ci}, \beta_{cij}\}$ of (1). We construct $X_{cij} = \alpha_{ci}\,\beta_{cij}$, which is also feasible for problem (5). It can be readily verified that the two solutions achieve the same objective value.

Second, the optimal value of GLOBAL is a lower bound on that of GLOBAL_X. Consider any feasible solution $\{X_{cij}\}$ of (5). We construct $\alpha_{ci} = \sum_{j \in J_i} X_{cij}$, and $\beta_{cij} = X_{cij}/\alpha_{ci}$ if $\alpha_{ci} > 0$ and $\beta_{cij} = 1/|J_i|$ otherwise. It can be verified that $\{\alpha_{ci}, \beta_{cij}\}$ is feasible for GLOBAL, and achieves the same objective value as GLOBAL_X. We establish a one-to-one mapping between the variables of (1) and (5), and hence they obtain the same optimal objectives and solutions.

The above theorem shows a simple approach to solving the joint problem by deploying a centralized coordinator. After collecting all necessary information such as $\{\mathrm{perf}_{cij}, \mathrm{price}_{ij}, \mathrm{vol}_c\}$, we rewrite (1) into (5), which can be solved using standard linear programming techniques. We then reconstruct $\{\alpha_{ci}, \beta_{cij}\}$ from $\{X_{cij}\}$ using the above methods and distribute them to all mapping nodes and egress routers.

APPENDIX B
OPTIMALITY OF DISTRIBUTED MAPPING AND ROUTING

A. Proof of Lemma 1

Proof: To show that $\beta$ is an optimal solution to the routing problem (4), we check the KKT conditions for (4).
These conditions hold if and only if

$$\alpha_{ci}\, f_{cij} \le \alpha_{ci}\, f_{cij'}, \quad \forall j, j' \in J_i \text{ and } \beta_{cij} > 0.$$

This is true because when $\alpha_{ci} > 0$, the definition of optimal projection implies the above condition. When $\alpha_{ci} = 0$, the optimality condition holds too.

B. Existence of Nash Equilibrium

We follow the definitions and proofs of convergence and optimality in [18]. However, we cannot directly apply their results because of the presence of the data center load-balancing constraints (3b) and (3c).

Theorem 3: There exists a Nash equilibrium.

Proof: We show this by construction. Let $X^*$ be an optimal solution to GLOBAL_X with the refined objective function $obj_g$ and with the capacity constraint (1b) removed. We construct a set of mapping and routing decisions as follows: $\alpha^*_{ci} = \sum_{j \in J_i} X^*_{cij}$, and $\beta^*_{cij} = X^*_{cij}/\alpha^*_{ci}$ if $\alpha^*_{ci} > 0$. If $\alpha^*_{ci} = 0$, we set $\beta^*_{cij'} = 1$ for some $j' = \arg\min_{j \in J_i} f_{cij}$, and set $\beta^*_{cij} = 0$ for $j \in J_i \setminus \{j'\}$. By Theorem 2, it is easy to check that $(\alpha^*, \beta^*)$ is also an optimal solution to GLOBAL. Therefore, $\alpha^*$ must be an optimal solution to MAPPING($\beta^*$), as otherwise we can find a better $\alpha$ to improve the global objective function. By definition, $\alpha^*$ is an optimal projection given $\beta^*$. Similarly, $\beta^*$ is an optimal solution to ROUTING($\alpha^*$). The KKT conditions for (4) imply that the definition of optimal projection holds when $\alpha^*_{ci} > 0$. When $\alpha^*_{ci} = 0$, our choice of $\beta^*_{cij}$ strictly follows the optimal projection definition. Combining the two cases shows that $\beta^*$ is an optimal projection given $\alpha^*$. Therefore, $(\alpha^*, \beta^*)$ is a Nash equilibrium.

C. Optimality of Nash Equilibria

Theorem 4: Let $(\alpha^*, \beta^*)$ be a Nash equilibrium. Then $\{\alpha^*, \beta^*\}$ is an optimal solution to GLOBAL.

Proof: Construct a set of variables $X^*_{cij} = \alpha^*_{ci}\,\beta^*_{cij}$ for all $(c, i, j)$. We show that $\{X^*_{cij}\}$ is an optimal solution to GLOBAL_X, given that $\{\alpha^*_{ci}, \beta^*_{cij}\}$ is a Nash equilibrium. It is easy to check that $\{X^*_{cij}\}$ are feasible for GLOBAL_X. Then, it suffices to show that the KKT conditions for GLOBAL_X hold with our choice of $\{X^*_{cij}\}$ and appropriate choices of the other parameters, e.g., Lagrange multipliers, involved in the optimality conditions (which we discuss shortly).
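In a slight abuse of notation below, we write

$$f_{cij}(X) = \frac{\mathrm{vol}_c\,(\mathrm{price}_{ij} + K \cdot \mathrm{perf}_{cij})}{\sum_{c} \mathrm{vol}_c} + \Phi'_j\Big(\sum_{c,\, i:\, j \in J_i} X_{cij}\,\mathrm{vol}_c\Big)\,\frac{\mathrm{vol}_c}{|J|}.$$

Here we consider the mapping constraint (3b) only; (3c) should follow similarly. We write down the KKT conditions for GLOBAL_X. There exist dual optimal variables $\{\lambda_{i1} \ge 0, \lambda_{i2} \ge 0\}_i$ such that

$$
\begin{aligned}
& f_{cij} + (\lambda_{i1} - \lambda_{i2})\, g_c \le f_{ci'j'} + (\lambda_{i'1} - \lambda_{i'2})\, g_c, \quad \text{if } X_{cij} > 0, \\
& \lambda_{i1} \Big( \sum_{c,\, j \in J_i} X_{cij}\, g_c - w_i - \varepsilon_i \Big) = 0, \quad \forall i, \\
& \lambda_{i2} \Big( \sum_{c,\, j \in J_i} X_{cij}\, g_c - w_i + \varepsilon_i \Big) = 0, \quad \forall i,
\end{aligned}
\tag{6}
$$

where $g_c = \mathrm{vol}_c / \sum_{c} \mathrm{vol}_c$. The first condition originates from the stationarity requirement, and can be interpreted as follows: when a link $j \in J_i$ (link $j$ of data center $i$) is used to reach client $c$, i.e., $X_{cij} > 0$, its associated marginal cost must be no greater than the marginal cost of any other link $j' \in J$. Note that the marginal cost conditions are similar to those in [18], but slightly more complicated due to the presence of dual variables and constraints. The second and third optimality conditions are complementary slackness for the dual variables and the corresponding constraints.

Next we consider a Nash equilibrium $(\alpha^*, \beta^*)$. We can write down the KKT optimality conditions for MAPPING and ROUTING, respectively, because the definition of optimal projection implies optimality for the two optimization problems. Following the same notation, we have:

(i) KKT optimality conditions for MAPPING: there exist dual optimal variables $\{\mu_{i1} \ge 0, \mu_{i2} \ge 0\}_i$ such that

$$
\begin{aligned}
& \sum_{j \in J_i} \beta^*_{cij}\, f_{cij} + (\mu_{i1} - \mu_{i2})\, g_c \le \sum_{j' \in J_{i'}} \beta^*_{ci'j'}\, f_{ci'j'} + (\mu_{i'1} - \mu_{i'2})\, g_c, \quad \text{if } \alpha^*_{ci} > 0, \\
& \mu_{i1} \Big( \sum_{c} \alpha^*_{ci}\, g_c - w_i - \varepsilon_i \Big) = 0, \quad \forall i, \\
& \mu_{i2} \Big( \sum_{c} \alpha^*_{ci}\, g_c - w_i + \varepsilon_i \Big) = 0, \quad \forall i.
\end{aligned}
\tag{7}
$$

Similarly, the first condition is the stationarity requirement, and the second and third conditions are the complementary slackness for the dual variables.

(ii) KKT optimality conditions for ROUTING:

$$f_{cij} \le f_{cij'} \quad \text{for } j, j' \in J_i \text{ and } \beta^*_{cij} > 0, \quad \forall (i, c) \text{ where } \alpha^*_{ci} > 0. \tag{8}$$

It follows that the values of $f_{cij}$ are equal for all $j$ such that $\beta^*_{cij} > 0$ and $\alpha^*_{ci} > 0$.

By our construction, $X^*_{cij} = \alpha^*_{ci}\,\beta^*_{cij}$. We next show that $\{X^*_{cij}\}$ satisfy the optimality conditions (6), given the KKT conditions (7)-(8) established for $\{\alpha^*_{ci}, \beta^*_{cij}\}$.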
Without loss of generality, consider $X^*_{cij} > 0$ for some $(c, i, j)$ with $j \in J_i$, so that $\alpha^*_{ci} > 0$ and $\beta^*_{cij} > 0$. From the ROUTING optimality conditions (8), we know that $f_{cij'}$ takes the same value for all $j' \in J_i$ with $\beta^*_{cij'} > 0$. By our choice of $\beta^*_{cij} > 0$, we have

$$\sum_{j' \in J_i} \beta^*_{cij'}\, f_{cij'} = f_{cij} \sum_{j' \in J_i} \beta^*_{cij'} = f_{cij}. \tag{9}$$

Consider any link $j'$ in data center $i'$, i.e., $j' \in J_{i'}$. There exists at least one link $j'' \in J_{i'}$ such that $\beta^*_{ci'j''} > 0$. Applying the optimality conditions (8) on $(c, i', j'')$, we have

$$f_{ci'j''} \le f_{ci'j'}. \tag{10}$$

Further, by an argument similar to (9), we have

$$f_{ci'j''} = \sum_{j \in J_{i'}} \beta^*_{ci'j}\, f_{ci'j}. \tag{11}$$

To show that (6) holds for $\{X^*_{cij}\}$, it suffices to find a set of parameters $\{\lambda_{i1} \ge 0, \lambda_{i2} \ge 0\}$ for all $i$ such that the equalities and inequalities in (6) hold. We claim that the dual optimal variables $\{\mu_{i1}, \mu_{i2}\}$ in (7) can be used as a choice of $\{\lambda_{i1}, \lambda_{i2}\}$. We then show that the KKT optimality conditions (6) are satisfied with such choices. Consider $X^*_{cij} > 0$ for some $(c, i, j)$ with $j \in J_i$. We write down its associated marginal cost $f_{cij}$, and compare it against that of any other link $j' \in J_{i'}$ for the same client $c$. We have

$$
\begin{aligned}
f_{cij} + (\lambda_{i1} - \lambda_{i2})\, g_c
&= f_{cij} + (\mu_{i1} - \mu_{i2})\, g_c && \text{(choice of dual variables)} \\
&= \sum_{j \in J_i} \beta^*_{cij}\, f_{cij} + (\mu_{i1} - \mu_{i2})\, g_c && \text{by (9)} \\
&\le \sum_{j \in J_{i'}} \beta^*_{ci'j}\, f_{ci'j} + (\mu_{i'1} - \mu_{i'2})\, g_c && \text{by (7)} \\
&= f_{ci'j''} + (\mu_{i'1} - \mu_{i'2})\, g_c && \text{by (11)} \\
&\le f_{ci'j'} + (\mu_{i'1} - \mu_{i'2})\, g_c && \text{by (10)} \\
&= f_{ci'j'} + (\lambda_{i'1} - \lambda_{i'2})\, g_c && \text{(choice of dual variables)}
\end{aligned}
$$

So we arrive at the conclusion that $\{X^*_{cij}\}$ are optimal solutions to GLOBAL_X. Therefore, $\{\alpha^*_{ci}, \beta^*_{cij}\}$ are also optimal solutions to GLOBAL, as they attain the same optimal objective values.

D. Convergence to Nash Equilibrium

Theorem 5: Alternate projection of MAPPING and ROUTING converges to a Nash equilibrium.

Proof: Consider the refined function $obj_g$ (see footnote 2), which is the common objective for both the MAPPING and ROUTING problems. By the definition of optimal projections, the objective value of $obj_g$ is a decreasing sequence in the alternate projection algorithm. In addition, the objective value is lower-bounded by the optimal value of GLOBAL_X. Therefore, there exists a limit point of the sequence. Similarly, we can argue that the $(\alpha, \beta)$ sequence also has a limit point, since the feasible space of $\{\alpha, \beta\}$ is compact and continuous. It is not difficult to check that the limit point must be a Nash equilibrium, as otherwise we can find another feasible solution within an $\varepsilon$-ball of the limit point that gives a lower objective value than the limit of $obj_g$, which contradicts our assumption at the beginning.

Footnote 2: In our evaluations, we scale this function such that 100% utilization gives a 250 ms queuing latency, corresponding to some standard router buffer sizes.
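The convergence argument for alternate projection (each step exactly minimizes a shared objective over one block of variables, so the objective never increases and is bounded below) can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's MAPPING and ROUTING problems: the quadratic `objective`, the closed-form best responses `best_a` and `best_b`, and the stopping tolerance are all hypothetical choices made so that the example is self-contained and unconstrained.

```python
def objective(a, b):
    # Toy strictly convex objective shared by both "players" (assumption).
    return (a - 1.0) ** 2 + (b - 2.0) ** 2 + 0.5 * a * b


def best_a(b):
    # Exact minimizer over a for fixed b: 2(a - 1) + 0.5 b = 0.
    return 1.0 - 0.25 * b


def best_b(a):
    # Exact minimizer over b for fixed a: 2(b - 2) + 0.5 a = 0.
    return 2.0 - 0.25 * a


def alternate(a, b, tol=1e-12, max_iter=100):
    """Alternate exact best responses; return final point and objective history."""
    history = [objective(a, b)]
    for _ in range(max_iter):
        a = best_a(b)   # first block responds to the second
        b = best_b(a)   # second block responds to the first
        history.append(objective(a, b))
        if history[-2] - history[-1] < tol:
            break       # objective has stopped decreasing
    return a, b, history


if __name__ == "__main__":
    a, b, history = alternate(0.0, 0.0)
    print("iterations:", len(history) - 1)
    print("fixed point:", a, b)
```

At the fixed point neither block can improve on its own, which is the toy analogue of the Nash equilibrium in Theorem 5; for this convex objective the fixed point is also the global minimizer.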
APPENDIX C
PERFORMANCE PENALTY FUNCTION

ISP traffic engineering models link congestion cost with a convex increasing function $\Phi(r_j)$ of the link traffic load $r_j$. The exact shape of the function is not important, and we use the same piecewise linear cost function as in [17], given below:

$$
\Phi_j(r_j, \mathrm{cap}_j) =
\begin{cases}
r_j & 0 \le r_j/\mathrm{cap}_j < 1/3 \\
3 r_j - \tfrac{2}{3}\,\mathrm{cap}_j & 1/3 \le r_j/\mathrm{cap}_j < 2/3 \\
10 r_j - \tfrac{16}{3}\,\mathrm{cap}_j & 2/3 \le r_j/\mathrm{cap}_j < 9/10 \\
70 r_j - \tfrac{178}{3}\,\mathrm{cap}_j & 9/10 \le r_j/\mathrm{cap}_j < 1 \\
500 r_j - \tfrac{1468}{3}\,\mathrm{cap}_j & 1 \le r_j/\mathrm{cap}_j < 11/10 \\
5000 r_j - \tfrac{16318}{3}\,\mathrm{cap}_j & 11/10 \le r_j/\mathrm{cap}_j < \infty
\end{cases}
$$

APPENDIX D
EXPERIMENT SETUP

We evaluate the benefits of the joint mapping-routing approach using trace-based simulations for CoralCDN [10], a caching and content distribution platform running on top of PlanetLab. PlanetLab is a service deployment platform which spans more than 800 sites geographically distributed in six continents. We correlate the traffic data with latency measurements from iPlane [14] (which collects various wide-area network statistics to destinations on the Internet from a few hundred vantage points hosted on PlanetLab nodes). All data corresponds to March 31st.

Emulating data centers by clustering Coral nodes: Most PlanetLab sites are single-homed, i.e., have just one ISP connecting them to the Internet. Therefore, it is challenging to demonstrate the benefits of our scheme on Coral, because the only decision that controls the latency and cost of user requests is the mapping decision. Instead, we adopt the approach followed in [12], where we treat multiple PlanetLab sites in the same general metropolitan area as different egress links of a single data center located in that area. The density of PlanetLab sites around certain cities makes such clustering of sites into data centers possible. We handpicked 12 data centers located in North America, Europe, Asia and South America, with a minimum of three Coral nodes in each data center and a maximum of six. We show the locations of the data centers in Figure 7.

Fig. 7. Location of the 12 emulated data centers.

Routable prefixes as client aggregates: We aggregate users into routable Internet prefixes (from a RouteViews FIB dump), for two reasons. First, routing decisions at data center egress routers happen at this granularity, hence it is a natural choice. Second, the aggregation of IP prefixes in routing tables enables us to decide on reachability and latency information for a larger number of client IPs from the same set of initial latency measurements. The number of routable IP prefixes is large (350,000), but only a fraction of these send any traffic to Coral.

Traffic data from CoralCDN: We use a worldwide request trace from CoralCDN which lists a request timestamp, the PlanetLab site at which it was received, a client IP address, and the number of bytes involved in the transfer. Overall, this trace contains 27 million requests and over a terabyte of data.
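The piecewise linear penalty function of Appendix C can be sketched in a few lines. This is a minimal illustration under stated assumptions: the name `penalty`, the `SEGMENTS` table layout, and the use of exact rational arithmetic are choices made here for clarity, and the 250 ms scaling used in the evaluations is not applied.

```python
from fractions import Fraction as F

# Each entry: (upper bound on utilization r/cap, slope on r, coefficient on cap).
# Within a segment the penalty is slope * r - coeff * cap.
SEGMENTS = [
    (F(1, 3), 1, F(0)),
    (F(2, 3), 3, F(2, 3)),
    (F(9, 10), 10, F(16, 3)),
    (F(1), 70, F(178, 3)),
    (F(11, 10), 500, F(1468, 3)),
    (None, 5000, F(16318, 3)),   # None marks the unbounded last segment
]


def penalty(load, cap):
    """Congestion penalty for traffic `load` on a link of capacity `cap`."""
    load, cap = F(load), F(cap)
    utilization = load / cap
    for upper, slope, coeff in SEGMENTS:
        if upper is None or utilization < upper:
            return slope * load - coeff * cap


if __name__ == "__main__":
    # Penalty grows slowly at low utilization and steeply near and past capacity.
    for load in (1, 5, 9, 10, 12):
        print(load, float(penalty(load, 10)))
```

Exact fractions make it easy to confirm that adjacent segments agree at every breakpoint, so the function is continuous, and the increasing slopes make it convex.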

For our experiments, we aggregate the requests made by clients into their most specific IP prefix from the RouteViews dump, and average their request rates for each hour.

Latency measurements from iPlane: We extract round-trip latency (in milliseconds) from the iPlane logs, which contain traceroutes made to a large number of IP addresses from various vantage points at PlanetLab sites. We only use latency information from vantage points which are also egress links in the data centers we picked in Figure 7, which limits us to 49 vantage points. We assign the latency to each destination IP address to the corresponding most-specific IP prefix from the RouteViews dump, assuming that the latency is representative. If there is latency data for multiple IP addresses from the same prefix, we use their average value. The logs provide only one set of latency values for the entire day, and we assume that the paths are lightly loaded when the probes occur, hence treating these measured latencies as path propagation delays.

Picking a set of workable client prefixes: The set of destination IP prefixes for which latency data is available is not uniform across vantage points. Hence, we only retain data for destinations which are reachable from at least one vantage point belonging to each data center. This allows us to perform mapping of any client to any data center. Next, we intersect this set of destinations with those client prefixes which send traffic to CoralCDN at any time through the day. We are left with a set of prefixes which constitute about 39% of the total traffic trace by requests and 38% by bytes.

Capacity estimations: The capacities of links connecting PlanetLab sites to the Internet are not publicly available, and even if they were, PlanetLab machines are shared across multiple services, each possibly with bandwidth caps. Instead of attempting to determine capacity or bandwidth caps from ground truth (this information is not publicly available to our knowledge), we estimate them through traffic volumes in the trace.
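The latency aggregation described earlier in this appendix (assign each probed IP address to its most-specific routable prefix, then average per prefix) can be sketched with Python's standard `ipaddress` module. The function names, the sample addresses, and the linear scan over prefixes are illustrative assumptions; a real RouteViews table with hundreds of thousands of prefixes would call for a trie-based longest-prefix match instead.

```python
import ipaddress
from collections import defaultdict


def most_specific_prefix(ip, prefixes):
    """Return the longest prefix in `prefixes` containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [p for p in prefixes if addr in p]
    return max(matches, key=lambda p: p.prefixlen) if matches else None


def average_latency_by_prefix(samples, prefix_strings):
    """Average latency samples (ip, latency_ms) per most-specific prefix."""
    prefixes = [ipaddress.ip_network(p) for p in prefix_strings]
    buckets = defaultdict(list)
    for ip, latency_ms in samples:
        prefix = most_specific_prefix(ip, prefixes)
        if prefix is not None:          # drop IPs outside every known prefix
            buckets[str(prefix)].append(latency_ms)
    return {p: sum(v) / len(v) for p, v in buckets.items()}


if __name__ == "__main__":
    # Hypothetical routing table and probe results.
    table = ["10.0.0.0/8", "10.1.0.0/16"]
    probes = [("10.1.2.3", 40.0), ("10.1.9.9", 60.0),
              ("10.2.0.1", 100.0), ("192.0.2.1", 5.0)]
    print(average_latency_by_prefix(probes, table))
```

The same grouping applies to the request trace: replacing the latency value with a byte or request count and the average with an hourly sum gives per-prefix traffic volumes.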
The key assumption we make for setting capacities from observed traffic is that links are provisioned for a specific peak utilization. We determine the peak request rates received by Coral nodes which are part of the emulated data centers. We scale these numbers as follows: (1) we account for a specific peak utilization (80%), hence scaling by 100/80; (2) we account for the traffic received by all Coral nodes (including those not part of the emulated data centers) by scaling by the ratio of peak hourly traffic over all Coral nodes to the peak hourly traffic over the chosen Coral nodes.

Bandwidth cost estimation: We assume a simple charging model based on the total amount of data transferred over a given period of time. We employ a performance-based model where we assign a higher cost to a link which provides lower-latency paths to a higher proportion of traffic volume (as determined from the traffic and latency trace information). To quantify this, we first determine the number of requests serviced by each Coral node, assuming that clients are serviced by the latency-wise closest Coral node. We then use Table II (motivated by price values from Amazon EC2 bandwidth pricing [19]) to determine the charging price per GB of data sent over that link.

TABLE II. Performance-based pricing model. Columns: data center popularity (total request volume per day, $\times 10^5$); pricing ($ per GB). [Table values not recoverable from the source.]
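The two-step capacity scaling described above amounts to a single product, sketched below. The function name `estimate_capacity` and its parameter names are assumptions made for this illustration, and the numbers in the usage example are made up rather than taken from the trace.

```python
def estimate_capacity(peak_rate, peak_all_nodes, peak_chosen_nodes,
                      peak_utilization=0.80):
    """Estimate a link's capacity from its observed peak traffic.

    peak_rate         : peak traffic observed on this link (any rate unit)
    peak_all_nodes    : peak hourly traffic over all Coral nodes
    peak_chosen_nodes : peak hourly traffic over the emulated data centers only
    peak_utilization  : utilization the link is assumed to be provisioned for
    """
    # Step 1: the observed peak is assumed to be 80% of capacity (scale by 100/80).
    provisioned = peak_rate / peak_utilization
    # Step 2: the chosen nodes must absorb traffic that went to all nodes.
    return provisioned * (peak_all_nodes / peak_chosen_nodes)


if __name__ == "__main__":
    # Hypothetical link: peak of 80 units, chosen nodes saw half of all traffic.
    print(estimate_capacity(80.0, peak_all_nodes=300.0, peak_chosen_nodes=150.0))
```

Both factors are at least one whenever the chosen nodes are a subset of all nodes, so the estimate never falls below the observed peak.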
