Joint Server Selection and Routing for Geo-Replicated Services

Similar documents
DONAR Decentralized Server Selection for Cloud Services

CONTENT DISTRIBUTION. Oliver Michel University of Illinois at Urbana-Champaign. October 25th, 2011

Lecture 13: Traffic Engineering

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Algorithms and Game Theory Date: 12/3/15

Cooperative Content Distribution and Traffic Engineering

Motivation'for'CDNs. Measuring'Google'CDN. Is'the'CDN'Effective? Moving'Beyond'End-to-End'Path'Information'to' Optimize'CDN'Performance

Efficient Internet Routing with Independent Providers

context: massive systems

Internet Anycast: Performance, Problems and Potential

Network games. Brighten Godfrey CS 538 September slides by Brighten Godfrey unless otherwise noted

LatLong: Diagnosing Wide-Area Latency Changes for CDNs

SaaS Providers. ThousandEyes for. Summary

volley: automated data placement for geo-distributed cloud services

Multipath Transport, Resource Pooling, and implications for Routing

Cache Management for TelcoCDNs. Daphné Tuncer Department of Electronic & Electrical Engineering University College London (UK)

A Background: Ethernet

Distributed Systems. 21. Content Delivery Networks (CDN) Paul Krzyzanowski. Rutgers University. Fall 2018

CS November 2018

From Routing to Traffic Engineering

TCP WISE: One Initial Congestion Window Is Not Enough

BGP. Daniel Zappala. CS 460 Computer Networking Brigham Young University

The Content-Pipe Divide

Complex Interactions in Content Distribution Ecosystem and QoE

ThousandEyes for. Application Delivery White Paper

Authors: Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, Jie Gao

A strategy for IPv6 adoption

Quality-Assured Cloud Bandwidth Auto-Scaling for Video-on-Demand Applications

JOINT NETWORK AND CONTENT ROUTING

Games in Networks: the price of anarchy, stability and learning. Éva Tardos Cornell University

! Network bandwidth shared by all users! Given routing, how to allocate bandwidth. " efficiency " fairness " stability. !

A Survey on Research on the Application-Layer Traffic Optimization (ALTO) Problem

Multiple Virtual Network Function Service Chain Placement and Routing using Column Generation

Data Center TCP (DCTCP)

Gayathri V, Vasumathi N, L. Thangapalani

Democratizing Content Publication with Coral

15-744: Computer Networking. Overview. Queuing Disciplines. TCP & Routers. L-6 TCP & Routers

To Relay or Not to Relay for Inter-Cloud Transfers? Fan Lai, Mosharaf Chowdhury, Harsha Madhyastha

Shim6: Network Operator Concerns. Jason Schiller Senior Internet Network Engineer IP Core Infrastructure Engineering UUNET / MCI

Interdomain Routing Design for MobilityFirst

End-user mapping: Next-Generation Request Routing for Content Delivery

An overview on Internet Measurement Methodologies, Techniques and Tools

CS 268: Computer Networking

CS 268: Computer Networking. Adding New Functionality to the Internet

Carnegie Mellon Computer Science Department Spring 2016 Midterm Exam

CS November 2017

Trisul Network Analytics - Traffic Analyzer

How Bad is Selfish Routing?

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

John S. Otto Mario A. Sánchez John P. Rula Fabián E. Bustamante

Selfish Caching in Distributed Systems: A Game-Theoretic Analysis

Chapter 1. Introduction

The Keys to Monitoring Internal Web Applications

Analysing Latency and Link Utilization of Selfish Overlay Routing

Congestion and its control: Broadband access networks. David Clark MIT CFP October 2008

PCC: Performance-oriented Congestion Control

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE

On the Optimality of Link-State Routing Protocols

Missing Pieces of the Puzzle

Pricing Intra-Datacenter Networks with

Democratizing Content Publication with Coral

Using Game Theory To Solve Network Security. A brief survey by Willie Cohen

Lecture 15: Datacenter TCP"

CS 43: Computer Networks. 24: Internet Routing November 19, 2018

The Case for Informed Transport Protocols

TAIL LATENCY AND PERFORMANCE AT SCALE

Inter-AS routing. Computer Networking: A Top Down Approach 6 th edition Jim Kurose, Keith Ross Addison-Wesley

On low-latency-capable topologies, and their impact on the design of intra-domain routing

CSE 124: TAIL LATENCY AND PERFORMANCE AT SCALE. George Porter November 27, 2017

SOFTWARE DEFINED NETWORKS. Jonathan Chu Muhammad Salman Malik

When Milliseconds Matter. The Definitive Buying Guide to Network Services for Healthcare Organizations

DESIGNING a link-state routing protocol has three

List of measurements in rural area

CS 43: Computer Networks Internet Routing. Kevin Webb Swarthmore College November 16, 2017

Capacity planning and.

Data Center TCP (DCTCP)

Reverse Traceroute. NSDI, April 2010 This work partially supported by Cisco, Google, NSF

New levels of cooperation between eyeball ISPs and OTT/CDNs. RIPE 75 Dubai Oct 24, 2017 Falk von Bornstaedt, DTAG ICSS

Lecture 14: Congestion Control"

Some economical principles

Congestion Control In the Network

11/13/2018 CACHING, CONTENT-DISTRIBUTION NETWORKS, AND OVERLAY NETWORKS ATTRIBUTION

Application of SDN: Load Balancing & Traffic Engineering

Why Highly Distributed Computing Matters. Tom Leighton, Chief Scientist Mike Afergan, Chief Technology Officer J.D. Sherman, Chief Financial Officer

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Topic 6: SDN in practice: Microsoft's SWAN. Student: Miladinovic Djordje Date:

CS644 Advanced Networks

Congestion Control. Andreas Pitsillides University of Cyprus. Congestion control problem

Question. Reliable Transport: The Prequel. Don t parse my words too carefully. Don t be intimidated. Decisions and Their Principles.

RAPTOR: Routing Attacks on Privacy in Tor. Yixin Sun. Princeton University. Acknowledgment for Slides. Joint work with

Introduction to Dynamic Traffic Assignment

On Selfish Routing in Internet-Like Environments

Measuring IPv6 Adoption

Implementing stable TCP variants

Content Distribution. Today. l Challenges of content delivery l Content distribution networks l CDN through an example

Overlay Networks. Behnam Momeni Computer Engineering Department Sharif University of Technology

IPv6 at Google. Lorenzo Colitti

Capacity planning and.

COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBRID CLOUDS

PEERING: An AS for Us

ClickHouse Deep Dive. Aleksei Milovidov

Transcription:

Joint Server Selection and Routing for Geo-Replicated Services Srinivas Narayana Joe Wenjie Jiang, Jennifer Rexford and Mung Chiang Princeton University 1

Large-scale online services Search, shopping, social networking, Typical architecture: 2

Large-scale online services Typical architecture: Geo-distributed data centers Multiple ISP connectivity to the Internet 3

Large-scale online services Typical architecture: Multiple geo-distributed DCs Each DC has multiple ISP connections Diversity for performance & fault tolerance Wide-area traffic management Crucial to service performance and costs 4

Optimizing wide-area traffic matters Latency: page speed affects bottom line revenue [1] Wide-area round trip times 100ms Perceived delay: several RTTs Costs: server power, transit bandwidth costs Tens of millions of $$ annually [2,3] Function of (load at DCs, ISP links) over time [1] Steve Souders, Velocity and the bottom line [2] Zhang et al., Entact, (NSDI 10) [3] Qureshi et al., Cutting the electric bill for Internet-scale systems (sigcomm 09) 5

Controlling wide-area paths Data center 1 Data center 2 Data center 3 Requests can be served out of any DC Send request to DC 2 Route response through peer 1 1 2 3 Who is service.com? Map request to data center 2 Mapping node e.g., DNS resolver 1. Pick a DC Mapping 2. Traverse path from client à DC 3. Pick a path from DC à client Choice of ISP: Routing 6

Today: Independent mapping & routing Systems managed by different teams [1, 2] even in the same organization Different goals e.g., mapping: geographic locality e.g., routing: latency as well as bandwidth costs But, they need to cooperate [1] Valancius et al., PECAN (Sigmetrics 13) [2] Zhu et al., LatLong (IEEE transactions on network & service management 12) 7

Lack of coordination leads to poor performance and high costs! 8

Why does coordination matter? Incompatible goals between mapping & routing: Mapping à latency; routing also cares about cost Result: neither low latency nor low cost Microsoft: operating point far from pareto-optimal [1] Incomplete visibility into routing system info: e.g., latency of path chosen by routing, link loads Google: prefixes with tens of ms extra latency [2] Reason: mapping to nodes with poor routing Visibility and coordination can improve state of affairs... [1] Zhang et al., Entact (NSDI 10) [2] Krishnan et al., WhyHigh (IMC 09) 9

Use centralized control? But centralized control is problematic Lack of modularity Different teams manage mapping & routing [1, 2] System management possibly outsourced e.g., mapping by Amazon s Route 53 [1] Valancius et al., PECAN (Sigmetrics 13) [2] Krishnan et al., WhyHigh (IMC 09) 10

Central question of this paper: How to coordinate decision-making between mapping and routing? while retaining administrative separation? 11

Contributions of this work Challenges in coordinating mapping & routing decisions Distributed algorithm that addresses these problems In particular, What local computations? What information exchanges? What local measurements? Language: Optimization A distributed algorithm that provably converges to the system global optimum 12

Optimization model Bandwidth price Routing decision per client at each DC: Traffic splits across links Mapping decision per client: Load balancing fractions across DCs Link capacity Path latency Mapping policies: Total DC load fractions, load caps per DC Clients: routable IP prefixes Decision variables Optimized periodically, e.g., hourly/daily Constraints 13

Optimization objectives Performance: Average request round-trip latency Cost: Average $$ per GB Final objective: average cost + K * average latency K: additional cost per GB per ms improvement See paper for full details 14

Joint optimization formulation Minimize cost + K * perf Subject to: Link capacity constraints DC load balancing / capacity constraints Non-negative request traffic, all request traffic reaches a DC and an ISP egress link Variables: Mapping and routing decisions 15

Simple distributed algorithm Separate into mapping & routing subproblems Optimize over only one set of variables Alternate mapping & routing optimizations Hope: iterate until equilibrium and reach globally optimal decision? Routing Min: cost + K * perf Subject to: Variables: Min: cost + K * perf Subject to: Variables: Mapping 16

Naïve distributed mapping and routing can lead to suboptimal equilibria! 17

Causes for suboptimal equilibria #1. Bad constraints in the model #2. Bad routing under no traffic In examples, assume that mapping & routing: Both optimize for average latency Both have visibility into necessary information Each other s decisions, link capacities, client s traffic load, From some starting configuration è mapping, routing: each optimal given the other but globally suboptimal 18

Problem #1. Bad constraints Mapping decides based on paths selected by routing Selected by routing Alternate route DC 1 ¼ Gb/s DC 2 1Gb/s 50ms 1Gb/s 75ms ¾ Gb/s 150ms 60ms 75% 25% 1Gb/s traffic 19

Visualizing suboptimal equilibria Mapping decisions: Routing decisions: DC 1 DC 2 ¼ Gb/s 1Gb/s 50ms 75ms 2 1Gb/s 1-1 ¾ Gb/s 150ms 60ms 1 1-1 - 2 β 1 = β 2 = β Optimal solution: α = 1; β =1/4 1Gb/s traffic 20

Locally optimal boundary points Infeasible space Objective value High Routing (β) Objective values only increase! Low Mapping (α) 21

Solution: remove local optima on boundary! Intuition: remove constraint boundary to get rid of local optima Problem: link load can exceed bandwidth! Penalty term: approximation of queuing delay (M/M/1) Add average queuing delay to objective function Remove capacity constraints 22

Convergence to optimum Global optimum 23

Causes for suboptimal equilibria #1. Bad constraints in the model #2. Bad routing under no traffic 24

Problem #2: Bad routing under no traffic conditions Selected by routing Alternate route DC 1 DC 2 1Gb/s 1Gb/s 75ms 100ms 100ms 1Gb/s 50ms 0.5Gb/s traffic 25

Solution: algorithmically avoid bad states! When no traffic from a client at a DC: Route client so that infinitesimal traffic behaves correctly [DiPalatino et al., Infocom 09] Mapping can explore better paths from unused DCs Extension: linear map constraints (e.g., load balancing) Insight: mapping constraints don t cause suboptimality (only capacity constraints do) Routing doesn t need to know mapping constraints 26

Bad routing states removed DC 1 DC 2 1Gb/s 75ms 100ms 100ms 1Gb/s 50ms 0.5Gb/s traffic Assume infinitesimal traffic arrives at DC 2 è Routing changes from 100ms to 50ms link! 27

Remove local optima on constraint boundaries by using penalties in the objective function instead Avoid bad routing states algorithmically by routing infinitesimal traffic correctly Provably sufficient for convergence to a globally optimal solution! (See paper for proof and distributed algorithm) 28

Evaluation 29

Trace-based evaluations Coral CDN day trace: 12 data centers, 50 ISP links, 24500 clients Over 27 million requests, terabyte of data traffic Benefits of information visibility: ~ 10ms lower average latency throughout the day 3-55% lower costs Optimality & convergence of distributed solution: Get to within 1% of global optimum in 3-6 iterations! Find more results in paper: Runtime, impact of penalty, choice of K 30

Summary Coordinating mapping and routing is beneficial over independent decision-making Both in cost and performance It s possible to be optimal with a modular design Proposed solution achieves optimality and converges quickly in practice 31

Thanks! narayana@cs.princeton.edu 32

Related work P2P apps with hints from the network: P4P, ALTO, ISP-CDN collaboration: Using ISP info for CDN mapping: PaDIS Shaping demand: CaTE, Sharma et al. (sigmetrics 13), Game theory: DiPalatino et al. (infocom 09), Jiang et al.(sigmetrics 09), Online service joint TE: Entact, Pecan, Xu et al. (infocom 13) 33

Distributed protocol Min: cost + K*perf Subject to: Variables: Routing decisions Latencies Min: cost + K*perf Subject to: Variables: Latencies Mapping decisions Client traffic volumes Traffic volumes 34

Why optimization? Encode tradeoffs: cost, performance Encode constraints: DCs and DC-ISP link capacities Even small % improvements can be significant 1% of 10 million users == 100,000 users 5% annual savings == millions of $$ Naturally derive distributed algorithms [1] Theoretical guarantees on stability [1] Chiang et al., Layering as optimization decomposition (IEEE transactions) 35

Incompatible objectives Mapping System: Optimize latency Routing System: Optimize cost Selected by routing Alternate route Mapping decisions based on paths exposed by routing (orange paths) DC 1 DC 2 $10/GB $1/GB 90ms $5/GB 100ms $12/GB 40ms 250ms Neither optimal cost nor optimal latency 36

Incomplete visibility Mapping and Routing systems -- Both optimize latency Selected by routing Alternate route Overload + DC 1 Packet drop DC 2 1Gb/s 10Gb/s 45ms 50ms 100ms 75ms 5Gb/s traffic 37

Optimization formulation (in words) Minimize latencies and costs cost + K * perf Subject to: Link capacity constraints DC load balancing / capacity constraints Non-negative request traffic, all request traffic reaches a DC and an ISP egress link Variables: Mapping and routing decisions 38

Link load penalty function 39

How good is the joint solution? How does the distribution algorithm perform? 40

Trace-based evaluations Traffic: Coral content distribution network Latency: iplane Capacities and pricing: estimations 12 data centers, 50 ISPs, 24500 clients Over 27 million requests, terabyte of data exchanged 41

Benefits of information visibility Evaluate performance and cost benefits separately Baseline mapping: no visibility into routing decisions but respects total data center capacities Performance: Baseline mapping: optimize average latency Assume per-client least latency path is used Cost: Baseline mapping: optimize average cost Assume average per-link cost is applicable 42

Benefits of information visibility 10ms smaller average latency through the day 3-55% smaller average cost across the day 43

Optimality and convergence Fast convergence Get within 1% of global optimum in 3-6 iterations Each iteration takes ~ 30s to run Acceptable for hourly or daily re-optimization time More results in paper Impact of penalty function Choice of tradeoff parameter K 44

Proof overview Existence: show existence of Nash equilibrium Optimality: Show that KKT optimality conditions of mapping and routing problem imply those of joint problem Key insight: find dual variables that satisfy the existence condition Convergence: standard techniques 45

Use centralized control? Performance goals Cost goals Path performance metrics DC-ISP link capacities Client traffic volumes DC load balancing goals Controller logic Mapping decisions Routing decisions 46