Spotlight: Scalable Transport Layer Load Balancing for Data Center Networks


Ashkan Aghdai, Cing-Yu Chu, Yang Xu, David H. Dai, Jun Xu, H. Jonathan Chao
New York University; Huawei

arXiv preprint, v1 [cs.NI], 21 Jun 2018

Abstract: Load balancing plays a vital role in modern data centers to distribute traffic among instances of network functions or services. State-of-the-art load balancers such as Silkroad dispatch traffic obliviously, without considering the real-time utilization of service instances, and can therefore lead to uneven load distribution and suboptimal performance. In this paper, we design and implement Spotlight, a scalable and distributed load balancing architecture that maintains connection-to-instance mapping consistency at the edge of data center networks. Spotlight uses a new stateful flow dispatcher which periodically polls instances' load and dispatches incoming connections to instances in proportion to their available capacity. Our design utilizes a distributed control plane and in-band flow dispatching and thus scales horizontally in data center networks. Through extensive flow-level simulation and packet-level experiments on a testbed, we demonstrate that, compared to existing methods, Spotlight distributes the traffic more efficiently and has near-optimum performance in terms of overall service utilization. Moreover, Spotlight is not sensitive to the utilization polling interval and therefore can be implemented with a low polling frequency to reduce the amount of control traffic. Indeed, Spotlight achieves the mentioned performance improvements using an O(100 ms) polling interval.

Index Terms: software defined networks, scalable load balancing, transport layer, network function virtualization.

I. INTRODUCTION

In a typical data center, a large number of services and network functions coexist. On average, 44% of data center traffic passes through at least one service [1]. To keep up with the ever-growing demand from users, services and functions scale out to a large number of service instances. L4 load balancers play a major role in data center networks by choosing the serving instance of a service for incoming connections.

Services and network functions perform stateful operations on connections. Consider Intrusion Detection Systems (IDS) as an example. In order for an IDS to properly detect intrusive connections, it should process the content of a connection as a whole and not on a per-packet basis, since malicious content tends to spread across multiple packets. In other words, judging by individual packets, an IDS cannot reliably decide whether or not the content is malicious. Therefore, once a load balancer chooses an IDS instance for a particular connection, all packets of that connection should be processed by the same IDS instance. This requirement is referred to as Per-Connection Consistency (PCC) [2]. Violating PCC may result in malfunction, connection interruption, or increased end-to-end latency that degrades the quality of service considerably.

The PCC requirement reduces the load balancing problem to the distribution of new connections among service instances. A PCC load balancer has to resolve two critical questions:

1) Flow Dispatching: Which instance will serve incoming connections? The answer to this question determines how well the load balancer distributes the traffic among instances. An efficient distribution does not overload instances; thus, it maximizes the aggregated throughput of the service.

2) Load Balancing Architecture: How is the traffic directed to the serving instance? This question signifies the architecture and implementation of the load balancer. Therefore, the answer to this question affects the scalability and practicality of the load balancer.

A number of L4 load balancers have been proposed in recent years. These solutions use software [1], [3], [4], hardware [2], or both [5] for load balancing. Most of these solutions rely on Equal Cost Multipath Routing (ECMP) for flow dispatching. Using ECMP at L4 is problematic as it does not take the real-time load of service instances into account for flow dispatching. Although ECMP equally distributes the connections among service instances, nonuniform distribution of connection sizes may cause uneven utilization of instances and therefore degrade the service's throughput. Moreover, instances' capacities are not equal.

This is especially true for Virtualized Network Functions (VNF) that implement network functions in software. As a result of varying processing power and other resources on physical machines, VNF instances can have different packet processing rates. Therefore, even if ECMP could distribute the traffic uniformly among VNFs, their utilization would still be uneven.

L3 load balancers used to rely on ECMP as well; however, it is well known that this method causes uneven traffic distribution [6]. Consequently, state-of-the-art L3 load balancers deprecated ECMP in favor of stateful load balancing algorithms such as Least Congested First (LCF) [7]-[9]. LCF aims at equalizing resource utilization by avoiding highly utilized resources and prioritizing those with the least utilization. However, LCF requires constant polling of resources' utilization, which is excessive at L4 given the large number of service instances.

Architecting load balancers is also challenging, as load balancers are a potential performance bottleneck for cloud-scale services. Traditionally, data centers relied on dedicated load balancers [10]-[12]. Dedicated solutions scale up using application-specific hardware. However, it is impossible for scale-up solutions to keep up with scale-out services. Hence, load balancing itself should be implemented as a distributed scale-out service. Having multiple load balancer instances means that the connection-to-instance mapping is partitioned among multiple devices. Thus, distributed load balancers have to include complex state management and fault tolerance mechanisms to assure PCC.

To address the aforementioned issues, we propose a set of solutions. Our objective is to enable scalable, high-throughput load balancing at L4.

Adaptive Weighted Flow Dispatching (AWFD) is our solution to the flow dispatching problem. Unlike ECMP, AWFD is stateful and dispatches new connections among instances in proportion to the instances' available capacity. As a result, it improves the service's aggregated throughput. AWFD uses instances' available capacity to classify them into different priority classes. Priority classes are then used for assigning new connections to instances. Our simulations show that AWFD with an O(100 ms) update interval and 4 priority classes has near-optimum performance in terms of overall service utilization.

Moreover, we have designed Spotlight, a distributed load balancing architecture that implements AWFD, maintains PCC, and scales out at the data center network's edge. Spotlight polls instances to estimate their available capacity, which is used by AWFD. As an SDN application, Spotlight utilizes a distributed control plane to push AWFD priority classes to edge switches.

Edge switches use the priority classes to dispatch incoming connections to service instances in the data plane. In-band flow dispatching eliminates the pressure on the control plane and allows Spotlight to scale horizontally. AWFD does not require frequent polling of resources' utilization, and our testbed results show that with relaxed polling at O(100 ms) intervals, our solution reduces average flow completion time by 9% compared to ECMP.

The rest of the paper is organized as follows. Section II reviews the load balancing problem in fine detail. Section III explores existing flow dispatchers, presents their weaknesses, and proposes AWFD. Section IV presents Spotlight. Section V evaluates AWFD and Spotlight using flow-level simulations and packet-level experiments on a testbed, respectively. Section VI reviews related work in this area. Section VII concludes the paper.

II. MOTIVATION AND BACKGROUND

A. Terminology

In data centers, services or network functions are usually assigned Virtual IP addresses (VIP). Each VIP has many instances that collectively perform the network function; instances are uniquely identified by their Direct IP addresses (DIP). An L4 Load Balancer (LB) distributes the VIP traffic among DIPs by assigning connections to DIPs; this assignment process is also referred to as flow dispatching or connection-to-DIP mapping. The connection-to-DIP mapping should be consistent, i.e., all packets of a connection(1) should be processed by the same DIP (PCC). We refer to the mapping between active connections and their DIPs as the state of the load balancer.

The following notation is used throughout the paper:

$j$: VIP index.
$i$: DIP index.
$N_j$: number of instances of the $j$th VIP.
$f_i^j$: $i$th instance of the $j$th VIP.
$C_i^j$: capacity of $f_i^j$.
$U_i^j$: utilization of $f_i^j$.
$R_i^j = U_i^j C_i^j$: packet processing rate at $f_i^j$.
$p[f_i^j]$: probability of choosing $f_i^j$ for a new connection to VIP $j$.
$A_i^j = (1 - U_i^j) C_i^j$: available capacity of $f_i^j$.

(1) In this paper the terms connection and flow are used interchangeably.

$C^j = \sum_{i=1}^{N_j} C_i^j$: capacity of VIP $j$.
$T^j = \sum_{i=1}^{N_j} U_i^j C_i^j$: aggregated throughput of VIP $j$.
$\Omega^j = T^j / C^j$: service utilization of VIP $j$.

B. Load Balancing Architecture

Traditionally, data center operators used a dedicated load balancer for each VIP. In this architecture, as illustrated in Figure 1, a hardware appliance is configured as the VIP's load balancer. All traffic to the VIP passes through the dedicated load balancer, which uses an ECMP flow dispatcher. Therefore, the dedicated load balancer is a single point of failure as well as a potential performance bottleneck for its respective service. This is a scale-up architecture utilizing high-capacity, expensive, special-purpose hardware. In this architecture, the load balancer is typically deployed at the core of the network to handle incoming traffic from the Internet as well as intra-data center traffic. As a result, it adds extra hops to connections' data paths. The advantage of this dedicated architecture is its simplicity, since the load balancing state is kept on a single device and can easily be backed up or replicated to ensure PCC in case of failures.

Figure 1: Dedicated load balancing.

Modern data centers, on the other hand, rely on distributed load balancing [4]. In this architecture a number of hardware or software load balancers handle the traffic for one or many VIPs. Distributed load balancing is a scale-out architecture that relies on commodity devices to perform load balancing. Compared to the dedicated architecture, distributed load balancing can deliver higher throughput at a lower cost. Distributed load balancing on the source side is depicted in Figure 2.

Figure 2: Distributed load balancing at source.

In this architecture, the VIP is resolved to a DIP when packets enter the network, and compared to dedicated solutions, connections' data paths have fewer hops. The distributed architecture offloads the load balancing function to edge devices. Distributed load balancing is also capable of handling Internet traffic: for Internet traffic the gateway is the edge of the data center network, and one or multiple load balancers can be installed at the gateway to handle it.

The real cost of distributed load balancing, however, is the complexity of maintaining the load balancing state that is distributed across multiple devices. Ensuring PCC poses a challenge in this architecture since load balancers are prone to failure. To solve this issue, [4] proposes to use consistent hashing [13]. Consistent hashing allows the system to recover the lost state from failing devices and thus guarantee PCC.

C. Problem Statement

We focus on dynamic load balancing, where connection information such as size, duration, and rate is not available to the load balancer. An efficient L4 load balancer should distribute connections among DIPs such that: 1) the connection-to-DIP mapping is consistent (PCC), and 2) the load is distributed uniformly among DIPs. Uniform load distribution has three consequences:

1) The maximum utilization among DIPs will be minimized.
2) The throughput, and thus the overall VIP utilization ($\Omega$), will be maximized.
3) Flows' average completion time will be minimized as a result of the minimized maximum utilization.

Uniform load distribution ensures that we are utilizing the DIP pool as efficiently as possible when the VIP capacity is close to users' demand. An efficient load balancer utilizes as many DIPs as possible to achieve a large $T^j$ and, as a result, a high $\Omega^j$, whereas an inefficient load balancer may overload some DIPs while underutilizing others and achieve a low $\Omega^j$.

A load balancer may have many components to ensure PCC and may use a distributed or dedicated architecture. To guarantee PCC, new flows are assigned to DIPs, and once the assignment is done, it cannot be changed. Therefore, flow dispatching is the core of any L4 load balancing scheme and it determines the efficiency of the load balancer.

III. FLOW DISPATCHING

In this section, we first review existing flow dispatching algorithms and demonstrate their shortcomings for L4 load balancing. Next, we introduce a novel L4 flow dispatching algorithm. Throughout the section we will use a simple example, shown in Figure 3, to compare the performance of various flow dispatching algorithms. In this example, the VIP has 3 DIPs with various capacities. The load balancer receives two new connections ($F_1$, $F_2$) at the same time. While connections' rates are not available to the flow dispatcher, we assume a rate of one for both to compare the throughput of the flow dispatching algorithms. In this example, $C = 8$, and a maximum throughput of $T = 8$ with $\Omega = 1$ is achievable if the new connections are assigned to the second and third instances.

Figure 3: An example of L4 load balancing with 3 DIPs and 2 new connections.
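The per-DIP capacities and loads of this example appear only in Figure 3, which is not reproduced in this transcription. The following Python sketch therefore assumes capacities of 2, 2, and 4 and current loads of 2, 1, and 3 (i.e., available capacities of 0, 1, and 1); this assumption is consistent with C = 8 and with the expected-throughput values derived for ECMP, LCF, and AWFD in the remainder of this section.

    # A minimal sketch of the Figure 3 example with two new unit-rate flows.
    # Per-DIP capacities/loads are assumed (not given in the text): they yield
    # available capacities [0, 1, 1] and total capacity C = 8.
    from itertools import product

    capacity = [2, 2, 4]                 # assumed DIP capacities (sum = C = 8)
    load = [2, 1, 3]                     # assumed current load per DIP
    avail = [c - l for c, l in zip(capacity, load)]   # -> [0, 1, 1]
    base = sum(load)                     # traffic already carried (6)

    def throughput(assignment):
        """Carried traffic if the two unit-rate flows go to the given DIPs."""
        extra = [0, 0, 0]
        for dip in assignment:
            extra[dip] += 1
        return base + sum(min(e, a) for e, a in zip(extra, avail))

    # ECMP: each flow independently picks one of the 3 DIPs -> 9 equally likely pairs.
    ecmp = sum(throughput(a) for a in product(range(3), repeat=2)) / 9     # 7.11
    # LCF: both flows go to the least-utilized DIP (the second one, index 1).
    lcf = throughput((1, 1))                                               # 7.00
    # AWFD with m = 2: weights [0, 2, 2], flows split between DIPs 2 and 3.
    awfd = sum(throughput(a) for a in product([1, 2], repeat=2)) / 4       # 7.50

    print(ecmp / 8, lcf / 8, awfd / 8)   # Omega: 0.89, 0.875, 0.94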

A. Existing Flow Dispatchers

1) Equal Cost Multipath Routing: ECMP is the most commonly used flow dispatcher due to its simplicity. It statistically distributes an equal number of connections among DIPs. Using an ECMP flow dispatcher, a new connection is equally likely to be dispatched to any DIP in the pool:

$$\forall j, i: \quad p[f_i^j] = \frac{1}{N_j}$$

Consider the example of Figure 3. If the two flows are assigned to instances $h$ and $i$, we denote the service throughput by $\tau_{h,i}$. The probability of assigning the new flows to instances $h, i$ is denoted by $p_{h,i}$. Using ECMP, $p_{h,i} = (1/3)^2 = 1/9$. We can derive the expected value of the service throughput under ECMP:

$$E[T_{\mathrm{ECMP}}] = \sum_{h=1}^{3} \sum_{i=1}^{3} p_{h,i}\, \tau_{h,i} = 7.11, \qquad E[\Omega_{\mathrm{ECMP}}] = 0.89$$

Statistically distributing an equal number of flows to DIPs may not result in a balanced distribution of load, for a number of reasons: (i) Connections have huge size discrepancies; indeed, it is well known that the flow size distribution in data centers is heavy-tailed [14]-[16], and it is quite common for ECMP to map several elephant flows to the same resource and cause congestion [6]. (ii) DIPs may have different capacities; this is especially the case for NFV instances. (iii) ECMP does not react to the state of the system. As our example shows, ECMP may dispatch new connections to overwhelmed instances and deteriorate $\Omega$ as a result.

2) LCF: Least-Congested First (LCF) is a dispatching algorithm mainly used at the network layer. LCF is stateful; it periodically polls instances' utilization and always dispatches new connections to the instance with the least utilization. For the example of Figure 3, LCF will consider the second instance as the least utilized one until the next polling; therefore, it will send both of the connections to the second instance. As a result, $E[T_{\mathrm{LCF}}] = 7$ and $E[\Omega_{\mathrm{LCF}}] = 0.875$. As our example shows, LCF may perform worse than ECMP when too many flows enter the system within a polling interval.

Compared to ECMP, LCF relies on polling of resources' utilization to perform stateful load balancing.

The polling enables the system to avoid congestion when the flow size distribution is not uniform or instances have different capacities. However, LCF's performance depends on the polling frequency as well as the flow arrival rate. High sensitivity to the polling frequency makes LCF configuration challenging. Existing LCF-based routing schemes process flowlets [17] rather than flows and use very small polling intervals, ranging from a few RTTs [7] to the sub-millisecond scale [8]. This makes LCF extremely expensive for L4 flow dispatching since:

Frequent polling is extremely expensive in terms of communication and processing overhead, since the number of DIPs is much larger than the number of available paths at L3 (tens of thousands compared to hundreds).

LCF puts a large burst of load on the current least-congested DIP until the next polling.

B. Adaptive Weighted Flow Dispatching (AWFD)

We propose AWFD, a stateful flow dispatcher for L4. AWFD polls DIPs' load to avoid sending new flows to overwhelmed DIPs. It distributes new flows among a group of uncongested DIPs, rather than just a single DIP, within each polling interval. Having a group of active instances reduces the pressure on DIPs and thus allows for less frequent polling of DIPs' utilization. Furthermore, since multiple DIPs with various available capacities are active at the same time, AWFD assigns weights to DIPs to ensure that incoming flows are distributed in proportion to instances' available capacity.

We optimize AWFD for implementation on programmable data planes. The first step is to partition the DIP pool into smaller groups comprised of DIPs with roughly equal available capacity. We use priority classes to refer to groups of DIPs with roughly equal available capacity. With the use of priority classes, AWFD breaks down flow dispatching for new flows into two stages:

1) Assign the new flow to a priority class. The probability of choosing a priority class is proportional to the aggregated available capacity of its members.

2) Assign the new flow to a DIP within the selected priority class. Since priority class members have roughly equal available capacity, we can randomly select one of the instances to serve the new flow (same as ECMP).

Next, we formally define AWFD and show that the two-stage algorithm assigns new flows to DIPs in proportion to the DIPs' available capacity.

1) AWFD Design: In this section we assume that the instances' current load and capacity are available to the flow dispatcher. Section IV-C elaborates how Spotlight estimates these parameters. AWFD is formally defined using the following notation:

$m$: maximum weight assigned to network function instances.
$k$, $0 \le k \le m$: weight of a network function instance.
$w_i^j$, $0 \le w_i^j \le m$: weight assigned to $f_i^j$.
$B_k^j$: priority class $k$ for the $j$th VIP, i.e., the set of all instances of the $j$th VIP that have weight $k$.
$|B_k^j|$: size of priority class $B_k^j$.
$p[B_k^j]$: probability of choosing class $k$ for a new connection assigned to network function $j$.
$p[f \mid B_k^j]$: probability of choosing DIP $f$ of $B_k^j$ given that $B_k^j$ was selected by the first stage of AWFD.

To form priority classes, AWFD quantizes DIPs' available capacity into an integer between 0 and $m$ and uses this value as the instance's weight:

$$\forall j, i: \quad w_i^j = \left\lfloor m \cdot \frac{A_i^j}{\max_i (A_i^j)} \right\rfloor$$

Instances with the same weight have roughly equal available capacity and form a priority class. When a new connection enters the system, the first stage of AWFD assigns it to a priority class with a probability that is proportional to the aggregated weight of the priority class members:

$$\forall j: \quad p[B_k^j] = \frac{\sum_{i: f_i^j \in B_k^j} w_i^j}{\sum_i w_i^j} = \frac{k\, |B_k^j|}{\sum_i w_i^j} \qquad (1)$$

Note that in the first stage the probability of choosing $B_0^j$ (the group of DIPs with little to no available capacity) or empty classes is zero. Therefore, overwhelmed DIPs of $B_0^j$ and empty priority classes are inactive and do not receive new connections. The second stage selects an instance from the chosen non-empty class with equal probability:

$$\forall j, k > 0, f \in B_k^j: \quad p[f \mid B_k^j] = \frac{1}{|B_k^j|}$$

Given that the two stages of the algorithm work independently, the probability of choosing a function instance is proportional to its weight:

$$\forall j, k, f \in B_k^j: \quad p[f] = p[f \mid B_k^j]\; p[B_k^j] = \frac{k}{\sum_i w_i^j} \qquad (2)$$

Equation 2 ensures that the probability of choosing an instance is proportional to its weight and thus to the DIP's available capacity.
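The two-stage selection above can be summarized in a short Python sketch. This is an illustrative model of the algorithm as described, not the paper's implementation; the floor quantization and the use of per-flow random numbers (instead of the hash-based range match used in the data plane, Section IV-A) are assumptions made for brevity.

    # Illustrative sketch of two-stage AWFD dispatching (not the paper's code).
    import random
    from collections import defaultdict

    def quantize_weights(avail, m):
        """Map each DIP's available capacity to an integer weight in [0, m]."""
        a_max = max(avail)
        if a_max <= 0:
            return [0] * len(avail)
        return [int(m * a / a_max) for a in avail]   # floor quantization (assumed)

    def build_priority_classes(weights):
        """Group DIP indices by weight; B_0 and empty classes stay inactive."""
        classes = defaultdict(list)
        for dip, w in enumerate(weights):
            if w > 0:
                classes[w].append(dip)
        return classes

    def awfd_dispatch(classes, rng=random):
        """Stage 1: pick class k with probability k*|B_k| / sum of all weights.
           Stage 2: pick a member of the chosen class uniformly at random."""
        if not classes:
            raise RuntimeError("no active priority class")
        totals = {k: k * len(members) for k, members in classes.items()}
        r = rng.uniform(0, sum(totals.values()))
        for k, members in classes.items():
            if r < totals[k]:
                break                      # stage 1: class k chosen
            r -= totals[k]
        return rng.choice(members)         # stage 2: uniform within the class

    # Figure 3 example with assumed available capacities [0, 1, 1] and m = 2:
    weights = quantize_weights([0, 1, 1], m=2)     # -> [0, 2, 2]
    classes = build_priority_classes(weights)      # -> {2: [1, 2]}
    dip = awfd_dispatch(classes)                   # DIP 1 or 2, each with prob. 1/2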

As an SDN application, AWFD scales in the data plane as well as in the control plane. In the data plane, all operations of the two stages of the algorithm are compatible with the P4 language [18] and can be ported to any programmable switch. From the control plane's point of view, AWFD does not require per-flow rule installation. Instead, forwarding decisions are handled in the switches' data plane when new connections arrive. The control plane periodically transfers updates of the priority classes to switches and enables them to perform in-band flow dispatching.

AWFD can match the behavior of WFQ, ECMP, and LCF. If we increase the value of $m$ and the rate of updates, the performance will be similar to that of WFQ. Choosing a small value for $m$ makes the algorithm similar to LCF; AWFD with $m = 1$ is equivalent to LCF, since all DIPs will be deactivated apart from the one with the largest available capacity. AWFD with no updates and $m = 0$ is equivalent to ECMP, as all DIPs are put into one priority class regardless of their available capacity and have the same chance of getting a new flow.

Consider the example of Figure 3. If AWFD with $m = 2$ is used for flow dispatching, the instances get the following weights: $w_1 = 0$ and $w_2 = w_3 = 2$. $B_2$ includes the second and third instances and $B_1$ is empty; therefore, AWFD will ECMP the flows between the second and third instances:

$$E[T_{\mathrm{AWFD}}] = \tfrac{1}{4}(\tau_{2,2} + \tau_{2,3} + \tau_{3,2} + \tau_{3,3}) = 7.5, \qquad E[\Omega_{\mathrm{AWFD}}] = 0.94$$

By dispatching new flows to multiple instances in different priority classes, rather than to one instance as in LCF, AWFD reduces the burstiness of the traffic dispatched to instances. As a result, DIPs' available capacities are less volatile compared to LCF, and the polling frequency can be reduced. Thus, AWFD is more scalable than LCF, as the amount of traffic on the control plane is reduced. In Section V we show that AWFD with an O(100 ms) polling interval can outperform LCF with an O(1 ms) polling interval.

IV. SPOTLIGHT ARCHITECTURE

Spotlight periodically polls DIPs to estimate their available capacity. During each polling interval, it uses AWFD to distribute new flows among DIPs in proportion to their available capacity. Spotlight uses distributed load balancing at the connections' source, as illustrated in Figure 2.

It is implemented at the programmable edge of the network, i.e., the programmable networking device that is closest to the connection's source. A programmable Top-of-Rack (ToR) switch, a Network Interface Card (NIC), or the software switch at the hypervisor can act as the programmable edge. Using the P4 language [18], [19], we can program the data plane and port it to a compatible switch [20]-[22], smart NIC [23]-[26], or software switch at the hypervisor [27] with little or no modification. Spotlight's control plane delivers AWFD weights to edge devices; it is distributed across VIPs and scales horizontally. Spotlight's flow dispatcher works at the flow level and is decoupled from L3 routing. Therefore, Spotlight is orthogonal to flowlet-based L3 multi-path routing solutions and can be implemented on top of such protocols.

A. Implementation of AWFD Tables

AWFD is implemented using the P4 language. As shown in Figure 4, we use two tables for the two stages of the algorithm. The first table selects a Priority Class (PC) based on the 5-tuple flow information. In Eq. 1 we have established the probability of choosing each PC. Since there are $m$ PCs(2) for each VIP, this discrete PDF over the PCs can be implemented in the data plane using $m + 1$ discrete ranges in $(0, \sum_i w_i^j)$. The first table includes these ranges and utilizes P4 range matching to map the hash of the 5-tuple flow information to one of the ranges, attaching the matched range as PC metadata to the packet. The second table, corresponding to the second stage of the algorithm, includes $m$ ECMP groups corresponding to the priority classes. This table matches on the PC metadata and chooses one of the ECMP groups accordingly.

(2) $B_0$ has a probability of 0; therefore we exclude it from the rest of the PCs.

Figure 4: AWFD tables in the data plane.
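The following Python sketch illustrates the idea behind the first-stage table: the controller turns the per-class weights into cumulative ranges over $(0, \sum_i w_i^j)$, and the data plane range-matches the hash of the 5-tuple against them. The hash function and the exact range layout are assumptions for illustration; the actual implementation uses P4 range matching as described above.

    # Illustrative sketch of the first-stage AWFD table (not the P4 program).
    import zlib

    def build_ranges(classes):
        """Return [(lo, hi, k), ...]; the width of class k is k * |B_k|."""
        ranges, lo = [], 0
        for k in sorted(classes):
            hi = lo + k * len(classes[k])
            ranges.append((lo, hi, k))
            lo = hi
        return ranges, lo                     # lo ends up as the sum of all weights

    def first_stage_lookup(five_tuple, ranges, total):
        """Data-plane analogue: hash the 5-tuple and range-match it to a PC."""
        h = zlib.crc32(repr(five_tuple).encode()) % total    # assumed hash
        for lo, hi, k in ranges:
            if lo <= h < hi:
                return k
        return None

    classes = {2: [1, 2]}                     # from the Figure 3 example (m = 2)
    ranges, total = build_ranges(classes)     # -> [(0, 4, 2)], total = 4
    pc = first_stage_lookup(("10.0.0.1", 12345, "10.0.1.1", 80, "TCP"),
                            ranges, total)    # -> 2

A new flow is then handed to the ECMP group of the selected priority class, which corresponds to the second table in Figure 4.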

From the control plane's point of view, $n$ weight updates for the DIPs may result in $m$ updates to the first table and $2n$ updates to the second table, as each updated DIP should be removed from one ECMP group and added to another. After each polling round that results in $n$ DIP weight updates, at most $2n + m$ updates need to be performed on the switches.

B. Data Plane

Figure 5 illustrates Spotlight's data plane. The data plane includes a connection table that maps existing connections to DIPs, as well as AWFD tables that contain the controller-assigned AWFD ranges and ECMP groups. The AWFD tables are updated periodically and are used to assign new flows to DIPs in-band.

VIP-bound packets first pass through the connection table. The connection table guarantees PCC by ensuring that existing connections are directed to their pre-assigned DIP. If the connection table misses a packet, then the packet either belongs to a new connection or it belongs to a connection for which the DIP has been assigned in the data plane but the switch API has yet to install the rule in the connection table (discussed in Section IV-E4). Packets belonging to new flows hit the AWFD tables: the first table chooses a priority class and the second table selects a DIP member from the chosen class using ECMP. Similar to Silkroad [2], once a DIP is chosen, a packet digest is sent to the switch API, which adds the new DIP assignment to the connection table.

Figure 5: Spotlight's data plane.
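A minimal sketch of this per-packet decision logic is shown below. It mirrors the prose (connection-table hit/miss, in-band AWFD on a miss, and a digest to the switch API) rather than the actual P4 program; the table objects and the digest queue are placeholders.

    # Sketch of Spotlight's per-packet logic as described above (not real P4).
    connection_table = {}          # 5-tuple -> DIP, installed by the switch API
    digest_queue = []              # digests pending installation by the switch API

    def handle_packet(five_tuple, vip, awfd_tables):
        """Return the DIP this VIP-bound packet should be forwarded to."""
        dip = connection_table.get(five_tuple)
        if dip is not None:
            return dip                                     # existing connection (PCC)
        # Miss: new connection (or a rule the switch API has not installed yet).
        pc = awfd_tables.first_stage(five_tuple, vip)      # range match on the hash
        dip = awfd_tables.second_stage(five_tuple, vip, pc)  # ECMP within the class
        digest_queue.append((five_tuple, dip))             # digest to the switch API
        return dip

    def switch_api_drain():
        """Control-path helper: install pending connection-table entries."""
        while digest_queue:
            five_tuple, dip = digest_queue.pop(0)
            connection_table.setdefault(five_tuple, dip)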

Figure 6: Spotlight's control plane is distributed over VIPs.

C. Estimation of Available Capacity

Spotlight estimates the available capacity of service instances using their average processing time and load. As shown in Figure 6, Spotlight polls DIPs to collect each instance's average processing time ($t_i^j$) and load ($R_i^j$). For each instance, the average processing time is used to estimate the instance's capacity: $C_i^j = 1/t_i^j$. The available capacity of a DIP is then approximated from its capacity and rate: $A_i^j = (1 - U_i^j) C_i^j = C_i^j - R_i^j$. Section IV-E elaborates how these values can be acquired in the data plane if DIPs do not support reporting to the controller.

D. Control Plane

As shown in Figure 6, the controller is distributed across VIPs, i.e., each service has its own controller. The controller multicasts an INT probe to the DIPs to poll $t_i^j$ and $R_i^j$, and uses these values to approximate the instances' available capacity. The controller regularly updates the next version of the AWFD tables on the edge switches.
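The controller-side estimation above can be summarized in a few lines of Python. This is a sketch of the relations $C_i^j = 1/t_i^j$ and $A_i^j = C_i^j - R_i^j$; the EWMA smoothing of processing-time samples mentioned in Section IV-E is modeled with an assumed smoothing factor.

    # Sketch of per-instance available-capacity estimation (controller side).
    class InstanceEstimator:
        def __init__(self, alpha=0.2):
            self.alpha = alpha            # assumed EWMA smoothing factor
            self.avg_proc_time = None     # t_i^j, seconds per packet

        def update_processing_time(self, sample):
            """Fold a new processing-time sample into the EWMA."""
            if self.avg_proc_time is None:
                self.avg_proc_time = sample
            else:
                self.avg_proc_time = (self.alpha * sample
                                      + (1 - self.alpha) * self.avg_proc_time)

        def available_capacity(self, rate):
            """A = C - R, with C = 1/t and R the polled packet rate."""
            if not self.avg_proc_time:
                return 0.0
            capacity = 1.0 / self.avg_proc_time
            return max(capacity - rate, 0.0)

    # Example: a DIP averaging 0.5 ms per packet while serving 1500 packets/s
    # has an estimated capacity of 2000 packets/s and ~500 packets/s of headroom.
    est = InstanceEstimator()
    est.update_processing_time(0.0005)
    headroom = est.available_capacity(rate=1500.0)   # -> 500.0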

1) Control Plane Scalability: In addition to distributing the control plane per VIP, Spotlight uses several techniques to reduce the amount of traffic on the control plane and thus improve control plane scalability. These methods include:

In-band flow dispatching. Spotlight's control plane programs the AWFD tables, which dispatch new connections in the data plane. As a result, incoming flows do not generate control plane traffic.

Sending only AWFD updates. Spotlight controllers transfer only the changes to the AWFD tables to switches. This means that if a controller updates the priority class for $x$ DIPs, it has to update at most $m$ ranges in the first AWFD table, remove $x$ DIPs from their old ECMP groups, and add them to the new ECMP groups. Therefore, for $x$ DIP updates at most $m + 2x$ updates will be sent to the switch. Given that $m$ is a small number for AWFD, the number of new rules at the switch is proportional to the number of weight updates in the DIP pool, which amounts to a small value at steady state.

Low-frequency AWFD polling. AWFD assigns weights to DIPs according to their available capacity to ensure that multiple DIPs are active in each polling interval. This makes AWFD less sensitive to the polling frequency and allows the polling to be relaxed. Spotlight updates the switches after each polling round; hence, using long intervals helps reduce the control plane traffic.

E. Discussion

1) How does Spotlight utilize legacy network devices that do not support reporting load and processing times to the controller? A programmable data plane can estimate and report the average processing time for legacy DIPs that do not communicate with the Spotlight controller. In order to measure the average processing time at the network's edge, we can sample packets from different connections using a sketch algorithm [28]. The switch adds an In-band Network Telemetry (INT) [29] header, containing the packet's arrival time, to each sampled packet and directs it to the assigned DIP. After the packet is processed by the DIP, it returns to the edge switch, which uses the current time and the timestamp in the INT header to estimate the processing time. The switch then sends the DIP's estimated processing time ($t_i^j$) to the controller. The controller uses an Exponentially Weighted Moving Average (EWMA) to estimate the average processing time.

2) What happens if the connection table grows too large to fit in the limited memory of edge switches? There are multiple solutions to this issue:

(i) The connection table at the switch may be used as a cache, while the complete copy of the table is kept by an SDN application. If a packet misses the connection table, it either belongs to a new connection or to an existing connection not present in the cache. New connections are identified by checking the TCP SYN flag and are processed by the AWFD tables. For existing connections that miss the connection table cache, a request is sent to the SDN application to update the cache.

(ii) Silkroad [2] proposes to limit the size of the connection table by using hash tables on the switch to accommodate a complete copy of the connection table. The probability of hash collisions can be reduced by utilizing Bloom filters.

(iii) Thanks to the P4 language, Spotlight can easily be ported to a NIC or to the software switch at the hypervisor. These devices have plenty of available memory; while a switch may have hundreds of megabytes of SRAM, NICs and virtual switches have gigabytes of DRAM. Moreover, the number of connections at the host or VM level is much smaller than at the ToR level. Therefore, smart NICs and virtual switches can easily host a full copy of the connection table.

3) How does Spotlight ensure PCC if a load balancer fails? If a load balancer fails, its connection table is lost; therefore, we have to make sure that other load balancers can recover the connection-to-DIP mapping to ensure PCC for connections assigned to the failing device. Any Spotlight instance can recover the connection-to-DIP mapping within the same polling interval, since both stages of AWFD use the 5-tuple flow information to assign connections to instances. However, the AWFD tables may get updated afterwards; as such, the connection-to-DIP mapping cannot be recovered for old connections. To solve this issue, we have to rely on another SDN application to track the connection tables at Spotlight load balancers. If a device fails, the connection tables at the other devices act as a cache, and the SDN application can provide the connection-to-DIP mapping for the cache misses (see the first solution in Section IV-E2).

4) How does Spotlight guarantee PCC during updates? The switch API adds new flows to the connection table. However, in the current generation of P4-compatible devices, these updates are not done by the ASIC; only the switch CPU can update tables. Therefore, we may observe some delay from the time that new flows are processed in the data plane to the time that the new rules are actually installed in the connection table. As such, some packets of an already-assigned connection may miss the connection table.

The AWFD tables use the 5-tuple flow information to assign new connections; therefore, this delay does not cause a PCC violation as long as the AWFD tables are not changed during the update delay. However, PCC may be violated if a periodic AWFD update takes place during the update delay. Therefore, to meet PCC, we have to guarantee that the AWFD tables will not change during the update delay. Silkroad proposes a solution for this phenomenon: edge switches keep track of multiple versions of the AWFD tables to ensure consistency when the control plane updates them. For new connections, the data plane records the latest version of the AWFD tables at the time the packet arrives and stores that version in a register, which can be updated in the data plane by the ASIC. We can identify the very first packet of a new flow by matching the TCP SYN flag or by using a Bloom filter, and use that first packet to set the per-flow register to the latest version of the AWFD tables. For the following packets of the flow during the update delay, the value of the register is loaded, which points the data plane to the correct version of the AWFD tables to be used for that specific flow, thus meeting PCC.

V. EVALUATION

A. Flow-level Simulations

To evaluate the effectiveness of Spotlight in dispatching flows to different instances, flow-level simulations are conducted. As discussed in the previous sections, the available capacity of each instance is used as the weight for AWFD. The rationale is to have all instances reach full utilization at roughly the same time. Therefore, we use the overall service utilization ($\Omega^j$) as the performance metric to compare AWFD with other schemes. $\Omega^j$ is defined as the total carried traffic across all instances divided by the total capacity across all instances of VIP $j$. If a flow is assigned to an instance with an available capacity smaller than the flow rate, the flows running on this instance will have reduced rates instead of their intended rates, because all flows assigned to the congested instance share its capacity. As a result, the overall service utilization is degraded, as part of the flows' traffic demand cannot be served.

We compare the performance of AWFD to the ECMP and LCF schemes. ECMP dispatches new flows to all instances with equal probability regardless of their available capacity. As for LCF, the controller collects the available capacity from all instances at each update interval and chooses the one with the largest available capacity. This instance is then used for all the new flows that arrive before the next update. To provide a performance baseline for comparison, a heuristic sub-optimal approach is also used in the simulation.

This approach assumes the controller knows the flow size and the instantaneous available capacity of each instance when a new flow arrives. The controller then assigns the new flow to the instance with the largest available capacity. This is equivalent to LCF where the updates are done instantaneously.

Figure 7: Overall service utilization of different flow dispatchers with the synthesized trace.

Two different traffic traces are used in the simulation. The first one is a synthesized traffic trace. We generate the flows based on a Pareto distribution to simulate the heavy-tailed flow size distribution found in most data centers [14]-[16], [30], with the shape parameter (α) set to 2. This heavy-tailed distribution produces both mice and elephant flows, with the number of mice flows much larger than the number of elephant flows. The flow inter-arrival time and flow duration are generated based on a Poisson distribution, with the average inter-arrival time set to 1 ms and the average flow duration set to 10 seconds. The flow statistics are summarized in Table I. In total, 100k flows are generated for the simulation.

Four network functions (i.e., VIPs) are used, each having 20 instances (i.e., DIPs). Among all the instances, there are two types of DIPs with a capacity ratio of 1:2. Each flow is assigned to a service chain consisting of 1 to 4 different services. The capacity of each instance is configured so that the total requested traffic is slightly more than the total capacity of the services. Under this configuration a good flow dispatching scheme can stand out.

Table I: Flow Statistics

Traffic Trace                 | Pareto Distribution | Production Data Center
Number of Flows               | 100,000             | ~200,000
Avg. Flow Duration            | 10 s                | 7.2 s
Avg. Flow Inter-arrival Time  | 1 ms                | 18 ms
Avg. Flow Size                | 2 KBps              | 15.8 KBps

Figure 7 shows the overall service utilization of the different schemes, where the x-axis is the update window interval and the y-axis is $\Omega$. As we can see from the figure, AWFD always outperforms ECMP, and its performance is very close to the heuristic sub-optimal approach. This is because ECMP does not take into account the available capacity of the instances. When there is a capacity discrepancy among the instances, ECMP can overload those with lower capacity while leaving those with higher capacity under-utilized. The performance of AWFD is very close to LCF when the update interval is very short. However, when the update interval increases, the performance of AWFD only slightly degrades while the performance of LCF decreases significantly. This shows that AWFD is less sensitive to the update interval as a result of using weighted flow dispatching. LCF, on the other hand, is very sensitive to the update interval, because with a longer update interval more flows are assigned to the least congested instance and could potentially overload it.

The second traffic trace used for the simulation is captured from a production data center [16]. There are in total around 200k flows in the captured trace and its duration is 1,000 seconds. The traffic trace only provides flow size, start time, and finish time information, and therefore the service chain assignment and instance settings were synthesized similarly to the previous experiment. Figure 8 shows the $\Omega$ for the different flow dispatchers. As we can see from the figure, AWFD outperforms both ECMP and LCF. However, it is less close to the heuristic sub-optimal scheme compared to the previous experiment. This is due to heavy fluctuations in the traffic: heavy bursts overwhelm most instances and are followed by lulls that leave most instances under-utilized. This kind of traffic pattern hinders most flow dispatching approaches from achieving their best performance. However, AWFD is still able to improve the performance compared to the other solutions.

As discussed in the previous sections, the value of $m$ in AWFD is the quantization parameter that can be configured and impacts flow dispatching performance.

Figure 8: Overall service utilization of different flow dispatchers with the production data center trace.

Figure 9: AWFD: overall service utilization vs. the value of the maximum weight (m) with the synthesized trace.

With larger $m$ values, the instances are able to obtain weights closer to their real values based on the available capacity. However, larger $m$ values increase the number of priority classes as well as the number of required updates from the controller, which increases the amount of traffic on the control channel and hinders scalability. Therefore, we also evaluate what value of $m$ is enough for AWFD to achieve good performance without burdening the control channel. With the synthesized trace at an update interval of 500 ms, we vary $m$ from 1 to 16 and compare the performance with $m$ set to infinity, which means the probability of choosing an instance is exactly proportional to its available capacity. Figure 9 shows the overall service utilization for different $m$ values. From this figure, we can see that $m$ values as small as 4 already achieve good performance close to the ideal case.

B. Testbed Experiments

We have implemented Spotlight (including the AWFD flow dispatcher, connection tables, control plane polling, and periodic AWFD updates) on a small-scale testbed. In our experiments, we have used Spotlight to distribute the load among instances of a hypothetical file server. We assume that requests originate from the Internet and that a distributed service fulfills the requests. In our experiments, any instance can serve any request; however, the application requires PCC, as instances drop connections that violate PCC.

Figure 10: Testbed architecture.

Figure 10 illustrates the testbed architecture. A traffic generator acts as the client, which randomly sends requests to download files.

All of the requests are sent to the file service's VIP, and the client has no knowledge of the DIPs. Spotlight receives the requests on a 40G link at the data center's gateway and distributes them among the DIPs.

We have implemented a Spotlight load balancer, including the data plane and the control plane, using a Hybrid Modular Switch (HyMoS) [22]. The HyMoS uses dual-port 40G Netronome NFP4000 [23] NICs on a server with a 10-core Intel Core i9-7900X CPU and 64 GB of RAM. Spotlight's connection table, which ensures PCC, is implemented on the HyMoS line cards using the P4 language. The AWFD tables and the control plane are implemented on the CPU using an asynchronous [31] single-threaded Python application. In our implementation, the line cards send new flows to the Python application, which runs AWFD. The switch CPU runs the Python application that assigns new connections to DIPs and updates the line cards' connection tables. The application also polls the DIPs' average processing times in order to perform AWFD.

Eight DIPs on two physical machines serve the requests. As shown in Figure 10, Host1 and Host2 run three and five DIPs, respectively. DIPs are implemented as asynchronous Python file servers wrapped in Docker containers [32]. The containers can access all of the files; therefore, any DIP can serve any request. DIPs drop connections in an unknown state [33], i.e., TCP connections that are not established. During the course of the experiments, which lasted more than six hours, we did not observe any PCC violations. Hosts 1 and 2 use Intel's 3.5 GHz 6-core Core i7-7800X and 3.3 GHz 10-core Core i9-7900X, respectively. Both machines are configured with 64 GB of RAM and identical SSD storage. In our experiment the DIPs' performance is I/O bound, as the containers need to read and transmit files from storage. As a result, the three DIPs on Host1 have a slightly higher capacity compared to the five DIPs on Host2. The DIPs' capacities are not constant and fluctuate depending on the number and size of active connections.

We have run multiple experiments using real traffic traces. We used the flow size distribution from a production data center [16] to create 50K files at the file server. For each request, the client randomly chooses one of the files; therefore, the size of connections in our experiment follows the same distribution as [16]. Client requests have a Poisson inter-arrival distribution. The inter-arrival interval is extracted from [16]; however, the event rate (λ) is multiplied by ten to allow traffic generation at a faster pace.

Figure 11 presents the experiments' results. We measured two metrics: aggregated service throughput and average flow completion time. Our experiments have two variable parameters: the number of AWFD classes ($m$) and the AWFD update interval.

$m$ determines the flow dispatching algorithm: a value of zero for $m$ yields ECMP flow dispatching, $m = 1$ is LCF, and larger $m$ values unleash the real performance of AWFD. The update interval specifies how often AWFD polls the DIPs to update the weights. Larger intervals help the control plane scale by minimizing the number of updates on the switches.

Figure 11: Testbed results. (a) Average flow completion time with 100 ms updates. (b) Aggregated service throughput with 100 ms updates. (c) Average flow completion time for various update intervals.

In the first experiment we observe the effect of the load balancing algorithm on the average flow completion time, using a 100 ms update interval for $m > 0$ and no updates for ECMP ($m = 0$). In each experiment the client sends random requests for 60 seconds at a rate that is close to the maximum capacity of the service. Experiments are repeated 5 times to gather more reliable results. The average value and standard error of the flow completion times are shown in Figure 11a. AWFD with $m = 4$ has the best performance.

Compared to ECMP ($m = 0$), it reduces the average flow completion time by 9%. LCF ($m = 1$) has the worst performance, since we are using a large update interval. Interestingly, as we increase the value of $m$, the measurement error decreases. We believe this is a direct result of using polling and feedback for load distribution. Put another way, ECMP results are more random since it does not take the DIPs' state into account, while AWFD results are more consistent as it polls the DIPs and reacts to potential load imbalance.

Next, we use the same settings as the previous experiment to measure the aggregated throughput of the service. As shown in Figure 11b, AWFD with $m = 4$ shows a 7% improvement over ECMP. Similar to the previous experiment, 100 ms updates appear to be too long for LCF, as $m = 1$ shows the lowest throughput.

Finally, Figure 11c illustrates the flow dispatching performance's sensitivity to the update interval. As shown in the figure, ECMP does not involve polling, so its performance is unaffected. AWFD with $m = 2$ and $m = 4$ outperforms ECMP even with a 1 s update interval. This shows that we can realize a congestion-aware L4 load balancer at very little cost in terms of switch updates. Indeed, with a 1 s update interval the number of updates to the AWFD tables is negligible compared to the per-flow updates to the connection table, which is present with ECMP as well.(3) Our results show that LCF's performance degrades rapidly when the update interval is prolonged. Unfortunately, the combination of our Python application and the RPC server at the NIC cannot update the weights faster than every 100 ms; therefore, we were unable to find any scenario in which LCF ($m = 1$) shows good performance.

(3) ECMP without a connection table cannot meet PCC, as updates to the DIP pool would cause stateless ECMP to violate PCC. Therefore, a PCC L4 load balancer requires the connection table even if ECMP flow dispatching is used.

VI. RELATED WORK

In recent years a number of new load balancers have been proposed for data center networks. Ananta [1] is one of the earliest works to point out the significance of L4 load balancing in data centers, and it proposes an ECMP-based software load balancer. Duet [5] introduces hybrid load balancing by using connection tables in software to maintain consistency while using ECMP in commodity hardware for flow dispatching. Maglev [4], an ECMP-based software load balancer, utilizes consistent hashing [13] to ensure PCC in the face of frequent DIP pool changes and load balancer failures.

Silkroad [2] uses modern programmable data planes for hardware load balancing with connection tables and ECMP flow dispatching. Rubik [3] is the only software load balancer that does not use ECMP; instead, it takes advantage of traffic locality to minimize the traffic at the data center network's core by sending traffic to the closest DIP. OpenFlow-based [34] solutions [35]-[37] rely on the SDN controller to install per-flow rules, wildcard rules, or a combination of both for load balancing; while flexible, per-flow rule installation does not scale out. Wildcard rules, on the other hand, limit the flexibility of SDN.

At L3, however, ECMP-based load balancing has fallen out of favor. Hedera [6] is one of the earliest works to show ECMP's deficiencies, and it proposes rerouting of elephant flows as a solution. CONGA [7] is the first congestion-aware load balancer that distributes flowlets [17] and prioritizes least-congested links (LCF). HULA [8] and Clove [9] extend LCF-based load balancing on flowlets to heterogeneous networks and to the network's edge, respectively, while having smaller overhead than CONGA. LCF, however, requires excessive polling that ranges from O(round-trip time) in CONGA to O(1 ms) in HULA.

Spotlight borrows the use of a connection table from existing work in the field to meet PCC and combines it with AWFD, a new congestion-aware flow dispatcher that generalizes LCF with less frequent updates. To the best of our knowledge, AWFD is the first in-band congestion-aware flow dispatcher at L4. Software Defined Networking [34], [38]-[41], Network Function Virtualization [42]-[44], programmable switches [20], [45], and network programming languages such as P4 [18], [19] are the enablers of research and progress in this area. Spotlight, as well as most of the mentioned works in this area, utilizes these technologies to perform load balancing in data centers.

VII. CONCLUSION

In this paper we take a fresh look at transport layer load balancing in data centers. While state-of-the-art solutions in the field define per-connection consistency (PCC) as the sole objective, we aim at distributing the traffic uniformly among instances on top of PCC. Typical L4 load balancers use ECMP to obliviously dispatch traffic among service instances. We identify ECMP flow dispatching as the problem and propose Adaptive Weighted Flow Dispatching (AWFD) to distribute incoming traffic among service instances in proportion to their available capacity. We also introduce Spotlight, an L4 load balancer that ensures PCC while using AWFD for flow dispatching. Spotlight periodically polls instances' load and processing times to estimate the


More information

Alizadeh, M. et al., " CONGA: distributed congestion-aware load balancing for datacenters," Proc. of ACM SIGCOMM '14, 44(4): , Oct

Alizadeh, M. et al.,  CONGA: distributed congestion-aware load balancing for datacenters, Proc. of ACM SIGCOMM '14, 44(4): , Oct CONGA Paper Review By Buting Ma and Taeju Park Paper Reference Alizadeh, M. et al., " CONGA: distributed congestion-aware load balancing for datacenters," Proc. of ACM SIGCOMM '14, 44(4):503-514, Oct.

More information

DevoFlow: Scaling Flow Management for High Performance Networks

DevoFlow: Scaling Flow Management for High Performance Networks DevoFlow: Scaling Flow Management for High Performance Networks SDN Seminar David Sidler 08.04.2016 1 Smart, handles everything Controller Control plane Data plane Dump, forward based on rules Existing

More information

Flow-start: Faster and Less Overshoot with Paced Chirping

Flow-start: Faster and Less Overshoot with Paced Chirping Flow-start: Faster and Less Overshoot with Paced Chirping Joakim Misund, Simula and Uni Oslo Bob Briscoe, Independent IRTF ICCRG, Jul 2018 The Slow-Start

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

DIBS: Just-in-time congestion mitigation for Data Centers

DIBS: Just-in-time congestion mitigation for Data Centers DIBS: Just-in-time congestion mitigation for Data Centers Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, Jitendra Padhye University of Southern California Microsoft Research Summary

More information

Scalable Enterprise Networks with Inexpensive Switches

Scalable Enterprise Networks with Inexpensive Switches Scalable Enterprise Networks with Inexpensive Switches Minlan Yu minlanyu@cs.princeton.edu Princeton University Joint work with Alex Fabrikant, Mike Freedman, Jennifer Rexford and Jia Wang 1 Enterprises

More information

Appendix B. Standards-Track TCP Evaluation

Appendix B. Standards-Track TCP Evaluation 215 Appendix B Standards-Track TCP Evaluation In this appendix, I present the results of a study of standards-track TCP error recovery and queue management mechanisms. I consider standards-track TCP error

More information

Enabling Efficient and Scalable Zero-Trust Security

Enabling Efficient and Scalable Zero-Trust Security WHITE PAPER Enabling Efficient and Scalable Zero-Trust Security FOR CLOUD DATA CENTERS WITH AGILIO SMARTNICS THE NEED FOR ZERO-TRUST SECURITY The rapid evolution of cloud-based data centers to support

More information

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS

CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS 28 CHAPTER 3 EFFECTIVE ADMISSION CONTROL MECHANISM IN WIRELESS MESH NETWORKS Introduction Measurement-based scheme, that constantly monitors the network, will incorporate the current network state in the

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

Pricing Intra-Datacenter Networks with

Pricing Intra-Datacenter Networks with Pricing Intra-Datacenter Networks with Over-Committed Bandwidth Guarantee Jian Guo 1, Fangming Liu 1, Tao Wang 1, and John C.S. Lui 2 1 Cloud Datacenter & Green Computing/Communications Research Group

More information

DiffFlow: Differentiating Short and Long Flows for Load Balancing in Data Center Networks

DiffFlow: Differentiating Short and Long Flows for Load Balancing in Data Center Networks : Differentiating Short and Long Flows for Load Balancing in Data Center Networks Francisco Carpio, Anna Engelmann and Admela Jukan Technische Universität Braunschweig, Germany Email:{f.carpio, a.engelmann,

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Random Early Detection Gateways for Congestion Avoidance

Random Early Detection Gateways for Congestion Avoidance Random Early Detection Gateways for Congestion Avoidance Sally Floyd and Van Jacobson Lawrence Berkeley Laboratory University of California floyd@eelblgov van@eelblgov To appear in the August 1993 IEEE/ACM

More information

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected 1. Introduction Traditionally, a high bandwidth file system comprises a supercomputer with disks connected by a high speed backplane bus such as SCSI [3][4] or Fibre Channel [2][67][71]. These systems

More information

Let It Flow Resilient Asymmetric Load Balancing with Flowlet Switching

Let It Flow Resilient Asymmetric Load Balancing with Flowlet Switching Let It Flow Resilient Asymmetric Load Balancing with Flowlet Switching Erico Vanini*, Rong Pan*, Mohammad Alizadeh, Parvin Taheri*, Tom Edsall* * Load Balancing in Data Centers Multi-rooted tree 1000s

More information

Balancing the Migration of Virtual Network Functions with Replications in Data Centers

Balancing the Migration of Virtual Network Functions with Replications in Data Centers Balancing the Migration of Virtual Network Functions with Replications in Data Centers Francisco Carpio, Admela Jukan and Rastin Pries Technische Universität Braunschweig, Germany, Nokia Bell Labs, Germany

More information

SaaS Providers. ThousandEyes for. Summary

SaaS Providers. ThousandEyes for. Summary USE CASE ThousandEyes for SaaS Providers Summary With Software-as-a-Service (SaaS) applications rapidly replacing onpremise solutions, the onus of ensuring a great user experience for these applications

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

different problems from other networks ITU-T specified restricted initial set Limited number of overhead bits ATM forum Traffic Management

different problems from other networks ITU-T specified restricted initial set Limited number of overhead bits ATM forum Traffic Management Traffic and Congestion Management in ATM 3BA33 David Lewis 3BA33 D.Lewis 2007 1 Traffic Control Objectives Optimise usage of network resources Network is a shared resource Over-utilisation -> congestion

More information

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review Ananta: Cloud Scale Load Balancing Nitish Paradkar, Zaina Hamid EECS 589 Paper Review 1 Full Reference Patel, P. et al., " Ananta: Cloud Scale Load Balancing," Proc. of ACM SIGCOMM '13, 43(4):207-218,

More information

Modelling and Analysis of Push Caching

Modelling and Analysis of Push Caching Modelling and Analysis of Push Caching R. G. DE SILVA School of Information Systems, Technology & Management University of New South Wales Sydney 2052 AUSTRALIA Abstract: - In e-commerce applications,

More information

Random Early Detection Gateways for Congestion Avoidance

Random Early Detection Gateways for Congestion Avoidance Random Early Detection Gateways for Congestion Avoidance Sally Floyd and Van Jacobson Lawrence Berkeley Laboratory University of California floyd@eelblgov van@eelblgov To appear in the August 1993 IEEE/ACM

More information

Edge versus Host Pacing of TCP Traffic in Small Buffer Networks

Edge versus Host Pacing of TCP Traffic in Small Buffer Networks Edge versus Host Pacing of TCP Traffic in Small Buffer Networks Hassan Habibi Gharakheili 1, Arun Vishwanath 2, Vijay Sivaraman 1 1 University of New South Wales (UNSW), Australia 2 University of Melbourne,

More information

THE BUSINESS POTENTIAL OF NFV/SDN FOR TELECOMS

THE BUSINESS POTENTIAL OF NFV/SDN FOR TELECOMS WHITE PAPER THE BUSINESS POTENTIAL OF NFV/SDN FOR TELECOMS WHAT YOU WILL LEARN What are the potential benefits of implementing Network Function Virtualization (NFV) and software-defined networking (SDN)?

More information

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging Catalogic DPX TM 4.3 ECX 2.0 Best Practices for Deployment and Cataloging 1 Catalogic Software, Inc TM, 2015. All rights reserved. This publication contains proprietary and confidential material, and is

More information

Increase-Decrease Congestion Control for Real-time Streaming: Scalability

Increase-Decrease Congestion Control for Real-time Streaming: Scalability Increase-Decrease Congestion Control for Real-time Streaming: Scalability Dmitri Loguinov City University of New York Hayder Radha Michigan State University 1 Motivation Current Internet video streaming

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Streaming Video and TCP-Friendly Congestion Control

Streaming Video and TCP-Friendly Congestion Control Streaming Video and TCP-Friendly Congestion Control Sugih Jamin Department of EECS University of Michigan jamin@eecs.umich.edu Joint work with: Zhiheng Wang (UofM), Sujata Banerjee (HP Labs) Video Application

More information

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Presenter: Haiying Shen Associate professor *Department of Electrical and Computer Engineering, Clemson University,

More information

Casa Systems Axyom Multiservice Router

Casa Systems Axyom Multiservice Router Solution Brief Casa Systems Axyom Multiservice Router Solving the Edge Network Challenge To keep up with broadband demand, service providers have used proprietary routers to grow their edge networks. Cost

More information

Load Sharing in Peer-to-Peer Networks using Dynamic Replication

Load Sharing in Peer-to-Peer Networks using Dynamic Replication Load Sharing in Peer-to-Peer Networks using Dynamic Replication S Rajasekhar, B Rong, K Y Lai, I Khalil and Z Tari School of Computer Science and Information Technology RMIT University, Melbourne 3, Australia

More information

S tateless H ardware- E nabled L oad-aware L oad-balancing SHELL: Stateless Load-Aware Load Balancing in P4

S tateless H ardware- E nabled L oad-aware L oad-balancing SHELL: Stateless Load-Aware Load Balancing in P4 S tateless H ardware- E nabled L oad-aware L oad-balancing SHELL: Stateless Load-Aware Load Balancing in P4 Benoît Pit-Claudel*, Yoann Desmouceaux*, Pierre Pfister, Mark Townsley *, Thomas Clausen* *École

More information

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What

More information

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Junguk Cho, Hyunseok Chang, Sarit Mukherjee, T.V. Lakshman, and Jacobus Van der Merwe 1 Big Data Era Big data analysis is increasingly common

More information

Towards a Robust Protocol Stack for Diverse Wireless Networks Arun Venkataramani

Towards a Robust Protocol Stack for Diverse Wireless Networks Arun Venkataramani Towards a Robust Protocol Stack for Diverse Wireless Networks Arun Venkataramani (in collaboration with Ming Li, Devesh Agrawal, Deepak Ganesan, Aruna Balasubramanian, Brian Levine, Xiaozheng Tie at UMass

More information

Routing Protocols in MANETs

Routing Protocols in MANETs Chapter 4 Routing Protocols in MANETs 4.1 Introduction The main aim of any Ad Hoc network routing protocol is to meet the challenges of the dynamically changing topology and establish a correct and an

More information

Casa Systems Axyom Multiservice Router

Casa Systems Axyom Multiservice Router Solution Brief Casa Systems Axyom Multiservice Router Solving the Edge Network Challenge To keep up with broadband demand, service providers have used proprietary routers to grow their edge networks. Cost

More information

Response-Time Technology

Response-Time Technology Packeteer Technical White Paper Series Response-Time Technology May 2002 Packeteer, Inc. 10495 N. De Anza Blvd. Cupertino, CA 95014 408.873.4400 info@packeteer.com www.packeteer.com Company and product

More information

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1 What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................

More information

UNIVERSITY OF CAGLIARI

UNIVERSITY OF CAGLIARI UNIVERSITY OF CAGLIARI DIEE - Department of Electrical and Electronic Engineering Infrastrutture ed Applicazioni Avanzate nell Internet SDN: Control Plane ACK: content taken from Foundations of Modern

More information

Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks

Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks Dr. Vinod Vokkarane Assistant Professor, Computer and Information Science Co-Director, Advanced Computer Networks Lab University

More information

Congestion Control in Communication Networks

Congestion Control in Communication Networks Congestion Control in Communication Networks Introduction Congestion occurs when number of packets transmitted approaches network capacity Objective of congestion control: keep number of packets below

More information

Transparent Edge Gateway for Mobile Networks

Transparent Edge Gateway for Mobile Networks Transparent Edge Gateway for Mobile Networks Ashkan Aghdai *, Mark Huang, David Dai, Yang Xu *, and Jonathan Chao * * NYU Tandon School of Engineering Huawei Technologies 1st P4 European Workshop (P4EU)

More information

EECS 428 Final Project Report Distributed Real-Time Process Control Over TCP and the Internet Brian Robinson

EECS 428 Final Project Report Distributed Real-Time Process Control Over TCP and the Internet Brian Robinson EECS 428 Final Project Report Distributed Real-Time Process Control Over TCP and the Internet Brian Robinson 1.0 Introduction Distributed real-time process control, from a network communication view, involves

More information

Scheduling Strategies for Processing Continuous Queries Over Streams

Scheduling Strategies for Processing Continuous Queries Over Streams Department of Computer Science and Engineering University of Texas at Arlington Arlington, TX 76019 Scheduling Strategies for Processing Continuous Queries Over Streams Qingchun Jiang, Sharma Chakravarthy

More information

15-744: Computer Networking. Data Center Networking II

15-744: Computer Networking. Data Center Networking II 15-744: Computer Networking Data Center Networking II Overview Data Center Topology Scheduling Data Center Packet Scheduling 2 Current solutions for increasing data center network bandwidth FatTree BCube

More information

Network Load Balancing Methods: Experimental Comparisons and Improvement

Network Load Balancing Methods: Experimental Comparisons and Improvement Network Load Balancing Methods: Experimental Comparisons and Improvement Abstract Load balancing algorithms play critical roles in systems where the workload has to be distributed across multiple resources,

More information

OVER the last few years, significant efforts have. Flexible Traffic Splitting in OpenFlow Networks

OVER the last few years, significant efforts have. Flexible Traffic Splitting in OpenFlow Networks IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 13, NO. 3, SEPTEMBER 2016 407 Flexible Traffic Splitting in OpenFlow Networks Daphne Tuncer, Marinos Charalambides, Stuart Clayman, and George

More information

6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long

6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long 6.033 Spring 2015 Lecture #11: Transport Layer Congestion Control Hari Balakrishnan Scribed by Qian Long Please read Chapter 19 of the 6.02 book for background, especially on acknowledgments (ACKs), timers,

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Lightstreamer. The Streaming-Ajax Revolution. Product Insight

Lightstreamer. The Streaming-Ajax Revolution. Product Insight Lightstreamer The Streaming-Ajax Revolution Product Insight 1 Agenda Paradigms for the Real-Time Web (four models explained) Requirements for a Good Comet Solution Introduction to Lightstreamer Lightstreamer

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Stateless Datacenter Load Balancing with Beamer

Stateless Datacenter Load Balancing with Beamer Stateless Datacenter Load Balancing with Beamer Vladimir Olteanu, Alexandru Agache, Andrei Voinescu, Costin Raiciu University Politehnica of Bucharest Thanks to Datacenter load balancing Datacenter load

More information

A Low Latency Data Transmission Scheme for Smart Grid Condition Monitoring Applications 28/05/2012

A Low Latency Data Transmission Scheme for Smart Grid Condition Monitoring Applications 28/05/2012 1 A Low Latency Data Transmission Scheme for Smart Grid Condition Monitoring Applications I R F A N S. A L - A N B A G I, M E L I K E E R O L - K A N T A R C I, H U S S E I N T. M O U F T A H U N I V E

More information

Episode 5. Scheduling and Traffic Management

Episode 5. Scheduling and Traffic Management Episode 5. Scheduling and Traffic Management Part 3 Baochun Li Department of Electrical and Computer Engineering University of Toronto Outline What is scheduling? Why do we need it? Requirements of a scheduling

More information

Software-Defined Networking (SDN) Overview

Software-Defined Networking (SDN) Overview Reti di Telecomunicazione a.y. 2015-2016 Software-Defined Networking (SDN) Overview Ing. Luca Davoli Ph.D. Student Network Security (NetSec) Laboratory davoli@ce.unipr.it Luca Davoli davoli@ce.unipr.it

More information

Scheduling of Multiple Applications in Wireless Sensor Networks Using Knowledge of Applications and Network

Scheduling of Multiple Applications in Wireless Sensor Networks Using Knowledge of Applications and Network International Journal of Information and Computer Science (IJICS) Volume 5, 2016 doi: 10.14355/ijics.2016.05.002 www.iji-cs.org Scheduling of Multiple Applications in Wireless Sensor Networks Using Knowledge

More information

Loopback: Exploiting Collaborative Caches for Large-Scale Streaming

Loopback: Exploiting Collaborative Caches for Large-Scale Streaming Loopback: Exploiting Collaborative Caches for Large-Scale Streaming Ewa Kusmierek Yingfei Dong David Du Poznan Supercomputing and Dept. of Electrical Engineering Dept. of Computer Science Networking Center

More information

II. Principles of Computer Communications Network and Transport Layer

II. Principles of Computer Communications Network and Transport Layer II. Principles of Computer Communications Network and Transport Layer A. Internet Protocol (IP) IPv4 Header An IP datagram consists of a header part and a text part. The header has a 20-byte fixed part

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

A Study of Cache-Based IP Flow Switching

A Study of Cache-Based IP Flow Switching University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science November 2000 A Study of Cache-Based IP Flow Switching Osman Ertugay University of Pennsylvania

More information

IN this letter we focus on OpenFlow-based network state

IN this letter we focus on OpenFlow-based network state 1 On the Impact of Networ State Collection on the Performance of SDN Applications Mohamed Aslan, and Ashraf Matrawy Carleton University, ON, Canada Abstract Intelligent and autonomous SDN applications

More information

COMMVAULT. Enabling high-speed WAN backups with PORTrockIT

COMMVAULT. Enabling high-speed WAN backups with PORTrockIT COMMVAULT Enabling high-speed WAN backups with PORTrockIT EXECUTIVE SUMMARY Commvault offers one of the most advanced and full-featured data protection solutions on the market, with built-in functionalities

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Maintaining Mutual Consistency for Cached Web Objects

Maintaining Mutual Consistency for Cached Web Objects Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University

More information

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015 Low-Latency Datacenters John Ousterhout Platform Lab Retreat May 29, 2015 Datacenters: Scale and Latency Scale: 1M+ cores 1-10 PB memory 200 PB disk storage Latency: < 0.5 µs speed-of-light delay Most

More information

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa EPL646: Advanced Topics in Databases Christos Hadjistyllis

More information

DiffServ Architecture: Impact of scheduling on QoS

DiffServ Architecture: Impact of scheduling on QoS DiffServ Architecture: Impact of scheduling on QoS Abstract: Scheduling is one of the most important components in providing a differentiated service at the routers. Due to the varying traffic characteristics

More information

IP Mobility vs. Session Mobility

IP Mobility vs. Session Mobility IP Mobility vs. Session Mobility Securing wireless communication is a formidable task, something that many companies are rapidly learning the hard way. IP level solutions become extremely cumbersome when

More information

소프트웨어기반고성능침입탐지시스템설계및구현

소프트웨어기반고성능침입탐지시스템설계및구현 소프트웨어기반고성능침입탐지시스템설계및구현 KyoungSoo Park Department of Electrical Engineering, KAIST M. Asim Jamshed *, Jihyung Lee*, Sangwoo Moon*, Insu Yun *, Deokjin Kim, Sungryoul Lee, Yung Yi* Department of Electrical

More information