Programmable Host-Network Traffic Management


Paper #42 (6 pages)

ABSTRACT

Applications running in modern data centers interact with the underlying network in complex ways, forcing administrators to continuously monitor and tune the system. However, today's traffic-management solutions are limited by the artificial division between the hosts and the network. While switches collect only coarse-grained statistics about traffic load, the end hosts can monitor individual connections, including TCP statistics and socket logs. This paper proposes the HONE architecture for joint HOst-NEtwork traffic management. Rather than design one specific management solution, HONE is a programmable platform that lowers the barrier to deploying new techniques. The programmer specifies measurement queries (using an SQL-like syntax), analysis operations (using functional streaming operators), and control actions. The HONE run-time system automatically partitions queries and analysis across multiple host agents, with queries running against virtual tables that are materialized lazily. A controller combines the results for further analysis and reconfigures the hosts and switches accordingly. We demonstrate the efficiency and expressive power of our HONE prototype through two example management applications.

1. INTRODUCTION

Modern data centers run a wide variety of applications that generate large amounts of traffic. These applications have a complex relationship with the underlying network, involving varying traffic patterns, elephant flows that overload certain paths, TCP incast, and suboptimal TCP parameters. To optimize application performance, network administrators perform a variety of traffic-management tasks, such as diagnosing performance problems and adapting the configuration of routing protocols, server load balancers, and traffic shapers. These tasks change over time, based on new insights into application requirements and performance bottlenecks.
However, today's traffic-management solutions are constrained by the artificial division between the hosts and the network.

Figure 1: Three stages of traffic management. (Server load balancing: measure server utilization and incoming request rates; compute the total request rate and a target distribution of requests; reconfigure load-balancing policies to enforce the target. Elephant-flow scheduling: measure socket-backlog traffic demand; detect elephant flows and compute routes for them; install routing rules in the network.)

Solutions that run in the network have limited visibility into the application traffic. Due to limited CPU and memory resources, network devices cannot collect connection-level statistics. Without access to the application and transport layers, the network cannot easily infer the causes of performance problems or the backlog of traffic waiting to enter the network. The end host is in a much better position to collect fine-grained measurements, as well as logs of TCP statistics and socket calls. Yet, the hosts do not have visibility into the network paths (e.g., to correlate data from multiple connections and hosts) or control over the network (e.g., to change the routing configuration).

Instead, we advocate joint host/network management that harnesses the power of the hosts to collect and analyze fine-grained measurements, and to perform basic control operations like rate limiting. However, administrators cannot settle on a single traffic-management solution in advance: the best approach can easily change as the application mix and network design evolve. In addition, multiple approaches should coexist, such as diagnosing performance problems, tuning a load balancer to distribute application requests, and performing traffic engineering to alleviate congestion. As such, a good architecture should be programmable, offering an expressive interface for writing traffic-management tasks while minimizing the barrier to deploying new solutions.
Most traffic-management tasks share a common pattern: a three-stage pipeline of measurement, analysis, and control, as shown in Figure 1. For example, the load-balancing task distributes incoming requests across multiple server replicas. The first stage measures the request rate and resource utilization (e.g., CPU, memory, and bandwidth) at each server. Next, the second stage combines the measurement results to estimate the total request rate and compute a target distribution of requests over the servers. Then, the final stage reconfigures the load balancers (or the underlying switches) to divide future requests based on the new distribution.

Figure 2: Overview of the HONE system. (The programmer writes a management program; the HONE runtime system on the controller partitions it across HONE agents running in each host's OS, alongside the applications, and across the network.)

These kinds of tasks can run on our HONE system for host/network traffic management, as shown in Figure 2. The programmer writes a management program that specifies measurement queries, analysis operations, and control actions. At run time, the controller partitions the measurement and analysis across multiple host agents, which send aggregate results to the controller for further analysis. Based on these results, the program can change the host and switch configuration, triggering the controller to send the appropriate commands to each device. This offers a simple, centralized programming model, while capitalizing on the inherent parallelism of having multiple hosts and switches.

Two key technical contributions underlie HONE:

Expressive yet efficient programming framework: HONE integrates a high-level, domain-specific query language directly with a powerful data-parallel analysis framework and a reactive network control engine. Both the query language and the analysis framework are designed to have simple, sequential semantics, and yet they admit a correct implementation in which the work is distributed across a network of devices.
In other words, HONE enables the programmer to think locally yet have the implementation act globally, distributing the bulk of the computation over a network of hosts before aggregating the final results on the controller.

Lazily materialize measurement data: Monitoring systems typically require operators to define a priori the measurement data to collect, forcing them to choose between efficiency and visibility. Instead, HONE presents the programmer with the illusion of rich, relational database tables with fine-grained resource statistics about applications, connections, paths, and links, but only materializes table rows when and if they are needed by one or more queries. This makes HONE appropriate for (a possibly changing set of) both long-lived monitoring and short-lived diagnostic applications, while collecting the minimal set of necessary data.

2. HOST MONITORING & CONTROL

End hosts offer tremendous visibility into the application and transport layers, as well as fine-grained control over traffic before it enters the network. However, changing application software would compromise the generality of the traffic-management system. Instead, HONE operates just below the application, by monitoring socket calls and TCP connection statistics, and by scheduling, shaping, and marking traffic based on packet classifiers.

2.1 Monitoring: Socket Interface and TCP

HONE provides fine-grained traffic statistics by monitoring individual connections:

Socket interface: The host agent intercepts the socket calls issued by applications. It records the 4-tuple parameters of sockets when an application opens or closes them (via connect, accept, or close). This enables HONE to associate socket-level activities with connection statistics in the transport layer. Furthermore, HONE intercepts read and write socket calls to record the number of bytes read or written.
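The socket-call interception just described has a rough user-space analogy. The sketch below wraps a Python socket object and records the byte counts passing through send and recv, much as the host agent records bytes for read and write calls; CountingSocket is a hypothetical helper for illustration, not part of HONE.

```python
import socket

class CountingSocket:
    """Wrap a socket and record bytes passed through send/recv,
    mimicking (in user space) the interception that HONE's kernel
    module performs on socket system calls."""
    def __init__(self, sock):
        self._sock = sock
        self.bytes_written = 0
        self.bytes_read = 0

    def send(self, data):
        n = self._sock.send(data)
        self.bytes_written += n  # count only bytes actually sent
        return n

    def recv(self, bufsize):
        data = self._sock.recv(bufsize)
        self.bytes_read += len(data)
        return data

    def __getattr__(self, name):
        # Delegate everything else (connect, close, ...) to the real socket
        return getattr(self._sock, name)

# A local socket pair stands in for a real TCP connection
a, b = socket.socketpair()
wrapped = CountingSocket(a)
wrapped.send(b"hello")
assert b.recv(16) == b"hello"
print(wrapped.bytes_written)  # 5
```

The real agent does this in the kernel, so unmodified applications are covered automatically; a wrapper like this only observes sockets it is explicitly handed.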
Monitoring at the socket interface provides fine-grained information about application behavior that the network cannot collect.

TCP stack: The host agent uses Web10G [2] to collect TCP statistics from the Linux network stack. These statistics reveal the performance of each connection with much lower overhead than packet monitoring. The TCP statistics in HONE fall into two categories: (i) instantaneous snapshots (e.g., the congestion window or the number of bytes in the send buffer) and (ii) cumulative counters (e.g., the fraction of time the connection is limited by the receiver window). These statistics are invaluable for diagnosing network performance problems [13]. The host agent can also collect the CPU and memory usage of applications, which helps in understanding how applications behave.

2.2 Control: Classify, Shape, and Schedule

Since end-host information can associate connections (or sockets) with a running process, HONE supports traffic-management tasks that precisely target specific applications. This connection management can occur either on the end host or in the network. In addition to using OS utilities to limit applications' CPU or memory usage, HONE currently supports several connection-oriented control operations:

Connection classification: The host can identify connections by application- and socket-layer information. Notifying the network about those connections enables the switches to customize QoS and routing policies for this traffic.

Traffic shaping: Shaping (groups of) connections can limit the rate of traffic entering the network, to reduce or prevent congestion.

Packet scheduling: OS-level packet-scheduling mechanisms, such as fair queuing, can ensure a fair allocation of bandwidth across multiple connections in a group, or between groups of connections.

Using these mechanisms, for example, the hosts can use the size of the socket-level backlog to identify elephant flows. The hosts can then either shape the flows' traffic in the OS, or notify the network of the flows' identities so it can take traffic-engineering actions.

3. PROGRAMMING FRAMEWORK

The HONE programming model supports the construction of the measurement, analysis, and control phases of a typical traffic-management application. Measurement consists of SQL-like queries on virtual tables of connection and server statistics. To reduce overhead, statistics are not collected unless they appear in at least one query. Analysis uses functional data-parallel operators like map, filter, and reduce. With these operators, it is possible to write simple, high-level, deterministic algorithms that the run-time system can automatically parallelize and distribute over the host agents and network. Our preliminary design of the control phase consists of simple if-then-action rules for configuring hosts and switches.
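The measure/analyze/control composition behind this design can be modeled in a few lines. The following is a hypothetical, minimal model of pipeline chaining via Python's right-shift operator (not HONE's actual implementation); the stage names and toy data are invented for illustration.

```python
class Stage:
    """A pipeline stage wrapping a function; `>>` chains stages,
    mirroring a measurement >> analysis >> control composition.
    (Illustrative model only, not HONE's API.)"""
    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other):
        # Compose: feed this stage's output into the next stage
        return Stage(lambda x: other.fn(self.fn(x)))

    def run(self, x):
        return self.fn(x)

# Toy phases: measure produces (host, backlog-KB) rows,
# analysis keeps hosts with large backlogs, control emits actions
measure = Stage(lambda _: [("h1", 250), ("h2", 40)])
analyze = Stage(lambda rows: [r for r in rows if r[1] > 100])
control = Stage(lambda rows: ["rate-limit " + h for h, _ in rows])

pipeline = measure >> analyze >> control
print(pipeline.run(None))  # ['rate-limit h1']
```

The appeal of this shape is that each stage stays a pure function, so a runtime is free to decide later where each stage executes.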
3.1 Measurement: SQL on Virtual Tables

The hosts and switches could conceivably collect a wide range of measurement data, but the overhead may be prohibitive. As such, HONE presents the programmer with the illusion of database tables with fine-grained statistics, but only collects the data that actually appears in queries. The virtual tables follow naturally from the protocol layers, as shown in Table 1, with the first two collected from the hosts and the last two collected from the switches. Conceptually, a unique instance of each of these virtual tables exists at each moment in time, though a (part of a) table is only materialized if and when a query needs the information.

  Table Name    Row                    Columns
  Applications  A process              Host ID, PID, application name, CPU/memory usage, etc.
  Connections   A connection           PID, TCP/UDP five-tuple, end-point and connection stats, etc.
  Paths         A path                 Sequence of links, and flows on the path
  Links         A unidirectional link  IDs of the two ends, capacity, and utilization, etc.

Table 1: Representations of virtual tables

  Query     := Select(Stats), From(Table), Where(Criterion), Groupby(Stat), Every(Interval)
  Table     := Applications | Connections | Paths | Links
  Stats     := columns of Table
  Interval  := integer in seconds or milliseconds
  Criterion := Stat OpL value
  OpL       := > | < | ≥ | ≤

Table 2: SQL-like measurement query syntax

For querying the data, HONE offers programmers a familiar, SQL-like syntax, as shown in Table 2; more sophisticated manipulation of the data takes place in the analysis phase. The result of any query is a set-of-streams abstraction, where each element of each stream is a table of queried values. Each stream in the set is generated and analyzed on a separate host, but programmers operate over the set-of-streams abstraction as if it were an ordinary controller-local data type. To illustrate the key features of the HONE programming paradigm, we build a simple application for detecting elephant flows.
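As an aside, the query grammar of Table 2 can be sketched as plain Python constructors. Everything below, including the Clause tuple and the validate helper, is an illustrative assumption about how such a grammar might be represented, not HONE's internal representation.

```python
from collections import namedtuple

# Hypothetical modeling of Table 2's grammar: a query is a tuple of
# clauses, each clause a tagged pair. (Representation is invented.)
Clause = namedtuple("Clause", ["kind", "args"])

TABLES = {"Applications", "Connections", "Paths", "Links"}

def Select(stats):   return Clause("Select", stats)
def From(table):     return Clause("From", table)
def Where(crits):    return Clause("Where", crits)
def Groupby(stats):  return Clause("Groupby", stats)
def Every(seconds):  return Clause("Every", seconds)

def validate(query):
    """Check minimal well-formedness rules implied by Table 2:
    exactly one From over a known table, plus a polling interval."""
    kinds = [c.kind for c in query]
    assert kinds.count("From") == 1, "need exactly one From clause"
    frm = next(c for c in query if c.kind == "From")
    assert frm.args in TABLES, "unknown table: %s" % frm.args
    assert "Every" in kinds, "need a measurement interval"
    return True

q = (Select(["SrcIp", "DstIp", "BytesWritten", "BytesSent"]),
     From("Connections"),
     Where([("app", "==", "A")]),
     Every(1))
print(validate(q))  # True
```

A tagged-clause encoding like this keeps queries as inert data, which is what lets a runtime inspect them to decide which statistics to materialize.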
Upon detection, these flows will be routed based on the current traffic matrix and network topology. The first query we require follows:

    def ElephantQuery():
        return (
            Select([SrcIp, DstIp, SrcPort, DstPort, BytesWritten, BytesSent]),
            From(Connections),
            Where([app == A]),
            Every(Seconds 1) )

This query probes the virtual Connections table, filters on the constraint that the application (app) is A, and selects the table columns SrcIp, DstIp, etc. Executing this query triggers periodic, parallel measurement of the statistics named in the Select clause on each host. This generates a set of per-host streams in which each element is a table with connections as rows and the named statistics as columns. In the analysis phase, shown later, the per-connection BytesWritten and BytesSent will be used to detect elephant flows.

The next query collects the traffic data. It uses the Groupby construct to convert each table in the per-host stream into an association list of tables indexed by a (srchost, dsthost) pair. Such association lists form the elements of each per-host stream generated by this query.

    def TrafficMatrixQuery():
        return (
            Select([srchost, dsthost, BytesSent]),
            From(Connections),
            Groupby([srchost, dsthost]),
            Every(Seconds 1) )

The last query generates a stream of tables containing network topology information:

    def LinkQuery():
        return (
            Select([BeginDevice, EndDevice, Capacity]),
            From(Links),
            Every(Seconds 1) )

Together, these three queries suffice to construct our elephant-flow application. They also illustrate the variety of statistics that our system can collect from both hosts and network, all within the same uniform programming model.

3.2 Analysis: Streaming Operators

The analysis phase processes the set of streams of tabular data generated by the measurement phase. Programmers need not worry that the streams are distributed across the hosts and network; the system implementation automatically sets up the proper distributed rendezvous between the query and analysis code. To support the dual goals of a distributed implementation with simple, local, deterministic semantics, we have designed the data-analysis phase as a functional data-parallel processing language with the following operators (among others):

MapSet(f): Apply function f to every element of every stream in the set of streams, producing a new set of streams.

FilterSet(f): Create a new set of streams that omits stream elements e for which f(e) is false.

ReduceSet(f, i): Fold f across each element of each stream in the set, using i as an initializer. (Fold here is the standard higher-order function from functional programming.) In other words, it generates a new set of streams where f(...f(f(i, e_1), e_2)..., e_n) is the n-th element of each stream when e_1, e_2, ..., e_n were the original first n elements of the stream.

MergeHosts(): Merge a set of streams on the hosts into one single stream.

MapSet, FilterSet, and ReduceSet operate in parallel on every host; MergeHosts() merges the results of a distributed set of end-host analysis streams into a single stream on the controller, for further data analysis or reactive control. To construct a program from a collection of queries and analysis operators, the programmer simply pipes the result from one query or operation into the next using the >> operator. HONE also supplies a variety of other library functions for mapping, filtering, and reducing over individual streams and tables.

As an example, consider our elephant-flow application again. Below, following Curtis et al. [5], we find the elephant flows by defining the function IsElephant to select the connections for which the difference between bytes written (bw) and bytes sent (bs) is greater than 100KB. DetectElephant selects those rows of the table that satisfy the condition. Finally, EStream is constructed by piping the ElephantQuery defined in the previous subsection into the elephant-flow filter, which is mapped in parallel over all tables generated by all hosts. The final results are aggregated on the controller by MergeHosts. The main point is that this powerful, distributed computation is achieved by composing just a few simple, high-level operators with user-defined functions.
    def IsElephant(row):
        [sip, dip, sp, dp, bw, bs] = row
        return (bw - bs > 100)

    def DetectElephant(table):
        return FilterTable(IsElephant, table)

    EStream = ElephantQuery() >>
              MapSet(DetectElephant) >>
              MergeHosts()

A second part of our analysis builds a network topology from each successive link table produced by LinkQuery, which is abstracted as a single data stream from the in-network measurement (the auxiliary BuildTopo function is not shown):

    TopoStream = LinkQuery() >> MapStream(BuildTopo)

The last analysis task computes a traffic matrix. For this purpose we use the ReduceSet operator. It takes as an argument a function CalcTM which, for each new element of a per-host stream, computes the throughput of all connections and sums them by (srchost, dsthost) pairs. A second auxiliary function, AggTM, joins the partial traffic matrices computed per host into a network-wide matrix.
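The ReduceSet used here is, per stream, a running fold: the n-th output element is the fold of the first n input elements. The following self-contained sketch of that semantics is illustrative code, not from HONE; the stream contents and merge_counts helper are invented.

```python
from itertools import accumulate

def reduce_stream(f, init, stream):
    """Running fold: the n-th output is f(...f(f(init, e1), e2)..., en),
    matching ReduceSet's per-stream semantics."""
    return list(accumulate(stream, f, initial=init))[1:]  # drop bare init

# Toy per-host stream: bytes sent per interval, keyed by destination host
stream = [{"h2": 10}, {"h2": 5, "h3": 7}, {"h3": 1}]

def merge_counts(acc, elem):
    # Sum the new interval's counts into the accumulated totals
    out = dict(acc)
    for k, v in elem.items():
        out[k] = out.get(k, 0) + v
    return out

print(reduce_stream(merge_counts, {}, stream))
# [{'h2': 10}, {'h2': 15, 'h3': 7}, {'h2': 15, 'h3': 8}]
```

Because every prefix of the stream yields a well-defined partial result, each host can emit running aggregates continuously rather than waiting for the stream to end.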

The per-host query and analysis occur in parallel on each end host; global traffic-matrix synthesis occurs on the centralized controller.

    TMStream = TrafficMatrixQuery() >>
               ReduceSet(CalcTM, [[],[]]) >>
               MergeHosts() >>
               MapStream(AggTM)

3.3 Control Policies

The final phase of a HONE program generates control policies that may change network routing behavior, help balance load, limit the sending rate of end hosts, or balance application resource utilization. These policies are defined as a set of predicate-action pairs. The predicate determines when and where the action will occur. The actions enforce traffic rate limits, forward certain traffic flows over specific paths, or limit the CPU and memory usage of applications. A policy is enabled by piping it into the RegisterPolicy operator. The HONE run-time system processes such policies and implements their actions on the specified end hosts and network switches.

To complete the elephant-flow example, we assume the existence of a function Schedule that, for each elephant flow, computes the path with the maximum residual capacity and registers packet-forwarding rules in the switches to direct the flow over this path. Combining all three pieces, the complete program merges the streams from the measurement and analysis phases, runs Schedule over the result, and registers the policy:

    def main():
        MergeStreams([TopoStream, TMStream, EStream]) >>
        MapStream(Schedule) >>
        RegisterPolicy()

HONE's run-time system automatically partitions this program into a distributed execution across hosts, network, and controller, as shown in Figure 3.

4. EVALUATION OF HONE PROTOTYPE

In this section, we briefly describe our initial HONE prototype, along with experiments evaluating the overhead of the host agent and controller platform.

4.1 HONE Prototype

Our HONE prototype, implemented in Python, follows the architecture in Figure 2.
On the controller, a programming library defines the representations of streams and operators, and uses a push-based functional reactive approach [7] to process the streams. The run-time system uses the positions of the functional operators to drive program partitioning, and it runs a directory service that discovers the topology and the locations of applications.

The host agent consists of a kernel module (implemented in C for Linux) and a manager. The event-driven kernel module intercepts socket system calls to collect socket-interface statistics, such as the 4-tuple of the TCP sockets opened by applications. The manager schedules connection monitoring and the execution plans of multiple management tasks, in order to minimize overhead. When multiple queries refer to the same statistic, the agent synchronizes the monitoring to collect each statistic just once. When a large number of connections are active, determining which ones to measure may itself incur high overhead. The manager therefore first filters connections using socket statistics, which the kernel module can measure cheaply. The remaining connections are then checked against the Where clause of the query, which may trigger polling-based TCP data collection. These techniques lower the overhead of the host agent, making HONE efficient.

The network module is implemented using Frenetic [7] running on top of the NOX [8] OpenFlow controller. This module measures routing and link statistics and enforces the traffic-control policies. The choice of OpenFlow is convenient but not essential; HONE could use other approaches for monitoring and controlling the network.

4.2 Performance Evaluation

To evaluate our system, we implemented two real traffic-management tasks. We use these examples to demonstrate both HONE's programming expressiveness and its run-time efficiency.

Elephant Flow Scheduling: We built the elephant-flow scheduling task shown in Section 3.
The program requires just 89 lines of Python code.

Distributed Rate Limiting: DRL controls the aggregate network bandwidth used by an application running on multiple hosts in pay-by-usage data centers [11]. The example task monitors the application's throughput on the hosts, and constantly updates the controller, which generates new traffic-shaping policies for the hosts. The program requires just 111 lines of Python code.

We ran our experiments using the DRL task on 51 Emulab machines (fifty end hosts and one controller) [1]. Each machine has a single 3GHz processor, 2GB of RAM, and a 100Mbps network interface, and all machines are fully connected to each other. Our experiments evaluate the overhead on the host

Figure 3: Example of program partitioning. (Host execution plans: Measure, then MapSet(DetectElephant()) for EStream and ReduceSet(CalcTM()) for TMStream, with results sent ToController. Controller execution plan: MergeHosts, MapStream(AggTM()), MapStream(BuildTopo()) over in-network measurement for TopoStream, then MergeStreams, MapStream(Schedule()), and RegisterPolicy.)

agent (as the number of connections grows) and on the controller (as the number of hosts grows). As the number of connections grows from 0 to 600, the host agent's average memory usage grows from under 70MB to around 300MB, and its average CPU usage grows from 8% to 15%. This overhead is acceptable, since machines used in data centers usually have multiple cores and much larger RAM. We can further improve the host agent's performance by optimizing the scheduling of measurement and the execution of management tasks, which we leave to future work. As the number of hosts grows from 0 to 50, the controller's memory usage stays relatively constant at just over 100MB, and its CPU usage grows from 2% to around 3%. These results show that distributing monitoring and analysis to the hosts significantly reduces load on the controller.

5. RELATED WORK

Other recent projects have sought to incorporate end hosts into network management [6, 9], but these solutions view the hosts only as software switches or as trusted execution environments for the network; they do not provide the same level of visibility or control as HONE. Prior work also adopts the stream abstraction for network traffic analysis [3, 4], but mainly focuses on extending the SQL language, whereas we use functional programming to define traffic-management mechanisms expressively. Further, some of these works [3, 10] design their programming languages around a specific problem (e.g., intrusion detection), while HONE aims for a more generic programming interface for traffic management.
Finally, several recent projects develop domain-specific programming languages [7, 12] that model the network at a higher level of abstraction for network management. HONE, in contrast, moves the management logic to the end hosts rather than solely to network devices, and our programming framework encompasses a joint host/network scenario.

6. CONCLUSION

This paper describes the design of HONE, a programmable system for data-center traffic management. By breaking the artificial division between the hosts and the network, HONE exploits fine-grained connection statistics for better traffic management. HONE provides an expressive yet efficient programming framework for defining management tasks, integrating a domain-specific query language with a powerful data-parallel analysis framework and a reactive control scheme. The framework enables the partitioning of a sequential program into a distributed execution, while end-host optimizations minimize the overhead of data measurement and analysis. Our prototype of HONE, and the two management tasks we implemented within its framework, demonstrate both its efficiency and its expressiveness.

7. REFERENCES

[1] Emulab.
[2] Web10G Project.
[3] K. Borders, J. Springer, and M. Burnside. Chimera: A Declarative Language for Streaming Network Traffic Analysis. In USENIX Security, 2012.
[4] C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: A Stream Database for Network Applications. In ACM SIGMOD, 2003.
[5] A. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. In IEEE INFOCOM, 2011.
[6] C. Dixon, H. Uppal, V. Brajkovic, D. Brandon, T. Anderson, and A. Krishnamurthy. ETTM: A Scalable Fault Tolerant Network Manager. In USENIX NSDI, 2011.
[7] N. Foster, R. Harrison, M. J. Freedman, C. Monsanto, J. Rexford, A. Story, and D. Walker. Frenetic: A Network Programming Language. In ACM ICFP, 2011.
[8] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: Towards an Operating System for Networks. ACM SIGCOMM CCR, 38(3), 2008.
[9] T. Karagiannis, R. Mortier, and A. Rowstron. Network Exception Handlers: Host-network Control in Enterprise Networks. In ACM SIGCOMM, 2008.
[10] X. Ou, S. Govindavajhala, and A. W. Appel. MulVAL: A Logic-based Network Security Analyzer. In USENIX Security, 2005.
[11] B. Raghavan, K. Vishwanath, S. Ramabhadran, K. Yocum, and A. C. Snoeren. Cloud Control with Distributed Rate Limiting. In ACM SIGCOMM, 2007.
[12] J. Sherry, D. C. Kim, S. S. Mahalingam, A. Tang, S. Wang, and S. Ratnasamy. Netcalls: End Host Function Calls to Network Traffic Processing Services. Technical Report UCB/EECS, U.C. Berkeley.
[13] M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim. Profiling Network Performance for Multi-tier Data Center Applications. In USENIX NSDI, 2011.

Peng Sun, Minlan Yu, Michael J. Freedman, Jennifer Rexford, David Walker
Princeton University and University of Southern California

Interactive Monitoring, Visualization, and Configuration of OpenFlow-Based SDN Interactive Monitoring, Visualization, and Configuration of OpenFlow-Based SDN Pedro Heleno Isolani Juliano Araujo Wickboldt Cristiano Bonato Both Lisandro Zambenedetti Granville Juergen Rochol July 16,

More information

Mininet Performance Fidelity Benchmarks

Mininet Performance Fidelity Benchmarks Mininet Performance Fidelity Benchmarks Nikhil Handigol, Brandon Heller, Vimalkumar Jeyakumar, Bob Lantz, Nick McKeown October 21, 2012 1 Introduction This initial Mininet technical report evaluates the

More information

Slicing a Network. Software-Defined Network (SDN) FlowVisor. Advanced! Computer Networks. Centralized Network Control (NC)

Slicing a Network. Software-Defined Network (SDN) FlowVisor. Advanced! Computer Networks. Centralized Network Control (NC) Slicing a Network Advanced! Computer Networks Sherwood, R., et al., Can the Production Network Be the Testbed? Proc. of the 9 th USENIX Symposium on OSDI, 2010 Reference: [C+07] Cascado et al., Ethane:

More information

Kernel Korner. Analysis of the HTB Queuing Discipline. Yaron Benita. Abstract

Kernel Korner. Analysis of the HTB Queuing Discipline. Yaron Benita. Abstract 1 of 9 6/18/2006 7:41 PM Kernel Korner Analysis of the HTB Queuing Discipline Yaron Benita Abstract Can Linux do Quality of Service in a way that both offers high throughput and does not exceed the defined

More information

OpenFlow Controllers over EstiNet Network Simulator and Emulator: Functional Validation and Performance Evaluation

OpenFlow Controllers over EstiNet Network Simulator and Emulator: Functional Validation and Performance Evaluation OpenFlow Controllers over EstiNet Network Simulator and Emulator: Functional Validation and Performance Evaluation 1 Shie-YuanWang Chih-LiangChou andchun-mingyang DepartmentofComputerScience,NationalChiaoTungUniversity,Taiwan

More information

Software-Defined Networking:

Software-Defined Networking: Software-Defined Networking: OpenFlow and Frenetic Mohamed Ismail Background Problem: Programming Networks is Hard 3/39 Network Stack Pros Key to the success of the Internet Layers and layers of abstraction

More information

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup Chapter 4 Routers with Tiny Buffers: Experiments This chapter describes two sets of experiments with tiny buffers in networks: one in a testbed and the other in a real network over the Internet2 1 backbone.

More information

YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores

YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores Swapnil Patil M. Polte, W. Tantisiriroj, K. Ren, L.Xiao, J. Lopez, G.Gibson, A. Fuchs *, B. Rinaldi * Carnegie

More information

Using libnetvirt to control the virtual network

Using libnetvirt to control the virtual network Using libnetvirt to control the virtual network Daniel Turull, Markus Hidell, Peter Sjödin KTH Royal Institute of Technology, School of ICT Kista, Sweden Email: {danieltt,mahidell,psj}@kth.se Abstract

More information

Lecture 10.1 A real SDN implementation: the Google B4 case. Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it

Lecture 10.1 A real SDN implementation: the Google B4 case. Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it Lecture 10.1 A real SDN implementation: the Google B4 case Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it WAN WAN = Wide Area Network WAN features: Very expensive (specialized high-end

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter. Glenn Judd Morgan Stanley

Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter. Glenn Judd Morgan Stanley Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter Glenn Judd Morgan Stanley 1 Introduction Datacenter computing pervasive Beyond the Internet services domain BigData, Grid Computing,

More information

QoS support for Intelligent Storage Devices

QoS support for Intelligent Storage Devices QoS support for Intelligent Storage Devices Joel Wu Scott Brandt Department of Computer Science University of California Santa Cruz ISW 04 UC Santa Cruz Mixed-Workload Requirement General purpose systems

More information

Unit 2 Packet Switching Networks - II

Unit 2 Packet Switching Networks - II Unit 2 Packet Switching Networks - II Dijkstra Algorithm: Finding shortest path Algorithm for finding shortest paths N: set of nodes for which shortest path already found Initialization: (Start with source

More information

QoS Services with Dynamic Packet State

QoS Services with Dynamic Packet State QoS Services with Dynamic Packet State Ion Stoica Carnegie Mellon University (joint work with Hui Zhang and Scott Shenker) Today s Internet Service: best-effort datagram delivery Architecture: stateless

More information

Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet

Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Traffic Characteristics of Bulk Data Transfer using TCP/IP over Gigabit Ethernet Aamir Shaikh and Kenneth J. Christensen Department of Computer Science and Engineering University of South Florida Tampa,

More information

Floodlight Controller onto Load Balancing of SDN Management

Floodlight Controller onto Load Balancing of SDN Management ISSN: 2349-3224 Volume 04 - Issue 08 August-2017 PP. 124-131 Floodlight Controller onto Load Balancing of SDN Management Mohammad Qassim 1, Mohammed Najm Abdullah 2, Abeer Tariq 3 Dept. of Comp.Science,

More information

CS268: Beyond TCP Congestion Control

CS268: Beyond TCP Congestion Control TCP Problems CS68: Beyond TCP Congestion Control Ion Stoica February 9, 004 When TCP congestion control was originally designed in 1988: - Key applications: FTP, E-mail - Maximum link bandwidth: 10Mb/s

More information

Impact of TCP Window Size on a File Transfer

Impact of TCP Window Size on a File Transfer Impact of TCP Window Size on a File Transfer Introduction This example shows how ACE diagnoses and visualizes application and network problems; it is not a step-by-step tutorial. If you have experience

More information

Frenetic: Functional Reactive Programming for Networks

Frenetic: Functional Reactive Programming for Networks Frenetic: Functional Reactive Programming for Networks Nate Foster (Cornell) Mike Freedman (Princeton) Rob Harrison (Princeton) Matthew Meola (Princeton) Jennifer Rexford (Princeton) David Walker (Princeton)

More information

Better Never than Late: Meeting Deadlines in Datacenter Networks

Better Never than Late: Meeting Deadlines in Datacenter Networks Better Never than Late: Meeting Deadlines in Datacenter Networks Christo Wilson, Hitesh Ballani, Thomas Karagiannis, Ant Rowstron Microsoft Research, Cambridge User-facing online services Two common underlying

More information

Delay Controlled Elephant Flow Rerouting in Software Defined Network

Delay Controlled Elephant Flow Rerouting in Software Defined Network 1st International Conference on Advanced Information Technologies (ICAIT), Nov. 1-2, 2017, Yangon, Myanmar Delay Controlled Elephant Flow Rerouting in Software Defined Network Hnin Thiri Zaw, Aung Htein

More information

SCREAM: Sketch Resource Allocation for Software-defined Measurement

SCREAM: Sketch Resource Allocation for Software-defined Measurement SCREAM: Sketch Resource Allocation for Software-defined Measurement (CoNEXT 15) Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat Measurement is Crucial for Network Management Network Management

More information

CONTENT DISTRIBUTION. Oliver Michel University of Illinois at Urbana-Champaign. October 25th, 2011

CONTENT DISTRIBUTION. Oliver Michel University of Illinois at Urbana-Champaign. October 25th, 2011 CONTENT DISTRIBUTION Oliver Michel University of Illinois at Urbana-Champaign October 25th, 2011 OVERVIEW 1. Why use advanced techniques for content distribution on the internet? 2. CoralCDN 3. Identifying

More information

SDN SEMINAR 2017 ARCHITECTING A CONTROL PLANE

SDN SEMINAR 2017 ARCHITECTING A CONTROL PLANE SDN SEMINAR 2017 ARCHITECTING A CONTROL PLANE NETWORKS ` 2 COMPUTER NETWORKS 3 COMPUTER NETWORKS EVOLUTION Applications evolve become heterogeneous increase in traffic volume change dynamically traffic

More information

Link Aggregation: A Server Perspective

Link Aggregation: A Server Perspective Link Aggregation: A Server Perspective Shimon Muller Sun Microsystems, Inc. May 2007 Supporters Howard Frazier, Broadcom Corp. Outline The Good, The Bad & The Ugly LAG and Network Virtualization Networking

More information

Micro load balancing in data centers with DRILL

Micro load balancing in data centers with DRILL Micro load balancing in data centers with DRILL Soudeh Ghorbani (UIUC) Brighten Godfrey (UIUC) Yashar Ganjali (University of Toronto) Amin Firoozshahian (Intel) Where should the load balancing functionality

More information

CoSwitch: A Cooperative Switching Design for Software Defined Data Center Networking

CoSwitch: A Cooperative Switching Design for Software Defined Data Center Networking CoSwitch: A Cooperative Switching Design for Software Defined Data Center Networking Yue Zhang 1, Kai Zheng 1, Chengchen Hu 2, Kai Chen 3, Yi Wang 4, Athanasios V. Vasilakos 5 1 IBM China Research Lab

More information

Incremental Update for a Compositional SDN Hypervisor

Incremental Update for a Compositional SDN Hypervisor Incremental Update for a Compositional SDN Hypervisor Xin Jin Princeton University xinjin@cs.princeton.edu Jennifer Rexford Princeton University jrex@cs.princeton.edu David Walker Princeton University

More information

Cloud e Datacenter Networking

Cloud e Datacenter Networking Cloud e Datacenter Networking Università degli Studi di Napoli Federico II Dipartimento di Ingegneria Elettrica e delle Tecnologie dell Informazione DIETI Laurea Magistrale in Ingegneria Informatica Prof.

More information

Resource Usage Monitoring for Web Systems Using Real-time Statistical Analysis of Log Data

Resource Usage Monitoring for Web Systems Using Real-time Statistical Analysis of Log Data Resource Usage Monitoring for Web Systems Using Real- Statistical Analysis of Log Data MATSUKI YOSHINO, ATSURO HANDA Software Division, Hitachi Ltd. 53, Totsuka-cho, Totsuka-ku, Yokohama, 244-8555 JAPAN

More information

Sofware Defined Networking Architecture and Openflow Network Topologies

Sofware Defined Networking Architecture and Openflow Network Topologies Sofware Defined Networking Architecture and Openflow Network Topologies Fahad Kameez, M.Tech.(VLSI and ES) Department of Electronics and Communication Rashtreeya Vidyalaya College of Engineering Bengaluru,

More information

Software-Defined Networking (SDN) Overview

Software-Defined Networking (SDN) Overview Reti di Telecomunicazione a.y. 2015-2016 Software-Defined Networking (SDN) Overview Ing. Luca Davoli Ph.D. Student Network Security (NetSec) Laboratory davoli@ce.unipr.it Luca Davoli davoli@ce.unipr.it

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

Managing Failures in IP Networks Using SDN Controllers by Adding Module to OpenFlow

Managing Failures in IP Networks Using SDN Controllers by Adding Module to OpenFlow Managing Failures in IP Networks Using SDN Controllers by Adding Module to OpenFlow Vivek S 1, Karthikayini T 2 1 PG Scholar, Department of Computer Science and Engineering, New Horizon College of Engineering,

More information

Accelerate Applications Using EqualLogic Arrays with directcache

Accelerate Applications Using EqualLogic Arrays with directcache Accelerate Applications Using EqualLogic Arrays with directcache Abstract This paper demonstrates how combining Fusion iomemory products with directcache software in host servers significantly improves

More information

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014 COMP6511A: Large-Scale Distributed Systems Windows Azure Lin Gu Hong Kong University of Science and Technology Spring, 2014 Cloud Systems Infrastructure as a (IaaS): basic compute and storage resources

More information

Presented by: Nafiseh Mahmoudi Spring 2017

Presented by: Nafiseh Mahmoudi Spring 2017 Presented by: Nafiseh Mahmoudi Spring 2017 Authors: Publication: Type: ACM Transactions on Storage (TOS), 2016 Research Paper 2 High speed data processing demands high storage I/O performance. Flash memory

More information

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa Camdoop Exploiting In-network Aggregation for Big Data Applications costa@imperial.ac.uk joint work with Austin Donnelly, Antony Rowstron, and Greg O Shea (MSR Cambridge) MapReduce Overview Input file

More information

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

Shadow: Real Applications, Simulated Networks. Dr. Rob Jansen U.S. Naval Research Laboratory Center for High Assurance Computer Systems

Shadow: Real Applications, Simulated Networks. Dr. Rob Jansen U.S. Naval Research Laboratory Center for High Assurance Computer Systems Shadow: Real Applications, Simulated Networks Dr. Rob Jansen Center for High Assurance Computer Systems Cyber Modeling and Simulation Technical Working Group Mark Center, Alexandria, VA October 25 th,

More information

Presentation_ID. 2002, Cisco Systems, Inc. All rights reserved.

Presentation_ID. 2002, Cisco Systems, Inc. All rights reserved. 1 Gigabit to the Desktop Session Number 2 Gigabit to the Desktop What we are seeing: Today s driver for Gigabit Ethernet to the Desktop is not a single application but the simultaneous use of multiple

More information

Data Center Networks. Networking Case Studies. Cloud CompuMng. Cloud CompuMng. Cloud Service Models. Cloud Service Models

Data Center Networks. Networking Case Studies. Cloud CompuMng. Cloud CompuMng. Cloud Service Models. Cloud Service Models Networking Case tudies Center Center Networks Enterprise Backbone Jennifer Rexford CO 461: Computer Networks Lectures: MW 10-10:50am in Architecture N101 Cellular hfp://www.cs.princeton.edu/courses/archive/spr12/cos461/

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Junguk Cho, Hyunseok Chang, Sarit Mukherjee, T.V. Lakshman, and Jacobus Van der Merwe 1 Big Data Era Big data analysis is increasingly common

More information

Advanced Computer Networks. Datacenter TCP

Advanced Computer Networks. Datacenter TCP Advanced Computer Networks 263 3501 00 Datacenter TCP Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Today Problems with TCP in the Data Center TCP Incast TPC timeouts Improvements

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

VeriFlow: Verifying Network-Wide Invariants in Real Time

VeriFlow: Verifying Network-Wide Invariants in Real Time VeriFlow: Verifying Network-Wide Invariants in Real Time Ahmed Khurshid, Wenxuan Zhou, Matthew Caesar, P. Brighten Godfrey Department of Computer Science University of Illinois at Urbana-Champaign 201

More information

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review Ananta: Cloud Scale Load Balancing Nitish Paradkar, Zaina Hamid EECS 589 Paper Review 1 Full Reference Patel, P. et al., " Ananta: Cloud Scale Load Balancing," Proc. of ACM SIGCOMM '13, 43(4):207-218,

More information

CSCD 433/533 Advanced Networks Spring Lecture 22 Quality of Service

CSCD 433/533 Advanced Networks Spring Lecture 22 Quality of Service CSCD 433/533 Advanced Networks Spring 2016 Lecture 22 Quality of Service 1 Topics Quality of Service (QOS) Defined Properties Integrated Service Differentiated Service 2 Introduction Problem Overview Have

More information

Providing Multi-tenant Services with FPGAs: Case Study on a Key-Value Store

Providing Multi-tenant Services with FPGAs: Case Study on a Key-Value Store Zsolt István *, Gustavo Alonso, Ankit Singla Systems Group, Computer Science Dept., ETH Zürich * Now at IMDEA Software Institute, Madrid Providing Multi-tenant Services with FPGAs: Case Study on a Key-Value

More information

SCALING SOFTWARE DEFINED NETWORKS. Chengyu Fan (edited by Lorenzo De Carli)

SCALING SOFTWARE DEFINED NETWORKS. Chengyu Fan (edited by Lorenzo De Carli) SCALING SOFTWARE DEFINED NETWORKS Chengyu Fan (edited by Lorenzo De Carli) Introduction Network management is driven by policy requirements Network Policy Guests must access Internet via web-proxy Web

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Curriculum 2013 Knowledge Units Pertaining to PDC

Curriculum 2013 Knowledge Units Pertaining to PDC Curriculum 2013 Knowledge Units Pertaining to C KA KU Tier Level NumC Learning Outcome Assembly level machine Describe how an instruction is executed in a classical von Neumann machine, with organization

More information

Optimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa

Optimizing Network Performance in Distributed Machine Learning. Luo Mai Chuntao Hong Paolo Costa Optimizing Network Performance in Distributed Machine Learning Luo Mai Chuntao Hong Paolo Costa Machine Learning Successful in many fields Online advertisement Spam filtering Fraud detection Image recognition

More information

Software-Defined Networking. Daphné Tuncer Department of Computing Imperial College London (UK)

Software-Defined Networking. Daphné Tuncer Department of Computing Imperial College London (UK) Software-Defined Networking Daphné Tuncer Department of Computing Imperial College London (UK) dtuncer@ic.ac.uk 25/10/2018 Agenda Part I: Principles of Software-Defined Networking (SDN) 1. Why a lecture

More information

Low Latency via Redundancy

Low Latency via Redundancy Low Latency via Redundancy Ashish Vulimiri, Philip Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, Scott Shenker Presenter: Meng Wang 2 Low Latency Is Important Injecting just 400 milliseconds

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

The War Between Mice and Elephants

The War Between Mice and Elephants The War Between Mice and Elephants Liang Guo and Ibrahim Matta Computer Science Department Boston University 9th IEEE International Conference on Network Protocols (ICNP),, Riverside, CA, November 2001.

More information

Software-Defined Networking (SDN)

Software-Defined Networking (SDN) EPFL Princeton University 2 5 A p r 12 Software-Defined Networking (SDN) Third-party Enables new functionality through mability 2 1 at the risk of bugs 3 Software Faults Will make communication unreliable

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

Supra-linear Packet Processing Performance with Intel Multi-core Processors

Supra-linear Packet Processing Performance with Intel Multi-core Processors White Paper Dual-Core Intel Xeon Processor LV 2.0 GHz Communications and Networking Applications Supra-linear Packet Processing Performance with Intel Multi-core Processors 1 Executive Summary Advances

More information

A Real-world Demonstration of NetSocket Cloud Experience Manager for Microsoft Lync

A Real-world Demonstration of NetSocket Cloud Experience Manager for Microsoft Lync A Real-world Demonstration of NetSocket Cloud Experience Manager for Microsoft Lync Introduction Microsoft Lync connects people everywhere as part of their everyday productivity experience. When issues

More information

Programmable Software Switches. Lecture 11, Computer Networks (198:552)

Programmable Software Switches. Lecture 11, Computer Networks (198:552) Programmable Software Switches Lecture 11, Computer Networks (198:552) Software-Defined Network (SDN) Centralized control plane Data plane Data plane Data plane Data plane Why software switching? Early

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Network Traffic Characteristics of Data Centers in the Wild. Proceedings of the 10th annual conference on Internet measurement, ACM

Network Traffic Characteristics of Data Centers in the Wild. Proceedings of the 10th annual conference on Internet measurement, ACM Network Traffic Characteristics of Data Centers in the Wild Proceedings of the 10th annual conference on Internet measurement, ACM Outline Introduction Traffic Data Collection Applications in Data Centers

More information

DevoFlow: Scaling Flow Management for High Performance Networks

DevoFlow: Scaling Flow Management for High Performance Networks DevoFlow: Scaling Flow Management for High Performance Networks SDN Seminar David Sidler 08.04.2016 1 Smart, handles everything Controller Control plane Data plane Dump, forward based on rules Existing

More information

Building Efficient and Reliable Software-Defined Networks. Naga Katta

Building Efficient and Reliable Software-Defined Networks. Naga Katta FPO Talk Building Efficient and Reliable Software-Defined Networks Naga Katta Jennifer Rexford (Advisor) Readers: Mike Freedman, David Walker Examiners: Nick Feamster, Aarti Gupta 1 Traditional Networking

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Adam Belay et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Presented by Han Zhang & Zaina Hamid Challenges

More information

Compiling Path Queries

Compiling Path Queries Compiling Path Queries Princeton University Srinivas Narayana Mina Tahmasbi Jen Rexford David Walker Management = Measure + Control Network Controller Measure Control Software-Defined Networking (SDN)

More information

Module objectives. Integrated services. Support for real-time applications. Real-time flows and the current Internet protocols

Module objectives. Integrated services. Support for real-time applications. Real-time flows and the current Internet protocols Integrated services Reading: S. Keshav, An Engineering Approach to Computer Networking, chapters 6, 9 and 4 Module objectives Learn and understand about: Support for real-time applications: network-layer

More information

Lecture 16: Data Center Network Architectures

Lecture 16: Data Center Network Architectures MIT 6.829: Computer Networks Fall 2017 Lecture 16: Data Center Network Architectures Scribe: Alex Lombardi, Danielle Olson, Nicholas Selby 1 Background on Data Centers Computing, storage, and networking

More information

DYNAMIC SERVICE CHAINING DYSCO WITH. forcing packets through middleboxes for security, optimizing performance, enhancing reachability, etc.

DYNAMIC SERVICE CHAINING DYSCO WITH. forcing packets through middleboxes for security, optimizing performance, enhancing reachability, etc. DYNAMIC SERVICE CHAINING WITH DYSCO forcing packets through es for security, optimizing performance, enhancing reachability, etc. Pamela Zave AT&T Labs Research Ronaldo A. Ferreira UFMS, Brazil Xuan Kelvin

More information

Routing-State Abstraction Based on Declarative Equivalence

Routing-State Abstraction Based on Declarative Equivalence Routing-State Abstraction Based on Declarative Equivalence Kai Gao Xin Wang Jun Bi Guohai Chen Andreas Voellmy + Y. Richard Yang + Tongji University Tsinghua University + Yale University ABSTRACT Providing

More information

Resource allocation in networks. Resource Allocation in Networks. Resource allocation

Resource allocation in networks. Resource Allocation in Networks. Resource allocation Resource allocation in networks Resource Allocation in Networks Very much like a resource allocation problem in operating systems How is it different? Resources and jobs are different Resources are buffers

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

IQ for DNA. Interactive Query for Dynamic Network Analytics. Haoyu Song. HUAWEI TECHNOLOGIES Co., Ltd.

IQ for DNA. Interactive Query for Dynamic Network Analytics. Haoyu Song.   HUAWEI TECHNOLOGIES Co., Ltd. IQ for DNA Interactive Query for Dynamic Network Analytics Haoyu Song www.huawei.com Motivation Service Provider s pain point Lack of real-time and full visibility of networks, so the network monitoring

More information

Networking Acronym Smorgasbord: , DVMRP, CBT, WFQ

Networking Acronym Smorgasbord: , DVMRP, CBT, WFQ Networking Acronym Smorgasbord: 802.11, DVMRP, CBT, WFQ EE122 Fall 2011 Scott Shenker http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica, Vern Paxson and other

More information

ORACLE DIAGNOSTICS PACK

ORACLE DIAGNOSTICS PACK ORACLE DIAGNOSTICS PACK KEY FEATURES AND BENEFITS: Automatic Performance Diagnostic liberates administrators from this complex and time consuming task, and ensures quicker resolution of performance bottlenecks.

More information

Implementation and Analysis of Large Receive Offload in a Virtualized System

Implementation and Analysis of Large Receive Offload in a Virtualized System Implementation and Analysis of Large Receive Offload in a Virtualized System Takayuki Hatori and Hitoshi Oi The University of Aizu, Aizu Wakamatsu, JAPAN {s1110173,hitoshi}@u-aizu.ac.jp Abstract System

More information

DIBS: Just-in-time congestion mitigation for Data Centers

DIBS: Just-in-time congestion mitigation for Data Centers DIBS: Just-in-time congestion mitigation for Data Centers Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, Jitendra Padhye University of Southern California Microsoft Research Summary

More information