An Open-Source Platform for Distributed Linux Software Routers


An Open-Source Platform for Distributed Linux Software Routers
Raffaele Bolla, DITEN, University of Genoa, Italy; Roberto Bruschi, CNIT, Italy

Abstract. In this paper, our main objective is to explore how Linux Software Routers (SRs) can deploy advanced and flexible paradigms for supporting novel control-plane functionalities and applications. To this end, we investigate and study a new open-source software (SW) framework: the Distributed SW ROuter Project (DROP), which aims to develop and enable a novel cooperative middleware for distributed IP-router control and management. DROP allows logical network nodes to be built through the aggregation of multiple SRs based on the Linux operating system and commodity hardware, which can be devoted to packet forwarding or control operations. In addition to the original ForCES design, DROP aims to extend router distribution and aggregation concepts by moving them to a network-wide scale to enable and support value-added services for next-generation networks.

Index Terms: SW Router, distributed router architecture, open-source software.

INTRODUCTION
The evolution of router and network architectures is one of the most relevant aspects of the Internet as we know it today, since it directly reflects the main open issues and the needs of current and upcoming network technologies and services. In a notable and sound proposal in this area, L. G. Roberts [1] showed that current network routers are too slow, costly, and power hungry. This effect is mainly caused by the architectural inefficiencies of the IPv4 protocol, which requires independent lookup operations to be performed for each incoming datagram. Starting from these considerations, Roberts proposed to evolve next-generation network protocols towards a more scalable forwarding paradigm able to handle traffic at the flow level, rather than at the packet level. Van Jacobson et al. introduced the promising concept of content-centric networking [2]. Arguing that people value the Internet for the content it contains, the authors proposed to replace the location-based routing of IP with a new communications architecture built on named data, where a packet address specifies content, not location. This novel model was specifically designed to retain the simplicity and scalability of IP, while offering much better support for security, delivery efficiency, and disruption tolerance. Both of the above-mentioned contributions gave prominence to two critical needs for Future Internet device design and development: (i) the flexible, autonomous and network-integrated support of value-added heterogeneous services beyond the classical best-effort paradigm; (ii) the management of such heterogeneous operations, to be deployed inside network devices in a scalable way to provide high performance levels. The answer to these needs can probably be summarized in two simple keywords: advanced programmability and workload distribution. Regarding the programmability aspects, router and network device architectures based on general-purpose hardware (HW) and open-source software (SW) have recently regained remarkable attention from the industrial and academic communities. Such renewed interest certainly stems from the intrinsic nature of new-generation Software Router (SR) platforms, which provide complex mixtures of flexible SW-based commodities and efficient HW-based offload functionalities, with a boundary that continuously changes over time.
HW capabilities can expose flexible header processing Application Program Interfaces (APIs), and software functionalities can rely on offload HW. An increasing number of network HW features are becoming programmable, and most commodity CPU designs are dedicating HW blocks to optimize special categories of instructions ranging from multimedia to encryption. At the same time, open-source operating systems, like Linux, offer evergrowing support for such HW enhancements and for complete network protocol stacks. In the common opinion, the primary concern with such architectures regards performance levels, which are thought to be much lower than those provided by specialized network platforms. However, in 2008 Intel [3] announced that general-purpose multi-core processors, when adopted in the data-path operations of networking devices, provide better performance levels than state-of-the-art network processors. Starting from these considerations, we want to take a step toward the evolution of open-source SRs, by moving the focus partially away from their data-plane performance analysis, to demonstrate the feasibility and viability of the SR approach with respect to commercial HW platforms. In this work, our main objective is to investigate how simple Linux SRs can scale beyond classical single-box architectures, providing advanced network capabilities and performance levels comparable to some medium-level commercial routers with multi-chassis architectures. To this end, we explore and study a new open-source SW framework: the Distributed SW ROuter Project (DROP) [4],

2 2 which aims to enable a novel distributed paradigm for control and management of flexible IP-router platforms based on commodity HW and the Linux operating system. In rough words, DROP allows building extensible and flexible multi-chassis routers on top of low-cost server HW platforms and the open-source Linux operating system. This multi-chassis router can be easily applied both as reference platform for advanced research experimentation [5], and as a low-cost alternative to medium-size commercial routers. In more detail, the DROP open-source nature allows researchers to modify any part of device functionalities in a quite easy and quick way. The adoption of commodity HW, which is in continuous and rapid evolution, assures acceptable levels of performance at very low costs. This paper is an extended version of previous contributions in conference proceedings [6] [7] [8], and it tries to give a more organic and complete overview of the DROP framework. Relevant state-of-the-art approaches are discussed and compared with DROP; the internal architecture and the embedded mechanisms are introduced in a deeper detail; working dynamics and performance evaluation results are described more thoroughly.. DROP is partially based on the IETF ForCES architecture [9] [10] [11], and it allows building logical network nodes through the aggregation of multiple devices (i.e., SRs) that host standard Linux SW objects devoted to packet forwarding (e.g., the Linux kernel, or the Click modular router, etc.) or to control operations (e.g., Quagga, Xorp, etc.). As suggested by the IETF ForCES directives, DROP provides an architectural solution that is able to orchestrate such objects. In more detail, it was specifically designed in order to (i) meet the features of Linux network system interfaces and applications, (ii) hide the complexity of distributed network nodes in an autonomic way, and (iii) offer a single control and management interface to system administrators. From a physical point of view, and as shown in Fig. 1, the DROP architecture is composed of three main building blocks: a number of devices (or elements ) running the SW for performing operations at the data- and/or controlplane; a number of interfaces toward external public networks, connecting DROP to other network devices; an internal private network that is used to exchange data-path traffic and inter-element signaling messages. DROP is specifically aimed at aggregating and coordinating all these building blocks in a single logical IP node that is able to directly manage all the public interfaces towards other external devices. The internal private network could be realized by means of different layer 2 and 2.5 technologies, such as Ethernet (currently supported), Infiniband, MPLS and OpenFlow (whose support is currently under progress). The basic idea consists of enabling a selected set of devices at the edge of this internal network cloud to coordinately work like a single logical node with a unique configuration and management interface, without exposing the presence of the internal network and the complexity of the distributed architecture. Figure 1. Overview of distributed router architecture. The current paper focuses on introducing the DROP architecture and working dynamics, validating its mechanisms through standard testing methodologies and demonstrating that it allows increasing SR performance in a scalable way. The paper is organized as follows. The next section describes some related work. 
The basic concepts concerning Linux SR architectures are summarized in Section 3, by focusing on the internal APIs used between control- and data-planes. Section 4 introduces the main design and functional issues in distributing IP router functionalities, and Section 5 provides an overview of DROP and its architecture. Section 6 shows some working dynamics of the proposed architecture and sketches how DROP's building blocks interact among themselves. Section 7 reports the results of tests that were conducted to analyze the performance and the architectural bottlenecks of the prototype. Conclusions are finally drawn in Section 8.

RELATED WORK
The idea of a distributed SR has already been investigated in recent works, such as [12], [13] and [14]. However, these papers tackled the router distribution issue in a different and somewhat complementary way with respect to the DROP objectives. A software router architecture, called RouteBricks, was proposed in [12], [13]. RouteBricks was specifically designed to scale SR data-plane performance through the parallelization of router functionality, both across multiple servers and across multiple cores within a single server. By using four state-of-the-art general

3 3 purpose servers and the Click software [15], the authors demonstrated a 35 Gbps parallel router prototype; this router capacity can then be linearly scaled through the use of additional servers. With a similar purpose, Bianco et al. [14] [16] proposed a multistage architecture, exploiting PC-based routers as switching elements. The multistage architecture was designed to overcome the intrinsic performance limitations of a single-box SR. By combining simple layer-2 load-balancing capabilities at the front stage with an array of layer-3 routers at the back stage, the authors demonstrated that the proposed architecture has excellent scalability characteristics, which could, for example, enable the routing of minimum size packets at line rates in the order of the Gbit per second. In this respect, Sarrar et al. [17] [18] proposed a scheme to increase lookup scalability in data-plane elements. In more detail, they propose a two-stage lookup mechanism: the data-plane elements include a small and fast lookup table cache, filled with the most used and recent routing entries. If the incoming packets hit one of the cached entries, their forwarding happens entirely on the first-stage data-plane element. In case of cache miss, the packets are sent to a second-stage element, named controller, which contains the entire routing table of the device. With respect to the previous contributions, the current work does not directly focus on how to scale the dataplane performance of a multi-box SR, but rather on how to realize its dynamic and autonomic management. In fact, the main objectives of the DROP platform consist of offering support to the following tasks: the run-time composition of the aggregated router, by dynamically managing subscription and disassociation of control and forwarding elements; the flexible adoption of different internal network topologies and organizations: unlike the Routebricks and Bianco et al. proposal, DROP does not fix specific requirements for the distributed data-plane architecture and organization; the dynamic update of routing information and parameters (i.e., routing tables, network interface status, etc.); the management of multiple control elements for resilience purposes; the execution of common signaling and control functionalities (e.g., OSPF and BGP) and the management of the router slow-path. The DROP prototype also includes native support for Netlink communications with the Linux kernel and opensource routing suites, such as Quagga [19] and Xorp [20]. Some developments based on the IETF ForCES architecture have been proposed in the last years (see for example [21] [22] [23], among others). However, to the best of the authors knowledge, such developments do not fit natively on the Linux operating system and commodity HW. The OpenFlow project [24] [25] is a further activity relevant for the current work. Briefly, OpenFlow is a protocol for interfacing the data- and control-plane in a flexible and easily extensible way. Openflow is already supported in a large set of commercial network devices, and it allows the external management of their data-plane capabilities through external control-plane applications. In this respect, OpenFlow can be seen as an interesting alternative to the ForCES protocol. Future versions of the DROP software will include support of OpenFlow. In this respect, the RouteFlow project [26] [27] is relevant to DROP. 
Like our proposed framework RouteFlow aims at providing an abstraction interface for interconnecting the Quagga routing suite with elements specialized in data-plane operations. The main difference between the two frameworks certainly consists of both the typology of forwarding elements, and the protocol used to remotely manage them: RouteFlow is based on OpenFlowenabled switches, DROP on Linux software routers. This base difference certainly affects the internal organization of data and mechanisms, but above all it directly leads to the underlying and thin gap between open-source software-based network devices and software-defined networks. The latter allow using a pre-determined set of base functionality methods exposed by data-plane hardware in a flexible way. In the OpenFlow case these methods follow a flow-driven model. The extension of the set of methods may result complex, but the data-plane elements generally provide high performance, since interface methods are often directly translated into configurations of specialized hardware. On the contrary, software-based devices intrinsically offer the possibility of extending their functionality, generally at the cost of a lower performance level. Moreover, in open-source software-based devices like DROP, the networking functionalities can be easily extended, upgraded or modified by everybody without any additional costs. In addition, software-defined approaches may not be sufficient to effectively cover every advanced network functionality, like, e.g., traffic monitoring and deep packet inspection [28] [29]. Nevertheless, the same functionalities may be natively handled by software-based devices. The resulting situation suggests that the two approaches are complementary and that may be jointly applied to realize highly programmable future networks. Finally, other recent works, such as [30], [31], and [32], deal with the performance optimization of single-box SRs. These contributions showed that new-generation single-box SR platforms can exploit a Linux-based networking SW system and can correctly deploy a multi-cpu/core PC architecture. The achieved results demonstrate that Linux SW routers can attain remarkable levels of data-plane performance, while at the same time preserving portions of the PC s capacity for the application layer. MONOLITHIC LINUX SW ROUTERS Standard architectures for monolithic SRs have to provide a wide and heterogeneous set of functionalities and

capabilities. These range from the functions directly involved in the packet forwarding and switching process to those needed for control (e.g., OSPF and BGP), dynamic configuration and monitoring. With focus on Linux-based architectures, as outlined in [33], [34] and [35], all of the forwarding functions are realized inside the Linux kernel, while most of the control and monitoring operations (e.g., routing and control protocols) are daemons/applications running in user mode. The most well-known examples of network applications/daemons are the Quagga [19] and the XORP [36] routing suites. Similar to their commercial relatives, SRs provide two main kinds of internal traffic paths: the fast and the slow path. The fast path substantially consists of the L2 and L3 forwarding chains, and it is selected for all data packets that only need to be routed or switched and do not require processing at the service/control layer of the local router. In contrast, the slow path is used by all packets that are directed towards local service and control applications (e.g., OSPF Hellos and Link State Updates (LSUs), as well as BGP keep-alive messages, have to be delivered to local IP control applications). Packets following the slow path are generally referred to as exception packets. As pointed out in Fig. 2, the delivery of exception packets to control and service applications is performed through well-known standard interfaces between kernel and user space, namely, network sockets. In addition to packet-related APIs, a router architecture needs a further set of interfaces between data- and control-planes to exchange control data (e.g., for updating the Forwarding Information Base (FIB) or for exchanging information about the status of a network interface). In this regard, Linux includes a highly advanced and complete API, called Netlink [37], which is used as an intra-kernel messaging system, as well as between kernel and user space. Netlink includes all the L2 and L3 interfacing capabilities needed to synchronize the control and the data engines of an entire router. It also allows unicast, multicast and broadcast delivery of control data between data-plane components and applications.

Figure 2. Reference architecture of a Linux-based monolithic SR: control plane and services (e.g., Quagga) in user space, interfaced through packet sockets and the control/status API with the L2/L3 FIBs, forwarding chains and switching matrix in kernel space.

DESIGN CONCEPTS, CONSTRAINTS AND GUIDELINES FOR DISTRIBUTING ROUTER FUNCTIONALITIES
The realization of a distributed router deals with the separation of data- and control-plane functionalities into multiple logical elements that can mutually cooperate to behave like a common single-box device. The control-plane processes and tasks of an IP router cannot be easily distributed among different HW platforms, because today's routing protocols are designed to work with a single aggregation point. For example, the software realizing routing operations (e.g., OSPF and BGP) usually runs as applications/daemons on a single device. Starting from these considerations, the aggregation point has to run such control applications and maintain the entire Forwarding Information Database (FIB) of the distributed device, which includes all of the routing and policy tables and the list of network interfaces.
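The Netlink channel just described is also what the DROP framework relies on to observe kernel state (the FEC-to-kernel interface discussed later). As a purely illustrative, minimal sketch, the following Python fragment (standard library only, Linux-specific) subscribes to a few rtnetlink multicast groups and decodes the fixed message header of the notifications that a routing daemon or a DROP-like controller would react to; the numeric constants are taken from <linux/rtnetlink.h>, and attribute parsing is omitted.

import socket
import struct

# rtnetlink multicast groups (values from <linux/rtnetlink.h>)
RTMGRP_LINK = 0x1
RTMGRP_IPV4_IFADDR = 0x10
RTMGRP_IPV4_ROUTE = 0x40

# rtnetlink message types of interest (values from <linux/rtnetlink.h>)
RTM_TYPES = {16: "RTM_NEWLINK", 17: "RTM_DELLINK",
             20: "RTM_NEWADDR", 21: "RTM_DELADDR",
             24: "RTM_NEWROUTE", 25: "RTM_DELROUTE"}

sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
# pid 0 lets the kernel assign a unique netlink port id; the second field selects the groups
sock.bind((0, RTMGRP_LINK | RTMGRP_IPV4_IFADDR | RTMGRP_IPV4_ROUTE))

while True:
    data = sock.recv(65535)
    offset = 0
    while offset + 16 <= len(data):
        # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
        msg_len, msg_type, flags, seq, pid = struct.unpack_from("=IHHII", data, offset)
        if msg_len < 16 or offset + msg_len > len(data):
            break
        print(RTM_TYPES.get(msg_type, f"type {msg_type}"), "len", msg_len)
        offset += (msg_len + 3) & ~3  # netlink messages are 4-byte aligned

A full implementation would also decode the per-message attributes (interface index, prefix, next hop, and so on), which is exactly the information a control element needs to keep its FIB synchronized.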
As outlined in [9], the availability of multiple elements that are capable of performing control-plane functionalities can be easily exploited only for resilience purposes. For example, if multiple elements acting at the control- plane are present, only one can work actively, while the others can only be used as backup copies for fault recovery. The active control element has to manage multiple data-planes in a dynamic and coordinated way. For example, it has to maintain and update the aggregated FIB (i.e., the FIB of the logical node resulting from the aggregation of all the elements). This aggregated FIB corresponds to the one that an equivalent single-box router would have. Its filling obviously depends on the information and data coming from both the forwarding elements and the network control/signaling processes (e.g., Quagga). Thus, the information in the aggregated FIB has to be suitably disaggregated in a number of local routing databases, one for each element performing forwarding operations. In addition, a copy of the disaggregated FIB has to be installed at each forwarding element and synchronized with the central database. All of these design requirements lead to the

5 5 need for a distributed software that is able to bi-directionally transfer a router s control data 1 and to aggregate/disaggregate the data among multiple data-plane elements, including the active control-plane. In addition to FIB management, network control and signaling applications (e.g., routing daemons like Quagga), which run on control elements, need to receive and transmit protocol signaling packets from/to the network interfaces of data-plane elements. Common IP routing protocols require different typologies of signaling packets, which range from IP datagrams with unicast and/or multicast addresses (e.g., OSPF) to TCP connections toward the IP addresses of external network interfaces (e.g., BGP). Therefore, a distributed router architecture must also provide specific solutions for enabling data-plane elements to forward such signaling packets to the control element. In other words, a specific solution is needed for the management of the router s slow path. THE DROP ARCHITECTURE The proposed architecture consists of a cooperative and distributed software framework that controls and coordinates the data-plane and the control-plane functionalities among different SRs in a dynamic and autonomous way. SRs that specialize in data-plane functionalities are referred to as Forwarding Elements (FEs), and those acting at the control-plane are known as Control Element (CEs). In more detail, DROP aims to: 1) aggregate and coordinate the fast path of a set of Linux boxes; 2) use a single Linux box to act as a reference control element, called master CE (mce), and run control applications (e.g., Quagga and Xorp) and user-driven configuration and monitoring tools (e.g., tc and ip) 2 ; 3) realize a flexible paradigm for supporting the router s internal slow path; 4) provide multiple backup elements for the control functionalities, called backup CEs (bces), which may quickly become active upon the failure of the master control element. It is worth noting that the large part of base IP router functionalities provided by DROP can be realized also using Openflow-based data-plane hardware in place of Linux SRs (future DROP versions will include such capabilities). As discussed in section 0, the value-added contribution given by a SR solution consists in extending and supporting advanced functionalities (e.g., deep packet inspection) at the data plane in a more flexible and effective way. The DROP framework consists of two main applications: the Control Element Controllers and Forwarding Element Controllers (CEC and FEC, respectively), which run on elements that perform the control-plane and dataplane functionalities 3 and jointly cooperate to make these multiple entities behave as a single network element. CEC and FEC applications interact among themselves to realize the distributed management of the slow and fast paths in a transparent and autonomic way with respect to other network processes and applications. CECs are devoted to managing and exchanging information regarding the whole aggregated router with the control applications. These applications include Quagga and/or other Operation, Administration and Management (OAM) tools like command line interfaces. FECs are responsible for managing the forwarding configuration of each FE. Fig. 3 shows an overall overview of the proposed framework and outlines the different SW modules included in CEC and FEC applications and the interfaces between them, including those to control applications and to the Linux kernels of the FEs. 
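The following sub-section states that CEC and FEC exchange control data over the internal network with a simple protocol inspired by ForCES, without giving its wire format. Purely as an illustrative sketch, with entirely hypothetical message types and field layout, such an exchange could be framed over the internal TCP connections as follows:

import struct

# Hypothetical fixed header for CEC<->FEC control messages:
# u16 version, u16 msg_type, u32 element_id, u32 transaction_id, u32 payload_len
HDR_FMT = "!HHIII"
HDR_LEN = struct.calcsize(HDR_FMT)

MSG_SET = 1   # e.g., push a disaggregated FIB change to a FE
MSG_GET = 2   # e.g., read FE parameters
MSG_ACK = 3   # positive acknowledgement
MSG_NACK = 4  # negative acknowledgement / abort

def pack_message(msg_type, element_id, tid, payload: bytes) -> bytes:
    """Frame one control message for the internal TCP channel."""
    return struct.pack(HDR_FMT, 1, msg_type, element_id, tid, len(payload)) + payload

def unpack_message(sock):
    """Read one framed message from a connected socket (blocking)."""
    hdr = _recv_exact(sock, HDR_LEN)
    version, msg_type, element_id, tid, plen = struct.unpack(HDR_FMT, hdr)
    return msg_type, element_id, tid, _recv_exact(sock, plen)

def _recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the control channel")
        buf += chunk
    return buf

A length-prefixed binary header of this kind keeps the channel easy to parse on both ends; the real DROP channel additionally carries Netlink-formatted payloads and passes through the authentication threads described in the following sub-sections.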
The aggregation/disaggregation of such data is centrally guided by the active CEC, which includes most of the DROP logic, mechanisms and algorithms. FECs are devoted to the application of commands and configurations from the CECs and to forwarding all routing exception data to the CECs. The rest of this section describes in detail the main architecture and functionalities of the DROP framework. The first sub-section introduces the internal interfaces needed to exchange control data, and the two following sub-sections describe the FEC and the CEC applications. A further sub-section discusses how multiple backup CECs can be maintained for fast recovery purposes. Finally, the last sub-section focuses on the Tx/Rx Slow Path Exception Packet Manager (xppm) module, which is part of the FEC and is the key element for the management of slow-path processes in the DROP framework.

The Distributed Router Internal Interfaces
As shown in Fig. 3, the DROP framework uses different interfaces to bi-directionally exchange control data and exception packets among CEC and FEC applications, the data-planes of FEs, and control-plane applications. In detail, there are three communication stages: (i) the communication between the FE kernel and the FEC application, (ii) the communication between the FEC and the CEC, and finally, (iii) the communication between the control processes and the CEC. As far as the communications towards the control-plane and the kernels are concerned, we decided to use standard Linux intercommunication interfaces to guarantee maximum compatibility with the Linux system and its applications. Exception traffic is exchanged through standard network sockets. Control data are exchanged through the Linux-native Netlink protocol [37] (see footnote 4). These control data include data regarding the configuration of the router, such as, for example, a routing table or the parameters of a network interface. The control data communication between the CEC and the FEC is realized through a very simple protocol inspired by the ForCES one [38]. The forwarding of exception packets is performed with the packet encapsulation mechanism introduced later in this section (see the xppm sub-section), and the control data is carried by using the same templates and contents of Netlink messages.

1 The same kind of information is carried by Netlink in a standard Linux-based architecture.
2 Control applications may indeed run on third-party elements, but a master control element is required to maintain the database of the distributed router elements and to synthesize the Forwarding Information Database (FIB).
3 If an SR performs both functionalities, separate instances of a FEC and a CEC are required for that element.
4 The communication between the FEC and the kernel is realized over a standard Netlink socket, and the one between the CEC and the control applications by using Netlink packets over a loopback UDP socket.

Figure 3. Software architecture of the DROP framework: CEC and FEC applications and their intercommunication interfaces.

Forwarding Element Controller
As previously mentioned, the FEC is a control application that acts mainly as a bi-directional SW bridge between the CEC and the forwarding element's data-plane. As shown in Fig. 3, the xppm module is part of the FEC application, but because it performs a different kind of task with respect to the other FEC threads, it is introduced in a separate sub-section. In detail, the main objectives of the FEC consist of the direct management of the SR functionalities and control data of the FE and of the synchronization of the local network information with the CEC. For this purpose, each FEC uses two communication interfaces: the FEC-to-kernel and FEC-to-CEC interfaces. The former allows the FEC to write/modify network parameters in the SR kernel. Such parameters include configurations of network interfaces and routes. The FEC-to-kernel interface also reads notification events from the kernel (e.g., link events and failures). The latter is used for two-way communications between the FEC and the CEC. The FEC is composed of multiple SW threads, each one with a specific role. The core thread, called the FE Manager, maintains a copy of the FE disaggregated FIB and provides the mechanisms for performing three important functions: (i) communication management with the master CEC, (ii) retrieval of control data and events from the kernel, and (iii) processing of commands/notifications from the CEC or from the kernel. As far as the last functionality is concerned, the FE Manager can query or write configuration data through the Netlink client thread, which simply translates the requested operations into Netlink syntax and sends the resulting messages to the SR kernel. Kernel answers and notifications are received by the Netlink server thread. Upon receiving a message, the Netlink server parses it and forwards it to the FE Manager and, if needed, notifies the CEC. The FEC-to-CEC connectivity is locally provided by four threads that realize two separate layers of communication. The former provides an authentication mechanism for the message exchange (Transmission and Reception Authentication, Tx and Rx Aut) to create a secure channel between the CEC and the FEC. The latter, TCP Tx and Rx, implements the basic operations for TCP connection management. Another FEC thread is the Bootstrap Client, which implements the initial CEC discovery mechanism and all the routines needed for connecting with the CEC.

Control Element Controller
The principal aim of the CEC is threefold: (i) maintain the connectivity to FEs and backup CEs; (ii) provide an

7 7 interface layer to control-plane applications and expose the aggregated FIB to them; (iii) elaborate all of the data coming from the control-plane applications and the FEs to manage the aggregated FIB and the disaggregated copies for FEs. Similarly to the FEC, the CEC application consists of a set of SW threads, as shown in Fig. 3. Here, the core thread is the CE Manager, which includes all of the mechanisms and algorithms needed to dynamically coordinate the FEs and control-plane applications. In fact, all the other CEC threads are solely devoted to maintaining or initializing communications to other FECs, CECs or local control applications/services. Specifically, the bidirectional communication with control and service applications (e.g., Quagga) is provided by two threads, namely, the Netlink client and server, which are devoted to sending and receiving control data, respectively. The connectivity toward other FECs and CECs is managed through a stack of threads: Tx and Rx Aut, which are devoted to authentication operation of outgoing and incoming messages; Two Tx and Rx TCPs for each connected FE and CE, which are then used to the manage transmission and reception operations of TCP sockets. In addition to the previously cited threads, a Bootstrap Server is used to listen to connection requests originated by FECs in the initialization phase. When it receives a valid join request, the CEC sends the FEC the proper configuration parameters such as the IP address and TCP connection port. Coming back to the CE Manager, it realizes a complex state-machine that manages any FIB and configuration modifications. For example, it updates the routing table and adds a network interface or a new FE in a secure and reliable way. This state machine can be triggered by messages and notifications coming from the local controlplane applications by means of the Netlink Server thread, or from other elements by means of the TCP Rx and Rx Aut threads. Messages from other elements or Netlink messages from control-plane applications contain one or more requested operations. Operations can be classified into two main typologies: set and get. A get operation is a simple request for reading one or more parameters of the FIB; consequently, the CE manager replies to the sender with a message containing the required information. A set operation is much more critical, since it requires modifications to the FIB and, potentially, to the configurations of all FEs, so that a reliable mechanism for handling unexpected states is clearly needed. Therefore, we started our design by considering the Netlink features. In fact, this protocol already provides a robust set of atomic commands that are used by applications to request changes to the local kernel, and notify the results back. Linux applications performing complex operations usually include the necessary logic for translating them into atomic Netlink commands, and for managing their (intermediate) results. However, in the DROP context, each atomic Netlink command or event (e.g., disconnection of a link) ends up in a number of updates for synchronizing remote FE FIBs. The return status of a command can be positive only upon the acknowledgment of all the involved FEs. Otherwise, the command needs to be aborted and the pending changes removed by FIB copies on CEs and FEs. Specifically, as shown in Fig. 
4, when the manager receives a set operation, it usually has to:
(i) process and update its internal databases according to the received message, and mark the modification as "pending";
(ii) forward the requested modification to the local control-plane and/or to the elements (as needed);
(iii) wait for an acknowledgement that the requested modification has been applied/received by control applications or elements;
(iv) when all the expected acknowledgements arrive, send a confirmation to the FEs and pass the status of the modifications in the CE database from "pending" to "confirmed";
(v) send a notification of the FIB change to control-plane applications.
(A minimal code sketch of this pending/confirm sequence is given below.) It is worth noting that the CE Manager has to communicate with elements and with local control applications not only by means of two different protocols (Netlink and ForCES) but also with different data structures. In fact, communications towards the control-plane are performed with the aggregated FIB and those towards the other elements with the disaggregated version. The aggregated FIB contains only a part of the information maintained by the disaggregated FIBs, because the latter include additional parameters of the internal configuration of the distributed router. Tables I, II and III are an example of the three tables that compose the entire disaggregated FIB maintained by the CE Manager, namely, the list of connected elements, the table of network interfaces, and the routing table.
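As announced above, the following is a minimal sketch of the pending/confirm sequence implemented by the CE Manager (cf. Fig. 4 below). It assumes FE proxy objects exposing hypothetical send_change/confirm/remove_pending methods and a FIB object with the corresponding bookkeeping calls; it only illustrates the two-phase pattern, not DROP's actual implementation.

def apply_set_operation(fib, change, forwarding_elements, timeout_s=2.0):
    """Two-phase application of a FIB change, in the spirit of the CE Manager state machine."""
    fib.mark_pending(change)                       # 1) record the change as "pending"
    involved = [fe for fe in forwarding_elements if fe.is_involved(change)]
    acks = []
    for fe in involved:                            # 2) push the disaggregated change to each FE
        acks.append(fe.send_change(change, timeout=timeout_s))
    if all(acks):                                  # 3a) every FE acknowledged: confirm everywhere
        for fe in involved:
            fe.confirm(change)
        fib.confirm(change)
        return True                                # caller notifies control-plane applications
    for fe in involved:                            # 3b) timeout or NACK: roll back everywhere
        fe.remove_pending(change)
    fib.remove_pending(change)
    return False                                   # caller generates a Nack for the requester

On the failure branch the caller generates a Nack towards the requesting control-plane application, exactly as in the lower path of the state machine of Fig. 4.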

Figure 4. Base state machine implemented by the CE Manager for set operations: upon a new configuration change request (e.g., triggered by control-plane applications, or by link or element fault events on a FE), the new aggregated and disaggregated FIBs are recalculated, FIB change requests are sent to the involved FEs, and the pending state is entered; if the notifications from all the FEs are received and are all successful, the modifications are confirmed to all FEs and to the local aggregated and disaggregated FIBs, otherwise (negative notification or timer expiry) the pending modifications are removed from the FEs and from the local FIBs and a Nack is generated for the control-plane application; in both cases a notification is finally sent to all the control-plane applications.

In detail, Table I shows that the CE Manager maintains a specific entry for each connected element, which includes a univocal identifier, the type of the element (CE or FE), its list of private IP addresses (i.e., the addresses of the interfaces that compose the private network), and a list of available routing capabilities. As shown in the example data in Table I, DROP allows elements to have multiple internal interfaces and to build any internal topology to interconnect elements. The data in Table I describe a star topology (a /24 network to which all the elements are connected) with the addition of two full-duplex links, between FE 0 and FE 1 and between FE 1 and FE 3. The traffic routing inside such a topology is managed by the CE Manager during the FIB aggregation and disaggregation operations. Table II maintains all of the data related to the public interfaces of the distributed router. For each network interface, this table includes all the parameters that are usually accessible through the ifconfig command in a standard Linux box (e.g., L2 protocol, link speed, MAC address, IP address, IP netmask, status of the link, and Tx and Rx traffic counters). Moreover, because the interfaces of different FEs may have overlapping names and indexes (as shown in Table II), this table also has to provide a suitable and univocal remapping of such parameters. Table III reports the disaggregated routing table, which corresponds to the sum of the local routing tables of all the FEs. The corresponding aggregated version is shown in Table IV, and it is substantially the classical routing table of an equivalent single-box router. The content of this last table is computed by the routing daemons (i.e., Quagga) and sent to the CEC through the Netlink interface.

TABLE I: LIST OF CONNECTED ELEMENTS MAINTAINED BY THE CE MANAGER.
| Element ID | Type | Private addresses | Capabilities |
| 0 | FE | /24, /24 | IP forwarding, IP QoS, Ethernet forwarding, etc. |
| 1 | FE | /24, /24, /24 | IP forwarding, IP QoS, Ethernet forwarding, etc. |
| 2 | CE | /24 | IP control-plane, backup |
| 3 | FE | /24, /24 | IP forwarding, IP QoS, Ethernet forwarding, etc. |

TABLE II: TABLE OF THE PUBLIC NETWORK INTERFACES MAINTAINED BY THE DISAGGREGATED FIB OF THE CE MANAGER.
| Aggregated interface name | Local interface name | FE Id | Aggregated interface index | Local interface index | IP address | Type | Speed |
| eth0:0 | eth | | | | /24 | Eth | 100 |
| eth0:1 | eth | | | | /24 | Eth | 1000 |
| eth1:1 | eth | | | | /24 | Eth | 1000 |

TABLE III: DISAGGREGATED FIB MAINTAINED BY THE CE MANAGER.
| # | FE# | Network | Next hop | Interface | Origin |
| | | / | | eth0:0 | Static |
| | | / | | | Static |
| | | / | | | Static |
| | | / | | | Quagga |
| | | / | | eth0:1 | Quagga |
| | | / | | | Quagga |
| | | / | | | Static |
| | | / | | eth1:1 | Static |
| | | / | | | Static |

As the CE Manager receives an aggregated routing entry, it derives an entry for each FE. As shown in Table III, the FE that owns the egress interface has the same entry as the aggregated routing table, while the routing lines of the other FEs point to an internal delivery. The delivery is calculated by the CE Manager on the basis of the internal topology, so as to guarantee shortest paths and load balancing. Once derived, these entries are sent to the FEs through the mechanisms described earlier in this section. The aggregated FIB (i.e., the control data that is exchanged with the applications working at the control plane) is composed only of the routing table in Table IV and the network interface list in Table V, which includes only a subset of the parameters in Table II. In fact, in the aggregated version of the network interface list, all parameters that refer to the internal organization of the distributed router (e.g., the ID of the FE owning the interface) are omitted.

TABLE IV: AGGREGATED FIB MAINTAINED BY THE CE MANAGER.
| # | Network | Next hop | Interface | Origin |
| | / | | eth0:0 | Static |
| | / | | eth0:1 | Quagga |
| | / | | eth1:1 | Static |

TABLE V: TABLE OF THE PUBLIC NETWORK INTERFACES MAINTAINED BY THE AGGREGATED FIB OF THE CE MANAGER.
| Aggregated interface name | Aggregated interface index | IP address | Type | Speed |
| eth0:0 | | /24 | Eth | 100 |
| eth0:1 | | /24 | Eth | 1000 |
| eth1:1 | | /24 | Eth | 1000 |

Backup CE
The distributed router has a centralized architecture, because it supports only one active CE at a time. The presence of a single CE is obviously critical, because it is a single point of failure. To avoid this drawback, DROP supports the presence of multiple backup CEs, which, in case of failure of the master, can replace it in an automatic and transparent manner without modifying any internal configuration. In usual operating conditions, all of the backup CEs (bces) maintain an active TCP connection to the mce, which is used for synchronizing FIB data and for exchanging heart-beating messages. In detail, every time the master CE updates its databases, it forwards the updated data to the bces with ForCES messages. In this way, every bce has a full and updated copy of all the master CE's databases. In contrast, heartbeat messages are used to monitor potential failures of the mce. Every CE has a univocal identifier number that is set during the initial bootstrap operations. The identifier is used to manage CE priorities in case of failure. The mce takes on the maximum value for this identifier, while the best candidate bce to replace the mce takes on the second maximum value. However, upon mce failure, a re-election phase among CEs is performed to ensure the presence of other bces as well. It is worth noting that, upon mce failure, and while a new master CE is being elected, all the FEs maintain their configuration for a certain time (as specified in the RFC 3623 non-stop forwarding mechanism [39]). If the new mce is elected before this time (the usual case), the forwarding process suffers no interruption. If no new mce is elected before such a time period expires, the FEs reset their configurations.

Tx/Rx Slow Path Exception Packet Manager (xppm)
As sketched in the section on design concepts and guidelines, one of the major issues in distributing routing functionalities is related to the management of the slow path.
Most of the protocol signaling is carried by heterogeneous packets, which are often destined to the IP addresses of router interfaces or to IP multicast addresses (e.g., OSPF). Moreover, the signaling data are encapsulated with highly heterogeneous protocol stacks. For instance, OSPF uses signaling packets directly encapsulated in IP datagrams. Specifically, on broadcast networks, the OSPF protocol uses two multicast addresses (224.0.0.5 to send Hello packets and 224.0.0.6 to send information data). BGP transfers signaling packets through TCP connections on port 179. When internal BGP (i-BGP) is adopted, the TCP connection usually ends at the IP loopback address; when external BGP (e-BGP) is adopted, TCP connections end at the IP addresses of the external interfaces. In a monolithic router, both e-BGP and i-BGP connections are directly tied to the BGP routing software. In our case, if the loopback address is bound to the CE, only i-BGP connections may be directly received. In contrast, e-BGP connections have to be proxied by the FE towards the routing daemons at the CE. Starting from all these considerations, we decided to introduce a specific process in the FEC architecture: the xppm, which is devoted to the following functions: (i) intercepting signaling packets to be sent to the CE, (ii) forwarding such packets to control-plane applications that run on the CE, also including some useful local data (e.g., the identifier of the network interface where the packet was received), and (iii) intercepting possible replies from control-plane applications and forwarding them towards the correct external interface. As shown in Fig. 5, the xppm includes two main building blocks: the Connection-Oriented xppm (CO-xPPM) and the Connection-Less xppm (CL-xPPM), which manage exception traffic carried on TCP connections and connection-less traffic, respectively. In detail, the CL-xPPM is specifically designed for managing every exception packet that is not carried through TCP connections but with other heterogeneous encapsulations, such as UDP, IP, and Ethernet. Signaling traffic carried by TCP is managed in a separate way only because standard TCP sockets need quite different software dynamics with respect to connection-less sockets (e.g., raw sockets).

CL-xPPM
The CL-xPPM is composed of a set of thread pairs. Each pair is specifically devoted to managing the Tx and Rx operations of a certain type of signaling packets (e.g., UDP packets on a certain Rx port, or OSPF packets). The thread pairs are allocated by the FEC manager upon an explicit request from the CEC, which must provide all of the required information on the packet template to manage. Depending on the specific packet template, a different type of socket (e.g., L2 and L3 raw sockets and UDP sockets) is used to receive and transmit packets to/from the external and the internal networks. The pair is composed of a first thread that receives packets from the external networks and retransmits them to the CE, and a second thread that manages the signaling traffic in the opposite direction. The packet delivery from the CL-xPPM to the CE is performed by encapsulating the data and all of the needed headers over new IP headers. The new IP headers have the CE address as the destination IP, and further information related to the Rx network interface is added. In a similar way, signaling packets from the CE are encapsulated over IP headers and directed to the FE; these packets also include information on the Tx link interface to be used. It is worth noting that the simple delivery of signaling packets between the xppm and the CE is not sufficient, since some control and service applications may need to exchange additional information/commands related to the external interface. For example, in applications using raw sockets (e.g., OSPFd in Quagga), it is often necessary to preserve the identifier of the external interface that originally received the packet.
Thus, the xppm was designed to enclose all such parameters in the traffic forwarded to the CE. Moreover, because well-known signaling protocols such as OSPF dynamically use different L2/L3 multicast addresses, the CE applications can directly request through a simple IP packet to the CL-xPPM pair to enable the reception of a certain multicast address. CO-xPPM The CO-xPPM is devoted to managing signaling traffic carried on TCP connections. It is composed of a master thread that acts as a controller and a variable number of thread pairs, one for each managed TCP connection. The aim of each thread pair is to proxy a TCP connection, coming from an external device and ending at the FE towards a new connection from the FE to the CE. The first thread on each pair manages the I/O operations of the TCP sockets towards the external network, and the second thread manages the same operations for the internal TCP socket. The two threads have two shared buffers that are used to exchange data from the external to the internal connection and vice versa. When a TCP connection is closed by the CE or by an external device, the thread receiving the connection closure signals it to its twin, and both connections will be closed. The controller thread has two main objectives. 1) to periodically check the status and the correct work of each thread pair; 2) to automatically allocate a new thread pair if a new TCP connection is initialized from the external or from the internal network. Obviously, the FEC manager has to notify a connection template (i.e., the values of the Tx port of the TCP) that the controller thread must manage.
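To make the xppm delivery model concrete, here is a minimal, illustrative Python sketch of the receive direction of a CL-xPPM-like thread: each exception packet read from an external raw socket is prefixed with a small header carrying the aggregated index of the receiving interface and relayed to the CE over the internal network. The header layout, names, and the use of a plain TCP relay are assumptions of this sketch; the actual CL-xPPM encapsulates packets in new IP headers addressed to the CE, as described above.

import socket
import struct

# Hypothetical relay header: u32 aggregated interface index, u16 payload length.
RELAY_HDR = "!IH"

def rx_loop(raw_sock, ce_addr, aggregated_ifindex):
    """External interface -> CE: first thread of a CL-xPPM-like pair (illustrative only)."""
    with socket.create_connection(ce_addr) as ce_sock:
        while True:
            pkt = raw_sock.recv(65535)      # e.g., an OSPF packet read from an L3 raw socket
            hdr = struct.pack(RELAY_HDR, aggregated_ifindex, len(pkt))
            ce_sock.sendall(hdr + pkt)      # the CE de-encapsulates it and hands it to the daemon

The twin thread performs the symmetric operation, reading framed packets from the CE and transmitting them on the external interface indicated in the accompanying metadata.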

11 11 Figure 5. xppm architecture and main building blocks. DISTRIBUTED ROUTER DYNAMICS This section introduces some examples of DROP working dynamics. In detail, sub-section 0 shows the main operations and SW dynamics when a routing application (i.e., Quagga) updates the routing table. Sub-section 0 introduces how a link fault on a FE is managed by the distributed router. Finally, sub-section 0 shows the main operations performed by the master CE and FEs for the Tx and Rx of exception packets. Updating the routing tables When a control application updates one or more entries of the routing table, it interacts with the CEC through the Netlink interface, as it does in standard Linux boxes. Routing table updates are intercepted by the Netlink Server, which subsequently reads the message payload and decodes it. The information is forwarded to the CE Manager, which updates the aggregated FIB with the new data, marking it as pending. The CE manager produces the new disaggregated FIB and sends the new data to the FEs by following a twostep mechanism. First, it communicates the new route entry through the ForCES interface to the FE that owns the route egress interface. For example, with reference to entry #1 of Table IV, the CE Manager sends the update message to FE #0 because it owns the network interface eth0:0. The disaggregated route sent to FE #0 is represented in the first line of Table III. On the other hand, the FEC receives the update message through Authentication Rx, which checks the network with TCP Rx. The FE Manager then processes the request and adds a routing path. This action is realized by writing a Netlink message, which is dispatched towards a Linux kernel across the Netlink Client thread. When the FE that owns the egress interface acknowledges the successful update of its local FIB, the CE manager signals to all the other FEs that the new route is available through distributed router internal delivery. To this end, the CE manager sends a different local route update to each FE, which contains the disaggregated FE-specific versions of the routing entry. However, with reference to the entry #1 of Table III, the second line is the local entry to be sent to FE #1 and the third line for the local entry to be sent to FE #3. Upon the successful completion, the FEs sends acknowledgement messages to the CEC. When all the acknowledgements are received, the CE manager changes the routing modifications from the pending state to the stable one, and forwards FIB synchronization messages to bces. Finally, the CE manager sends a further acknowledgement message to the control-plane applications through the Netlink interfaces. Similar procedures are also executed for deleting entries in the routing table. Link Fault A link fault event can be caused by the failure of link media, of the local network interface, or of the neighboring node. The management of such an event involves both the FEs and the mce, because it must be centrally managed by the control application of the CE, although it is an advertisement that occurs on a FE. Specifically, such messages are received from the kernel by the Netlink Server threads of the FEC. The FE Manager creates a notification message for the mce to be sent through the ForCES interface. When the message is received and parsed by the

CEC, the CE Manager: (i) updates the status of the network interface in the FIB; (ii) invalidates all the routes towards such an interface, and sends update messages to the FEs according to the mechanism shown in the previous sub-section; (iii) sends a notification to control-plane applications through the Netlink interfaces. Thanks to this last notification, routing protocols can immediately advertise the topological change due to the link fault, propagate it to the neighboring nodes, and populate the new routing table.

I/O Operations for Exception Packets
The complete realization of the slow path obviously deals with both the reception and the transmission of exception packets. As already introduced in the description of the DROP architecture, the key element is the xppm module, which maintains a number of network sockets listening for exception traffic on both the internal and the external interfaces. The type and the number of sockets are signaled by the mce on the basis of the active control applications. The reception phase usually works according to the steps indicated in Fig. 6, which can be summarized as follows:
1) the exception packet is received by the FE kernel on a public network interface and delivered to the xppm through the network socket;
2) when the xppm receives the exception packet, it encapsulates the packet and some additional parameters in a new packet (the additional parameters include the aggregated index, i.e., the univocal index across the whole distributed router, see Table II, of the network interface that received the packet);
3) the new encapsulated packet is sent to the mce through a TCP connection;
4) when the mce receives the packet, it de-encapsulates the original exception packet and forwards it, along with the aggregated interface index, to the routing daemon through the CE-internal loopback network interface.
The transmission phase, which is shown in Fig. 7, consists of the same operations as the reception phase, but applied in the opposite order.

Figure 6. Reception operations of exception packets.
Figure 7. Transmission operations of exception packets.

PERFORMANCE AND ARCHITECTURAL EVALUATION
Our primary objective in this section is to evaluate the performance levels and the main advantages that the DROP platform can offer. To this end, we decided to evaluate the DROP architecture through four main analysis and validation aspects, regarding the efficiency of its internal operations, the scalability of the overall distributed router architecture, and the data- and control-plane performance. Notwithstanding DROP's support of different topologies and layer 2 protocols in the internal private network, we decided to perform all of the tests with a star-switched Gigabit Ethernet LAN. This simple architecture allows us to obtain results that permit the evaluation of DROP performance in a clearer and more intuitive way than would be possible with a more complex network architecture. The distributed router is composed of a variable number of elements. Each FE and CE is a Linux SR with a HW platform based on two 3.0 GHz Intel Xeon Quad-Core processors and equipped with up to 8 Gigabit Ethernet network interfaces. Concerning testing and benchmarking tools, we used the Ixia N2X Router Tester, which allows us to generate and measure traffic flows accurately and to emulate routing protocols. All the tests have been carried out as many times as necessary to achieve a confidence interval of 3% and a confidence level of 95% on the measured results.
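The exact statistical procedure behind the stopping rule quoted above is not detailed in the text; read as "repeat each test until the 95% confidence interval half-width is within 3% of the measured mean", it can be sketched with the Python standard library and a normal approximation as follows (function and variable names are illustrative):

import statistics

def needs_more_runs(samples, rel_width=0.03, z=1.96):
    """Return True until the 95% CI half-width falls below rel_width of the mean."""
    if len(samples) < 2:
        return True
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5   # standard error of the mean
    return (z * sem) > rel_width * abs(mean)

# Example usage: keep measuring until the throughput estimate is tight enough.
# throughputs = []
# while needs_more_runs(throughputs):
#     throughputs.append(run_one_trial())    # run_one_trial() is a placeholder for a test run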


More information

IP - The Internet Protocol. Based on the slides of Dr. Jorg Liebeherr, University of Virginia

IP - The Internet Protocol. Based on the slides of Dr. Jorg Liebeherr, University of Virginia IP - The Internet Protocol Based on the slides of Dr. Jorg Liebeherr, University of Virginia Orientation IP (Internet Protocol) is a Network Layer Protocol. IP: The waist of the hourglass IP is the waist

More information

Networking: Network layer

Networking: Network layer control Networking: Network layer Comp Sci 3600 Security Outline control 1 2 control 3 4 5 Network layer control Outline control 1 2 control 3 4 5 Network layer purpose: control Role of the network layer

More information

Overview. Information About Layer 3 Unicast Routing. Send document comments to CHAPTER

Overview. Information About Layer 3 Unicast Routing. Send document comments to CHAPTER CHAPTER 1 This chapter introduces the basic concepts for Layer 3 unicast routing protocols in Cisco NX-OS. This chapter includes the following sections: Information About Layer 3 Unicast Routing, page

More information

IP Routing Volume Organization

IP Routing Volume Organization IP Routing Volume Organization Manual Version 20091105-C-1.03 Product Version Release 6300 series Organization The IP Routing Volume is organized as follows: Features IP Routing Overview Static Routing

More information

Planning for Information Network

Planning for Information Network Planning for Information Network Lecture 7: Introduction to IPv6 Assistant Teacher Samraa Adnan Al-Asadi 1 IPv6 Features The ability to scale networks for future demands requires a limitless supply of

More information

Table of Contents 1 MSDP Configuration 1-1

Table of Contents 1 MSDP Configuration 1-1 Table of Contents 1 MSDP Configuration 1-1 MSDP Overview 1-1 Introduction to MSDP 1-1 How MSDP Works 1-2 Multi-Instance MSDP 1-7 Protocols and Standards 1-7 MSDP Configuration Task List 1-7 Configuring

More information

Contents. Configuring EVI 1

Contents. Configuring EVI 1 Contents Configuring EVI 1 Overview 1 Layer 2 connectivity extension issues 1 Network topologies 2 Terminology 3 Working mechanism 4 Placement of Layer 3 gateways 6 ARP flood suppression 7 Selective flood

More information

Contents. Configuring MSDP 1

Contents. Configuring MSDP 1 Contents Configuring MSDP 1 Overview 1 How MSDP works 1 MSDP support for VPNs 6 Protocols and standards 6 MSDP configuration task list 7 Configuring basic MSDP features 7 Configuration prerequisites 7

More information

MPLS MULTI PROTOCOL LABEL SWITCHING OVERVIEW OF MPLS, A TECHNOLOGY THAT COMBINES LAYER 3 ROUTING WITH LAYER 2 SWITCHING FOR OPTIMIZED NETWORK USAGE

MPLS MULTI PROTOCOL LABEL SWITCHING OVERVIEW OF MPLS, A TECHNOLOGY THAT COMBINES LAYER 3 ROUTING WITH LAYER 2 SWITCHING FOR OPTIMIZED NETWORK USAGE MPLS Multiprotocol MPLS Label Switching MULTI PROTOCOL LABEL SWITCHING OVERVIEW OF MPLS, A TECHNOLOGY THAT COMBINES LAYER 3 ROUTING WITH LAYER 2 SWITCHING FOR OPTIMIZED NETWORK USAGE Peter R. Egli 1/21

More information

Internet Routing Protocols Part II

Internet Routing Protocols Part II Indian Institute of Technology Kharagpur Internet Routing Protocols Part II Prof. Indranil Sen Gupta Dept. of Computer Science & Engg. I.I.T. Kharagpur, INDIA Lecture 8: Internet routing protocols Part

More information

CSC 401 Data and Computer Communications Networks

CSC 401 Data and Computer Communications Networks CSC 401 Data and Computer Communications Networks Link Layer, Switches, VLANS, MPLS, Data Centers Sec 6.4 to 6.7 Prof. Lina Battestilli Fall 2017 Chapter 6 Outline Link layer and LANs: 6.1 introduction,

More information

Chapter 5.6 Network and Multiplayer

Chapter 5.6 Network and Multiplayer Chapter 5.6 Network and Multiplayer Multiplayer Modes: Event Timing Turn-Based Easy to implement Any connection type Real-Time Difficult to implement Latency sensitive 2 Multiplayer Modes: Shared I/O Input

More information

MPLS VPN--Inter-AS Option AB

MPLS VPN--Inter-AS Option AB The feature combines the best functionality of an Inter-AS Option (10) A and Inter-AS Option (10) B network to allow a Multiprotocol Label Switching (MPLS) Virtual Private Network (VPN) service provider

More information

HP Routing Switch Series

HP Routing Switch Series HP 12500 Routing Switch Series MPLS Configuration Guide Part number: 5998-3414 Software version: 12500-CMW710-R7128 Document version: 6W710-20121130 Legal and notice information Copyright 2012 Hewlett-Packard

More information

Experimental Extensions to RSVP Remote Client and One-Pass Signalling

Experimental Extensions to RSVP Remote Client and One-Pass Signalling 1 Experimental Extensions to RSVP Remote Client and One-Pass Signalling Industrial Process and System Communications, Darmstadt University of Technology Merckstr. 25 D-64283 Darmstadt Germany Martin.Karsten@KOM.tu-darmstadt.de

More information

Financial Services Design for High Availability

Financial Services Design for High Availability Financial Services Design for High Availability Version History Version Number Date Notes 1 March 28, 2003 This document was created. This document describes the best practice for building a multicast

More information

Software-Defined Networking (SDN) Overview

Software-Defined Networking (SDN) Overview Reti di Telecomunicazione a.y. 2015-2016 Software-Defined Networking (SDN) Overview Ing. Luca Davoli Ph.D. Student Network Security (NetSec) Laboratory davoli@ce.unipr.it Luca Davoli davoli@ce.unipr.it

More information

Configuring MSDP. Overview. How MSDP operates. MSDP peers

Configuring MSDP. Overview. How MSDP operates. MSDP peers Contents Configuring MSDP 1 Overview 1 How MSDP operates 1 MSDP support for VPNs 6 Protocols and standards 6 MSDP configuration task list 7 Configuring basic MSDP functions 7 Configuration prerequisites

More information

Deploying LISP Host Mobility with an Extended Subnet

Deploying LISP Host Mobility with an Extended Subnet CHAPTER 4 Deploying LISP Host Mobility with an Extended Subnet Figure 4-1 shows the Enterprise datacenter deployment topology where the 10.17.1.0/24 subnet in VLAN 1301 is extended between the West and

More information

HP 5920 & 5900 Switch Series

HP 5920 & 5900 Switch Series HP 5920 & 5900 Switch Series MPLS Configuration Guide Part number: 5998-4676a Software version: Release 23xx Document version: 6W101-20150320 Legal and notice information Copyright 2015 Hewlett-Packard

More information

Link layer: introduction

Link layer: introduction Link layer: introduction terminology: hosts and routers: nodes communication channels that connect adjacent nodes along communication path: links wired links wireless links LANs layer-2 packet: frame,

More information

Table of Contents. Cisco Introduction to EIGRP

Table of Contents. Cisco Introduction to EIGRP Table of Contents Introduction to EIGRP...1 Introduction...1 Before You Begin...1 Conventions...1 Prerequisites...1 Components Used...1 What is IGRP?...2 What is EIGRP?...2 How Does EIGRP Work?...2 EIGRP

More information

Configuring StackWise Virtual

Configuring StackWise Virtual Finding Feature Information, page 1 Restrictions for Cisco StackWise Virtual, page 1 Prerequisites for Cisco StackWise Virtual, page 2 Information About Cisco Stackwise Virtual, page 2 Cisco StackWise

More information

Multicast Communications

Multicast Communications Multicast Communications Multicast communications refers to one-to-many or many-tomany communications. Unicast Broadcast Multicast Dragkedja IP Multicasting refers to the implementation of multicast communication

More information

CS 268: Computer Networking. Taking Advantage of Broadcast

CS 268: Computer Networking. Taking Advantage of Broadcast CS 268: Computer Networking L-12 Wireless Broadcast Taking Advantage of Broadcast Opportunistic forwarding Network coding Assigned reading XORs In The Air: Practical Wireless Network Coding ExOR: Opportunistic

More information

Configuring multicast VPN

Configuring multicast VPN Contents Configuring multicast VPN 1 Multicast VPN overview 1 Multicast VPN overview 1 MD-VPN overview 3 Protocols and standards 6 How MD-VPN works 6 Share-MDT establishment 6 Share-MDT-based delivery

More information

Network Working Group. Category: Standards Track Juniper Networks J. Moy Sycamore Networks December 1999

Network Working Group. Category: Standards Track Juniper Networks J. Moy Sycamore Networks December 1999 Network Working Group Requests for Comments: 2740 Category: Standards Track R. Coltun Siara Systems D. Ferguson Juniper Networks J. Moy Sycamore Networks December 1999 OSPF for IPv6 Status of this Memo

More information

EEC-684/584 Computer Networks

EEC-684/584 Computer Networks EEC-684/584 Computer Networks Lecture 14 wenbing@ieee.org (Lecture nodes are based on materials supplied by Dr. Louise Moser at UCSB and Prentice-Hall) Outline 2 Review of last lecture Internetworking

More information

THE OSI MODEL. Application Presentation Session Transport Network Data-Link Physical. OSI Model. Chapter 1 Review.

THE OSI MODEL. Application Presentation Session Transport Network Data-Link Physical. OSI Model. Chapter 1 Review. THE OSI MODEL Application Presentation Session Transport Network Data-Link Physical OSI Model Chapter 1 Review By: Allan Johnson Table of Contents Go There! Go There! Go There! Go There! Go There! Go There!

More information

Lecture 3. The Network Layer (cont d) Network Layer 1-1

Lecture 3. The Network Layer (cont d) Network Layer 1-1 Lecture 3 The Network Layer (cont d) Network Layer 1-1 Agenda The Network Layer (cont d) What is inside a router? Internet Protocol (IP) IPv4 fragmentation and addressing IP Address Classes and Subnets

More information

Enhanced IGRP. Chapter Goals. Enhanced IGRP Capabilities and Attributes CHAPTER

Enhanced IGRP. Chapter Goals. Enhanced IGRP Capabilities and Attributes CHAPTER 40 CHAPTER Chapter Goals Identify the four key technologies employed by (EIGRP). Understand the Diffusing Update Algorithm (DUAL), and describe how it improves the operational efficiency of EIGRP. Learn

More information

ENTERPRISE MPLS. Kireeti Kompella

ENTERPRISE MPLS. Kireeti Kompella ENTERPRISE MPLS Kireeti Kompella AGENDA The New VLAN Protocol Suite Signaling Labels Hierarchy Signaling Advanced Topics Layer 2 or Layer 3? Resilience and End-to-end Service Restoration Multicast ECMP

More information

Forwarding Architecture

Forwarding Architecture Forwarding Architecture Brighten Godfrey CS 538 February 14 2018 slides 2010-2018 by Brighten Godfrey unless otherwise noted Building a fast router Partridge: 50 Gb/sec router A fast IP router well, fast

More information

Operation Administration and Maintenance in MPLS based Ethernet Networks

Operation Administration and Maintenance in MPLS based Ethernet Networks 199 Operation Administration and Maintenance in MPLS based Ethernet Networks Jordi Perelló, Luis Velasco, Gabriel Junyent Optical Communication Group - Universitat Politècnica de Cataluya (UPC) E-mail:

More information

Cisco Group Encrypted Transport VPN

Cisco Group Encrypted Transport VPN Cisco Group Encrypted Transport VPN Q. What is Cisco Group Encrypted Transport VPN? A. Cisco Group Encrypted Transport is a next-generation WAN VPN solution that defines a new category of VPN, one that

More information

IPv6: An Introduction

IPv6: An Introduction Outline IPv6: An Introduction Dheeraj Sanghi Department of Computer Science and Engineering Indian Institute of Technology Kanpur dheeraj@iitk.ac.in http://www.cse.iitk.ac.in/users/dheeraj Problems with

More information

Announcements. me your survey: See the Announcements page. Today. Reading. Take a break around 10:15am. Ack: Some figures are from Coulouris

Announcements.  me your survey: See the Announcements page. Today. Reading. Take a break around 10:15am. Ack: Some figures are from Coulouris Announcements Email me your survey: See the Announcements page Today Conceptual overview of distributed systems System models Reading Today: Chapter 2 of Coulouris Next topic: client-side processing (HTML,

More information

Measuring MPLS overhead

Measuring MPLS overhead Measuring MPLS overhead A. Pescapè +*, S. P. Romano +, M. Esposito +*, S. Avallone +, G. Ventre +* * ITEM - Laboratorio Nazionale CINI per l Informatica e la Telematica Multimediali Via Diocleziano, 328

More information

Routing Overview. Information About Routing CHAPTER

Routing Overview. Information About Routing CHAPTER 21 CHAPTER This chapter describes underlying concepts of how routing behaves within the ASA, and the routing protocols that are supported. This chapter includes the following sections: Information About

More information

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements

More information

What is an L3 Master Device?

What is an L3 Master Device? What is an L3 Master Device? David Ahern Cumulus Networks Mountain View, CA, USA dsa@cumulusnetworks.com Abstract The L3 Master Device (l3mdev) concept was introduced to the Linux networking stack in v4.4.

More information

Syed Mehar Ali Shah 1 and Bhaskar Reddy Muvva Vijay 2* 1-

Syed Mehar Ali Shah 1 and Bhaskar Reddy Muvva Vijay 2* 1- International Journal of Basic and Applied Sciences Vol. 3. No. 4 2014. Pp. 163-169 Copyright by CRDEEP. All Rights Reserved. Full Length Research Paper Improving Quality of Service in Multimedia Applications

More information

MPLS VPN Inter-AS Option AB

MPLS VPN Inter-AS Option AB First Published: December 17, 2007 Last Updated: September 21, 2011 The feature combines the best functionality of an Inter-AS Option (10) A and Inter-AS Option (10) B network to allow a Multiprotocol

More information

Table of Contents 1 Static Routing Configuration RIP Configuration 2-1

Table of Contents 1 Static Routing Configuration RIP Configuration 2-1 Table of Contents 1 Static Routing Configuration 1-1 Introduction 1-1 Static Route 1-1 Default Route 1-1 Application Environment of Static Routing 1-1 Configuring a Static Route 1-2 Configuration Prerequisites

More information

Configuring IP Multicast Routing

Configuring IP Multicast Routing 34 CHAPTER This chapter describes how to configure IP multicast routing on the Cisco ME 3400 Ethernet Access switch. IP multicasting is a more efficient way to use network resources, especially for bandwidth-intensive

More information

Operation Manual IPv4 Routing H3C S3610&S5510 Series Ethernet Switches. Table of Contents

Operation Manual IPv4 Routing H3C S3610&S5510 Series Ethernet Switches. Table of Contents Table of Contents Table of Contents Chapter 1 Static Routing Configuration... 1-1 1.1 Introduction... 1-1 1.1.1 Static Route... 1-1 1.1.2 Default Route... 1-1 1.1.3 Application Environment of Static Routing...

More information

Data Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies

Data Link Layer. Our goals: understand principles behind data link layer services: instantiation and implementation of various link layer technologies Data Link Layer Our goals: understand principles behind data link layer services: link layer addressing instantiation and implementation of various link layer technologies 1 Outline Introduction and services

More information

Outline. CS5984 Mobile Computing. Host Mobility Problem 1/2. Host Mobility Problem 2/2. Host Mobility Problem Solutions. Network Layer Solutions Model

Outline. CS5984 Mobile Computing. Host Mobility Problem 1/2. Host Mobility Problem 2/2. Host Mobility Problem Solutions. Network Layer Solutions Model CS5984 Mobile Computing Outline Host Mobility problem and solutions IETF Mobile IPv4 Dr. Ayman Abdel-Hamid Computer Science Department Virginia Tech Mobile IPv4 1 2 Host Mobility Problem 1/2 Host Mobility

More information

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises Full paper available at http://www.cs.princeton.edu/~chkim Changhoon Kim, Matthew Caesar, and Jennifer Rexford Outline of Today

More information

Open Shortest Path First (OSPF)

Open Shortest Path First (OSPF) CHAPTER 42 Open Shortest Path First (OSPF) Background Open Shortest Path First (OSPF) is a routing protocol developed for Internet Protocol (IP) networks by the interior gateway protocol (IGP) working

More information

LECTURE 8. Mobile IP

LECTURE 8. Mobile IP 1 LECTURE 8 Mobile IP What is Mobile IP? The Internet protocol as it exists does not support mobility Mobile IP tries to address this issue by creating an anchor for a mobile host that takes care of packet

More information

HP Load Balancing Module

HP Load Balancing Module HP Load Balancing Module High Availability Configuration Guide Part number: 5998-2687 Document version: 6PW101-20120217 Legal and notice information Copyright 2012 Hewlett-Packard Development Company,

More information

Outline. CS6504 Mobile Computing. Host Mobility Problem 1/2. Host Mobility Problem 2/2. Dr. Ayman Abdel-Hamid. Mobile IPv4.

Outline. CS6504 Mobile Computing. Host Mobility Problem 1/2. Host Mobility Problem 2/2. Dr. Ayman Abdel-Hamid. Mobile IPv4. CS6504 Mobile Computing Outline Host Mobility problem and solutions IETF Mobile IPv4 Dr. Ayman Abdel-Hamid Computer Science Department Virginia Tech Mobile IPv4 1 2 Host Mobility Problem 1/2 Host Mobility

More information

Configuring Rapid PVST+

Configuring Rapid PVST+ This chapter describes how to configure the Rapid per VLAN Spanning Tree (Rapid PVST+) protocol on Cisco NX-OS devices using Cisco Data Center Manager (DCNM) for LAN. For more information about the Cisco

More information

Lab 4: Routing using OSPF

Lab 4: Routing using OSPF Network Topology:- Lab 4: Routing using OSPF Device Interface IP Address Subnet Mask Gateway/Clock Description Rate Fa 0/0 172.16.1.17 255.255.255.240 ----- R1 LAN R1 Se 0/0/0 192.168.10.1 255.255.255.252

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Contents. EVPN overview 1

Contents. EVPN overview 1 Contents EVPN overview 1 EVPN network model 1 MP-BGP extension for EVPN 2 Configuration automation 3 Assignment of traffic to VXLANs 3 Traffic from the local site to a remote site 3 Traffic from a remote

More information

IPv6 PIM. Based on the forwarding mechanism, IPv6 PIM falls into two modes:

IPv6 PIM. Based on the forwarding mechanism, IPv6 PIM falls into two modes: Overview Protocol Independent Multicast for IPv6 () provides IPv6 multicast forwarding by leveraging static routes or IPv6 unicast routing tables generated by any IPv6 unicast routing protocol, such as

More information

Configuring STP and RSTP

Configuring STP and RSTP 7 CHAPTER Configuring STP and RSTP This chapter describes the IEEE 802.1D Spanning Tree Protocol (STP) and the ML-Series implementation of the IEEE 802.1W Rapid Spanning Tree Protocol (RSTP). It also explains

More information

Configuring IP Multicast Routing

Configuring IP Multicast Routing 39 CHAPTER This chapter describes how to configure IP multicast routing on the Catalyst 3560 switch. IP multicasting is a more efficient way to use network resources, especially for bandwidth-intensive

More information

Finish Network Layer Start Transport Layer. CS158a Chris Pollett Apr 25, 2007.

Finish Network Layer Start Transport Layer. CS158a Chris Pollett Apr 25, 2007. Finish Network Layer Start Transport Layer CS158a Chris Pollett Apr 25, 2007. Outline OSPF BGP IPv6 Transport Layer Services Sockets Example Socket Program OSPF We now look at routing in the internet.

More information

Chapter Motivation For Internetworking

Chapter Motivation For Internetworking Chapter 17-20 Internetworking Part 1 (Concept, IP Addressing, IP Routing, IP Datagrams, Address Resolution 1 Motivation For Internetworking LANs Low cost Limited distance WANs High cost Unlimited distance

More information

Netlink2 as ForCES protocol (update)

Netlink2 as ForCES protocol (update) 57 th IETF, Juy 14 th, 2003 Netlink2 as ForCES protocol (update) draft-jhsrha-forces-netlink2-01.txt presentation available online at http://www.zurich.ibm.com/~rha/netlink2-1.pdf Robert Haas, IBM Research

More information

Table of Contents 1 OSPF Configuration 1-1

Table of Contents 1 OSPF Configuration 1-1 Table of Contents 1 OSPF Configuration 1-1 Introduction to OSPF 1-1 Basic Concepts 1-2 OSPF Area Partition 1-4 Router Types 1-7 Classification of OSPF Networks 1-9 DR and BDR 1-9 OSPF Packet Formats 1-11

More information

Cisco ASR 1000 Series Aggregation Services Routers: QoS Architecture and Solutions

Cisco ASR 1000 Series Aggregation Services Routers: QoS Architecture and Solutions Cisco ASR 1000 Series Aggregation Services Routers: QoS Architecture and Solutions Introduction Much more bandwidth is available now than during the times of 300-bps modems, but the same business principles

More information

Chapter 5 Network Layer: The Control Plane

Chapter 5 Network Layer: The Control Plane Chapter 5 Network Layer: The Control Plane A note on the use of these Powerpoint slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you

More information

6 MPLS Model User Guide

6 MPLS Model User Guide 6 MPLS Model User Guide Multi-Protocol Label Switching (MPLS) is a multi-layer switching technology that uses labels to determine how packets are forwarded through a network. The first part of this document

More information

ET4254 Communications and Networking 1

ET4254 Communications and Networking 1 Topic 9 Internet Protocols Aims:- basic protocol functions internetworking principles connectionless internetworking IP IPv6 IPSec 1 Protocol Functions have a small set of functions that form basis of

More information

A MAC Layer Abstraction for Heterogeneous Carrier Grade Mesh Networks

A MAC Layer Abstraction for Heterogeneous Carrier Grade Mesh Networks ICT-MobileSummit 2009 Conference Proceedings Paul Cunningham and Miriam Cunningham (Eds) IIMC International Information Management Corporation, 2009 ISBN: 978-1-905824-12-0 A MAC Layer Abstraction for

More information

Data Plane Monitoring in Segment Routing Networks Faisal Iqbal Cisco Systems Clayton Hassen Bell Canada

Data Plane Monitoring in Segment Routing Networks Faisal Iqbal Cisco Systems Clayton Hassen Bell Canada Data Plane Monitoring in Segment Routing Networks Faisal Iqbal Cisco Systems (faiqbal@cisco.com) Clayton Hassen Bell Canada (clayton.hassen@bell.ca) Reference Topology & Conventions SR control plane is

More information

Lecture 2: Basic routing, ARP, and basic IP

Lecture 2: Basic routing, ARP, and basic IP Internetworking Lecture 2: Basic routing, ARP, and basic IP Literature: Forouzan, TCP/IP Protocol Suite: Ch 6-8 Basic Routing Delivery, Forwarding, and Routing of IP packets Connection-oriented vs Connectionless

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Better Approach To Mobile Adhoc Networking

Better Approach To Mobile Adhoc Networking Better Approach To Mobile Adhoc Networking batman-adv - Kernel Space L2 Mesh Routing Martin Hundebøll Aalborg University, Denmark March 28 th, 2014 History of batman-adv The B.A.T.M.A.N. protocol initiated

More information

Global IP Network System Large-Scale, Guaranteed, Carrier-Grade

Global IP Network System Large-Scale, Guaranteed, Carrier-Grade Global Network System Large-Scale, Guaranteed, Carrier-Grade 192 Global Network System Large-Scale, Guaranteed, Carrier-Grade Takanori Miyamoto Shiro Tanabe Osamu Takada Shinobu Gohara OVERVIEW: traffic

More information

Configuring Rapid PVST+ Using NX-OS

Configuring Rapid PVST+ Using NX-OS Configuring Rapid PVST+ Using NX-OS This chapter describes how to configure the Rapid per VLAN Spanning Tree (Rapid PVST+) protocol on Cisco NX-OS devices. This chapter includes the following sections:

More information

Integrated Services. Integrated Services. RSVP Resource reservation Protocol. Expedited Forwarding. Assured Forwarding.

Integrated Services. Integrated Services. RSVP Resource reservation Protocol. Expedited Forwarding. Assured Forwarding. Integrated Services An architecture for streaming multimedia Aimed at both unicast and multicast applications An example of unicast: a single user streaming a video clip from a news site An example of

More information

UNIT IV -- TRANSPORT LAYER

UNIT IV -- TRANSPORT LAYER UNIT IV -- TRANSPORT LAYER TABLE OF CONTENTS 4.1. Transport layer. 02 4.2. Reliable delivery service. 03 4.3. Congestion control. 05 4.4. Connection establishment.. 07 4.5. Flow control 09 4.6. Transmission

More information

Introduction to MPLS APNIC

Introduction to MPLS APNIC Introduction to MPLS APNIC Issue Date: [201609] Revision: [01] What is MPLS? 2 Definition of MPLS Multi Protocol Label Switching Multiprotocol, it supports ANY network layer protocol, i.e. IPv4, IPv6,

More information

Configure SR-TE Policies

Configure SR-TE Policies This module provides information about segment routing for traffic engineering (SR-TE) policies, how to configure SR-TE policies, and how to steer traffic into an SR-TE policy. About SR-TE Policies, page

More information