
ON THE DESIGN, PERFORMANCE, AND MANAGEMENT OF VIRTUAL NETWORKS FOR GRID COMPUTING

By

MAURÍCIO TSUGAWA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

© 2009 Maurício Tsugawa

To all who, directly or indirectly, contributed to make this milestone possible

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Prof. José Fortes, for his invaluable guidance and support throughout my many years as a graduate student. Prof. Fortes presented me with opportunities to pursue challenging research problems; always made time to work with me through his busy schedule; and shared his experience, helping me become a better researcher. I would like to thank Prof. Alan George, Prof. Renato Figueiredo, and Prof. Shigang Chen for serving on my supervisory committee. I would like to thank Tim Freeman and Dr. Kate Keahey for their support with Nimbus clouds, and for the opportunity to work in a "flash" internship at the Argonne National Laboratory. I would like to thank Dr. Ananth Sundararaj, Dr. Peter Dinda, Dr. Sebastien Goasguen, Dr. Sumalatha Adabala and Rick Kennel for their help in setting up the test environments at Northwestern University and Purdue University. I am thankful to Andréa Matsunaga for all her help. Special thanks to my sister, Dr. Márcia Rupnow, who always supported me.

This work was funded in part by the National Science Foundation under Grants No. EIA , EIA , ACI , EEC , EIA , EIA , OCI and CNS ; NSF Middleware Initiative (NMI) collaborative grants ANI /ANI , SCI . I would also like to acknowledge the BellSouth Foundation, SUR grants from IBM, and gifts from VMware Corporation and Cyberguard.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Grid Computing
    Grid Networking Problem
    Solution Overview
    Contributions
    Organization

2 BACKGROUND
    Network Infrastructure for Grid Computing - Problem Description
    Grid Deployment Difficulties in the Internet
    Virtual Network Approaches
        Virtual LAN (VLAN)
        Virtual Private Network (VPN)
        VNET - Northwestern University
        SoftUDC vnet - HP Labs
        VIOLIN - Purdue University
        X-Bone - University of Southern California
        RON - Massachusetts Institute of Technology
        Peer-to-peer (P2P) overlay networks
        OCALA - University of California, Berkeley
        IPOP - University of Florida
        LogMeIn Hamachi
    Summary

3 ON THE DESIGN OF VIRTUAL NETWORKS
    Virtual Networks for Grid Computing
        Network Address Space
        Network Interface
        Routing
        Firewall traversal
    Design of a Virtual Network (ViNe) Architecture for Grid Computing
        ViNe Address Space and ViNe addresses
        ViNe Node Configuration
        ViNe Infrastructure
        Firewall traversal
        ViNe routing
        Multiple Isolated Virtual Networks
        Putting it all together
    Discussion
    Security Considerations
    ViNe Prototype Implementation
        VR-software components
            Configuration Module
            Packet Interception Module
            Packet Injection Module
            Routing Module
        ViNe Prototype
        VR performance
        ViNe performance

4 ON THE PERFORMANCE OF VIRTUAL NETWORKS
    Characterizing Network Virtualization
        Experimental Setup
        Virtual Network Processing
            Encapsulation Overhead
            Packet Interception
            Packet Injection
            Routing
            Virtual Links
            Cryptographic Operations
            Compression
        Discussion
    IP Forwarding Performance
        IP Fragmentation
        Packet Interception vs. Copy
        Java Forwarder Performance Tuning
        Effects of Virtual Network Processing Time
        Using Worker Threads
        Case Study: OpenVPN
        Improving ViNe
    Summary

5 ON THE MANAGEMENT OF VIRTUAL NETWORKS
    Managed Network
    Challenges in Network Management
    User-level Virtual Network Management
        Security
        Configuration and Operation
        Monitoring and Tuning
    ViNe Management Architecture
        ViNe Authority
        Address Allocation
        VN Creation and Tear-down
        VN Merging and Splitting
        VN Membership

6 VIRTUAL NETWORK SUPPORT FOR CLOUD COMPUTING
    Networking in Cloud Environments
        Network Protection in IaaS
        User-level Network Virtualization in IaaS
    Enabling Sky Computing
    TinyViNe Middleware
        Architecture and Organization
        Avoiding L2 Communication
        Avoiding Packet Filtering
        TinyViNe Overlay Setup and Management
    Evaluation
        BLASTing On the Sky
        TinyViNe Overheads
        TinyViNe intrusion on other applications
        TinyViNe impact on communication-intensive applications
    Summary

7 CONCLUSIONS
    Summary
    Future Work

APPENDIX: VINE MIDDLEWARE
    Source Code
    Building ViNe
    Binary Code
    Running ViNe
    Configuration
    TinyViNe

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 VLAN characteristics
2-2 VPN characteristics
2-3 VNET characteristics
2-4 vnet characteristics
2-5 VIOLIN characteristics
2-6 X-Bone characteristics
2-7 RON characteristics
P2P networks characteristics
OCALA characteristics
IPOP characteristics
LogMeIn Hamachi characteristics
Private IP address space
ViNe terminology
VR configuration parameters
LNDT entry information
GNDT entry information
NSKT entry information
ViNe header fields
ViNe performance experimental results
Maximum UDP/TCP throughput (Mbps) with (frag) and without IP fragmentation
Maximum UDP/TCP throughput and round-trip latency of forwarders
Maximum UDP/TCP throughput and round-trip latency of OpenVPN
ViNe management roles
5-2 ViNe management operations
Qualitative comparison of existing ON solutions
Characteristics of VMs in different providers
Virtual cluster distribution
CPU utilization distribution with execution of matrix multiplication and network intensive application between UC and PU VMs
ViNe characteristics

LIST OF FIGURES

1-1 The layered grid architecture
1-2 Virtual Network (ViNe) architecture for grid computing
2-1 IEEE 802.1Q tag in Ethernet frame
2-2 VLAN-aware network
2-3 Site-to-site VPN and user-to-LAN VPN
2-4 VMs connected to their owners' (Virtuoso client) network
2-5 Multiple isolated private LANs of Xen-based VMs enabled by vnet
2-6 Private LANs of VMs created by VIOLIN
2-7 Managed overlay networks in X-Bone
2-8 RON improves the robustness of Internet paths
Legacy applications are bridged to multiple overlay networks using OCALA
OS routing tables
Internet routing versus virtual network routing
Encapsulation used in TCP/IP over Ethernet
Virtual network datagram
ViNe architecture
IP aliasing configuration
OS routing table manipulation
Firewall traversal in ViNe
LNDT and GNDT examples
ViNe at work example
VR components
Intercepting packets for VR software processing
3-13 Packet injection using libnet
Routing Module
ViNe header
VR performance experimental setup
VR performance results
ViNe performance experimental setup
Experimental setup
Overlay network processing system
Maximum TCP and UDP throughput versus VN header overhead
Packet interception performance of TUN/TAP devices, raw sockets and queue interface of Netfilter
IP packet injection performance of TUN/TAP devices and raw sockets
Routing table access time using Java hash tables and arrays
UDP and TCP performance
Processing time of symmetric key encryption for different algorithms in Java
Processing time of MD5 and SHA1 in Java versus length of data
Processing time of Java-based compression and expansion of text, binary and image files, divided in fixed block sizes
Pseudo-code of the developed IP packet forwarders
Effects of network processing during packet forwarding
Effect of using worker threads for packet processing
ViNe Management Architecture
ViNe merge example
TinyVR: a stripped-down version of a FullVR
TinyViNe deployment
6-3 Speedup performance of BLAST when processing 960 sequences against the 2007 NR database
Time spent using secure copy to transfer files of different sizes with different WAN interconnections
A-1 ViNe source tree
A-2 Partial list of ViNe configuration parameters

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ON THE DESIGN, PERFORMANCE AND MANAGEMENT OF VIRTUAL NETWORKS FOR GRID COMPUTING

By Maurício Tsugawa

August 2009

Chair: José A. B. Fortes
Major: Electrical and Computer Engineering

Contemporary distributed systems, exemplified by grid environments, involve resources spanning multiple administrative domains. Existing networking techniques and solutions do not necessarily apply, since they are generally designed to be used in enterprise networks, i.e., a single administrative domain. The presence of firewalls and network address translation devices compromises connectivity among resources across different sites, and considerable effort is required from site administrators to offer, when possible, a computing environment for distributed applications. In this scenario, grid administrators would need privileged access to the core network equipment of every site, and possibly of network providers, in order to manage grid networking, an unrealistic requirement. Even when resource providers agree to release control of network equipment, heterogeneity in terms of vendor, hardware and firmware makes the management and operation of such an environment difficult. This dissertation advocates the use of user-level network virtualization to address the networking problems in grid environments, since such virtualization can be designed to require no changes in the core network infrastructure and can be easily deployed on regular computers, i.e., desktops, servers, and laptops. To this end, this work (1) describes the design of

a virtual network infrastructure that identifies and satisfies grid networking needs; (2) thoroughly investigates implementation options and their implications for virtual network performance; and (3) presents a virtual network management architecture that can help both providers and end users in the operation of virtual networks. The results of this research are reflected in the ViNe middleware package, which implements a flexible virtual network approach based on the deployment of user-level virtual routers, i.e., machines loaded with ViNe processing software. ViNe software allows dynamic run-time configuration, a feature that facilitates the management of virtual networks through tools and middleware that hide the complexity of configuration processes. ViNe routes packets at rates in excess of 800 Mbps, the best user-level virtual network performance reported to date. Finally, mechanisms to address networking challenges unique to cloud computing environments are developed and evaluated using an extended version of ViNe, called TinyViNe. TinyViNe enables end users to deploy virtual networks on cloud environments without the need for specialized networking knowledge.

CHAPTER 1
INTRODUCTION

Grid Computing

A fundamental goal of grid computing is to share resources distributed across institutions among users. Resource sharing must be highly controlled, with resource providers and consumers clearly defining all sharing rules and policies [1]. These sharing rules and policies define a collection of individuals, institutions and resources called a virtual organization (VO) [1]. Significant progress has been made on enabling distributed computational grids and cyberinfrastructures [2], and today there are several deployments of grid computing infrastructures [3][4][5][6][7][8][9][10][11][12][13] that enable users to access shared resources across several organizations. A key component, necessary for grid computing, is the communication infrastructure. Grid protocols and mechanisms depend on connectivity among resources, e.g., Secure Shell (SSH) [14], the Grid Security Infrastructure (GSI) [15], Condor [16], the Portable Batch System (PBS) [17] and Web Services. Currently, the Internet is used as the main communication medium when multiple geographically dispersed organizations are involved in a collaborative effort.

Figure 1-1 illustrates the layered grid architecture proposed in [18]. The Fabric layer provides access to shared resources (e.g., computers, networks, storage, and sensors), with sharing rules and policies being defined and implemented by grid protocols in upper layers. The Resource and Connectivity layer defines protocols for secure communication among Fabric layer resources. Note that the Resource and Connectivity layer depends on network resources of the Fabric layer and builds on the Internet protocols to define Grid protocols such as GSI. The Collective layer defines Application Programming Interfaces (APIs) that capture interactions among multiple resources. It is based on information protocols that keep track of the structure

and the state of grid resources, and on management protocols that mediate access to resources. Finally, applications are implemented by calling services at any layer. Grid computing research and development has focused on defining and implementing the Resource and Connectivity and upper layer protocols, reusing as much as possible the existing fabric components, protocols and interfaces.

Figure 1-1. The layered grid architecture.

Grid Networking Problem

Each resource provider needs at least one network resource that can offer the necessary services to establish communication among shared resources (intra- and inter-organization). When such services are not available, resources cannot be incorporated into a grid. A typical network resource is a Local Area Network (LAN) connected to the Internet through an Internet Service Provider (ISP). When resources connected to different LANs are involved in a grid deployment, network resources need to offer connectivity among resources crossing LAN boundaries. In many cases, network resources lack the capability of offering connectivity to remote resources, and grid deployments become challenging for the following reasons:

a. The presence of network devices that perform packet filtering, mangling and network address translation breaks the bi-directional, full connectivity between nodes: communication between two processes A and B can only be established if it is initiated by process A. The majority of distributed programs expect full connectivity among nodes, and only resources connected to the same private network or domain can be used [19].

b. Most grid computing middleware projects are not designed for connectivity-limited network environments, and in general, the middleware components themselves depend on full connectivity among nodes. Projects that do deal with network connectivity expose new APIs to applications, which creates obstacles when grid-enabling existing applications [20].

c. Reconfiguration of network infrastructure equipment (e.g., switches, routers and firewalls) to adapt the network for grid computing requires effort and the active participation of system and network administrators from all organizations. Other factors, such as differences in network equipment (with vendor-dependent features and implementations), geographical location, ISP services and local network policies in each organization, make the grid network management task very challenging.

Requiring network resources to offer advanced capabilities such as virtual networking, performance monitoring and traffic prioritization would allow upper layer protocols, especially the resource management related ones, to implement interesting resource allocation strategies. At the same time, it makes grid deployment very difficult, since these requirements are not easy to satisfy. The following problems need to be addressed by grid fabric network resources:

- Any pair of grid resources needs to be able to communicate. However, due to the connectivity-limiting devices present in the Internet, a network resource may not be globally addressable. Network resources need to incorporate mechanisms to overcome connectivity limitations in the Internet so that grid resources can communicate without depending on ISP services or on the reconfiguration of network equipment in each participating organization.

- While grid infrastructures can benefit from new network architectures designed based on lessons learned from the Internet architecture, deployment difficulties make these architectures impractical for the near future (note the slow transition from Internet Protocol version 4, IPv4 or simply IP, to version 6, IPv6). Therefore, solutions that can be easily deployed in the current Internet are needed.

- Grid-wise mechanisms to configure and manage network resources do not exist. Current network equipment management is based on vendor-dependent implementations designed for enterprise and corporate networks. Existing solutions are not adequate when multiple administrative domains are involved. If network resources can support the deployment of

isolated networks on demand, grid resource management can be improved significantly, in a way similar to how Virtual LAN (VLAN) technology changed enterprise network management.

- Grid-wise network monitoring, mechanisms to specify Quality-of-Service (QoS) (possibly on a per-VO basis) and mechanisms to enforce network policies do not exist.

Solutions need to be designed under the following constraints:

- Improvements to network resources need to be implemented without disrupting existing services, as Grid technologies are built on Internet mechanisms and protocols.
- Solutions must be platform independent, as Grid environments are heterogeneous and integrate different Central Processing Unit (CPU) architectures and operating systems (OS).
- Features added to network resources need to export existing interfaces to applications, so that existing applications can be used without modification.
- Resources should not require complex configuration in order to use new network services.

Many existing network services are not well suited for grid environments. For example, VLAN (detailed in Chapter 2) cannot cross corporate or campus network boundaries. In addition, since it is implemented in network equipment (switches and routers), it is challenging to introduce modifications and improvements to the technology. Existing techniques to overcome connectivity limitations are not adequate for incorporation into grid network resources because of at least one of the following reasons:

- New network APIs are exposed, and existing applications cannot be executed;
- Resources are required to run software for additional network processing, compromising platform independence and requiring non-trivial configuration;
- Absence of support for isolated networks requires firewalls for traffic isolation, making the environment similar to the Internet;
- Special equipment or services from ISPs are required, making a solution not generally applicable;
- Performance overheads are high enough to make a solution impractical.

In this work, mechanisms and protocols to improve the services offered by network resources in the grid fabric layer are proposed. Virtualization techniques are applied to network resources in order to control the network in its virtual domain. This allows virtual network resources to expose the same interfaces and services as the physical network entities, and also allows the addition of new services. Prototype implementations are used to assess feasibility, performance and management aspects.

Solution Overview

Figure 1-2. Virtual Network (ViNe) architecture for grid computing. Multiple independent virtual networks (one for each virtual organization) are overlaid on top of the Internet. ViNe routers (VRs) control virtual network traffic. Management of VNs is accomplished by dynamically reconfiguring VRs.

An efficient and flexible communication infrastructure for grid computing can be achieved by applying virtualization concepts to the physical network infrastructure. Figure 1-2 gives a brief overview of the proposed virtual network architecture. The proposed Virtual Network (ViNe) architecture [21] is based on virtualization techniques applied to network packet routers, called ViNe routers (VRs). Entities that participate in a grid use VRs as the means to reach each other when crossing LAN boundaries. VRs work with carefully designed data structures (routing tables) that allow multiple independent and isolated virtual networks (VNs) to be overlaid on top of the physical network infrastructure. ViNe addresses the connectivity limitations imposed by firewalls¹ by establishing the necessary communication channels among VRs.

The fundamental properties of virtualization technologies identified in [3], namely manifolding, multiplexing and polymorphism, can be observed in ViNe. Multiple independent and isolated virtual networks (manifolding) are overlaid on top of the physical infrastructure (multiplexing). Each virtual network can further be customized in terms of QoS, access control and supported higher layer network protocols (polymorphism).

ViNe has been designed to address the previously listed problems. Each VR is implemented as user-level software able to perform the necessary connectivity recovery with very low performance overheads. By manipulating routing tables, VR behavior can be changed dynamically, serving as the base for the virtual network management infrastructure.

¹ Personal firewalls or programs installed in individual nodes can also compromise end-to-end bi-directional connectivity. Network virtualization techniques, in general, do not deal with personal firewall traversal.
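The role of the VR routing tables in keeping VNs isolated can be illustrated with a short sketch. The Java fragment below is illustrative only: the class, method and host names are invented, the /24-prefix lookup is a simplification, and the actual ViNe data structures are described in Chapter 3. It shows how keeping one routing table per VN makes routes in different VNs independent, even when their virtual address ranges overlap.

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Minimal sketch (not the actual ViNe implementation) of how a user-level
     * virtual router (VR) could keep one routing table per virtual network (VN),
     * so that multiple isolated VNs share the same physical infrastructure.
     */
    public class VnRoutingSketch {
        /** Maps a VN identifier to that VN's own routing table. */
        private final Map<Integer, Map<String, String>> tables = new HashMap<>();

        /** Adds a route: packets of VN 'vnId' destined to 'prefix' go to VR 'nextHopVr'. */
        public void addRoute(int vnId, String prefix, String nextHopVr) {
            tables.computeIfAbsent(vnId, id -> new HashMap<>()).put(prefix, nextHopVr);
        }

        /**
         * Looks up the next-hop VR for a destination address, assuming /24 prefixes
         * for simplicity. A lookup never crosses VN boundaries, which is what keeps
         * the overlaid virtual networks isolated from one another.
         */
        public String nextHop(int vnId, String destIp) {
            Map<String, String> table = tables.get(vnId);
            if (table == null) return null;                 // unknown VN: drop
            String prefix = destIp.substring(0, destIp.lastIndexOf('.')) + ".0/24";
            return table.get(prefix);                       // null means no route: drop
        }

        public static void main(String[] args) {
            VnRoutingSketch vr = new VnRoutingSketch();
            vr.addRoute(1, "172.16.1.0/24", "vr-siteA.example.org");
            vr.addRoute(2, "172.16.1.0/24", "vr-siteB.example.org"); // same prefix, other VN
            System.out.println(vr.nextHop(1, "172.16.1.7")); // vr-siteA.example.org
            System.out.println(vr.nextHop(2, "172.16.1.7")); // vr-siteB.example.org
        }
    }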

Contributions

In this research, the benefits of user-level network virtualization are explored in order to deliver a low-overhead, managed, user-level network virtualization infrastructure for grid computing. Network virtualization is systematically studied in terms of architectural design, performance and management.

Design: A virtual network architecture for grid computing, called ViNe, is proposed, implemented and evaluated. The networking services needed by grid infrastructures are first identified. Existing technologies and concepts are studied in order to unfold the architectural design space of virtual networks. Unique features of ViNe include (1) an application-transparent connectivity recovery mechanism that is based on regular Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) communication, completely independent of physical infrastructure services (i.e., it can work with any type of firewall); (2) dynamically reconfigurable routers that facilitate the management and operation of mutually independent virtual networks; and (3) flexibility in terms of number of routers, overlay network topology and virtual network address allocation.

Performance: Network virtualization incurs overheads, and it is important to understand and analyze their sources. The sub-components of virtual network architectures have been investigated and the implementation options for each component evaluated in order to expose performance tuning opportunities. The study shows that it is possible for user-level network virtualization software to perform with low overheads. Surprisingly, high-level language implementations (e.g., Java) were able to perform at the level of C implementations.

Management: The virtual network management process should minimize interactions with human resources (network and system administrators), as coordination among participants is difficult, especially when multiple organizations in different geographical locations (possibly in different time zones) are involved. Leaving management in its entirety to grid middleware may not be desirable, and is only possible when full trust among participants can be established. It is important to balance the management task, i.e., while some control is given to grid middleware, allowing tasks to be automated, critical aspects of management should remain the responsibility of site administrators. To this end, a ViNe management infrastructure that exposes high-level interfaces is presented. For network administrators, a set of services allows for the exploration of the full configuration flexibility of ViNe. For end users, an auto-configuration mechanism enables a download-and-run mode of operation. For middleware, interfaces that can be programmatically invoked are provided.

Organization

Chapter 2 presents the necessary background for this work. First, the requirements for a grid-computing-friendly network infrastructure are introduced. Difficulties of using the Internet for grid computing are discussed, and existing solutions that address some of the problems are then presented. Each solution is carefully examined to expose its strengths and weaknesses.

Chapter 3 studies the design of virtual networks. In particular, important features related to address space, network interface, routing and firewall traversal are studied to support the design decisions of the proposed ViNe architecture. ViNe is the result of exploring the virtual network design space and incorporating the features necessary for computational grids in an efficient manner. Implementation details and a performance evaluation of a prototype are presented at the end of the chapter.

After exploring the architectural design space in Chapter 3, the implementation space is explored in Chapter 4, targeting performance tuning opportunities. The tasks executed by virtual network processing software are identified, and the components performing these tasks are characterized. Prototype programs that combine all components, developed in the C and Java languages, are evaluated experimentally and the performance results analyzed.

Chapter 5 presents the design of a virtual network management system. The basic set of requirements to properly drive the ViNe Infrastructure (described in Chapter 3) is identified and described, providing the basis for the ViNe Management interfaces.

In Chapter 6, the lessons learned in terms of architecture, performance and management are used to develop the virtual network extensions necessary to support cloud computing. Chapter 7 summarizes this dissertation and presents future directions.

CHAPTER 2
BACKGROUND

This chapter provides the necessary background for this work. Requirements for a grid-computing-friendly network infrastructure, and the difficulties the Internet has in satisfying those requirements, are presented. An overview of existing solutions designed to address connectivity limitations on the Internet highlights the strengths and weaknesses of each approach. The efficacy of the solutions when applied to grid computing is then analyzed.

Network Infrastructure for Grid Computing - Problem Description

As systems that execute distributed computation, grids require communication among participating entities. Many grid middleware components and services have been designed and implemented considering the ideal Internet connectivity model, where all nodes are considered peers. They also rely on existing Internet services such as domain name services (DNS) and the web. Modifying or replacing existing Internet services is impractical due to the large number of affected network devices. An efficient network infrastructure is essential for a successful grid deployment. In particular, the following features are required:

a. Full connectivity: applications are typically designed and implemented considering the LAN model for communication: processes can initiate connections (client processes) and/or wait for connections (server processes). In grids, a large number of administrative domains, possibly scattered across distant geographical locations, are involved, and a Wide Area Network (WAN) is necessary to connect the different sites. In the current Internet, network devices performing packet filtering and/or mangling (e.g., firewalls) compromise full connectivity among end nodes. In the best-case scenario, the process of reconfiguring those devices in order to recover the necessary connectivity would need the participation of administrators from all sites. In general, recovering connectivity by reconfiguring network equipment is impractical (or impossible) due to the lack of necessary support from ISPs or to network equipment incompatibility issues, which are common in grid deployments, as each participating site has a different installation.

b. Application-friendly network programming model: distributed applications are typically implemented by making use of the Berkeley sockets APIs [22]. A network infrastructure that requires applications to use a programming model other than Berkeley sockets

would require existing applications to be modified. Many applications, although actively in use, lack support for several reasons, including discontinued and end-of-life products. Source code is often not available and almost impossible to obtain in the case of commercial products, making it difficult to adapt applications, as needed, to the available network environment.

c. Allow multiple independent and isolated networks to coexist: the deployment of one big network where all connected nodes can freely communicate is not well accepted, as demonstrated by the large number of devices that restrict connectivity in the Internet, i.e., firewalls. Having multiple networks with well-defined usage policies would relieve many concerns when deciding whether or not to participate in a grid. In LANs, VLAN technology [23] enables multiple networks to share the same physical installation. Only nodes that are members of the same VLAN are allowed to communicate, even when all nodes are connected to the same group of network switches. The use of VLAN technology beyond a LAN (or beyond campus networks) is challenging, as it requires compatible equipment (usually switches and routers from the same vendor) centrally managed by a team of administrators. In grid computing, the issue is aggravated by the fact that multiple administrative domains and multiple ISPs are involved: each site has an independent team of administrators with different equipment, services and management policies.

d. Platform independence: computational grids are heterogeneous by nature, aggregating resources with different architectures running several OS variants. A solution that requires software to be installed on all resources presents code portability issues.

e. Management: the definition and deployment of multiple virtual networks need to be managed through a well-defined interface. Moreover, the management task should be carefully divided between the network administrators of organizations (human resources) and grid middleware (automated software). Critical configuration, such as the definition of which machines are shared in the grid and of the sharing policies, needs to be performed manually by human resources. It can optionally be supported by scripts that automate part of the process, but final decisions need to be made by administrators. Grouping machines that agree to participate in a computational effort into a VN can be left to a fully automated system.

f. Security: a network infrastructure traditionally does not provide end-to-end security among nodes. True end-to-end security is only possible when implemented in applications. Instead, the infrastructure should provide mechanisms to maintain the isolation among deployed VNs and to guarantee that only authorized entities can participate in VNs and management tasks.

Grid Deployment Difficulties in the Internet

The Internet is a network of networks. It provides the packet routing infrastructure that allows the aggregation of LANs to form one big network of computers. Conceptually, every node connected to the Internet should be able to communicate with every other node. Private networks and

packet filtering, introduced due to IPv4 architecture limitations and security concerns, break the all-to-all communication model. However, all-to-all communication is necessary for grid nodes to perform collaborative computations. Many grid-related processes, including applications and services offered by middleware, act as servers and depend on the ability to receive and accept network connections initiated by client processes. Therefore, public and static IP addresses and appropriate permissions in firewalls become necessary conditions for a machine to participate in grid computing. Unfortunately, in many cases, those conditions cannot be offered by resource providers. Public IP address availability and proper permissions in firewalls, conditions that are increasingly difficult to satisfy, depend on the services offered by ISPs and also on the network management policies of each participating site.

Network management also suffers from connectivity limitations. In LANs and corporate networks, management is accomplished by monitoring and configuring network equipment (e.g., network switches and routers). Network equipment needs to be reachable so that management software can perform its tasks appropriately. In grid deployments, network equipment is likely to be unreachable. A small change in grid configuration (e.g., the addition of a node at a site) can potentially require the reconfiguration of equipment at all participating sites, which is clearly impractical, especially for large deployments.

Fortunately, there is no need for all machines on the Internet to communicate. Only machines participating in a collaborative effort, or a VO [1], are required to communicate. The all-to-all network model can be offered and contained to a selected group of machines on a per-VO basis. The isolation and protection offered should be sufficient to make much general-purpose packet filtering unnecessary within VOs.

VLAN technology is very successful in allowing the coexistence of multiple isolated networks in corporate and campus network environments. However, the difficulty of extending its use beyond a single administrative domain limits its application in grid computing, where several geographically distributed administrative domains participate. Another difficulty is how to distribute administration and management tasks, since core network components of each site would need constant reconfiguration.

Grids are very challenging in terms of network management. One possibility is to keep the system administrators of all participating sites in constant contact, so that required changes in network equipment configuration are effected in a timely manner. A second option is for all sites to release control of their network equipment to the grid middleware. Both options are impractical for obvious reasons, and also would not solve the connectivity issues in their entirety, since dependency on ISP services will always exist (e.g., no public IPs offered and/or filtering enforced at ISP gateways).

Network virtualization techniques, when applied properly, have the potential to allow control of the network in its virtual domain, exposing the same interfaces and services as physical network entities to machines without interference, making them a very attractive approach to providing the necessary network services for grid computing. To date, several network virtualization techniques have been proposed and implemented. The most significant ones are overviewed in the next subsections.

Virtual Network Approaches

Several virtual networking approaches have been proposed, implemented and deployed. These systems were developed to address limitations of physical networks, such as management and security. Each approach concentrates on addressing particular issues, and no single solution satisfies the grid requirements presented previously. Most network virtualization techniques are

based on IP for network routing and on the Ethernet data link layer. In general, they do not take advantage of high-speed specialized LANs such as Infiniband [24], Myrinet [25], Scalable Coherent Interface (SCI) [26], and 10Gbit Ethernet with TCP off-load engines (TOE) [27]. However, techniques based on tunneling can be modified to transmit encapsulated IP packets or Ethernet frames using high-speed LANs.

Virtual LAN (VLAN)

Ethernet LANs are, by design, broadcast networks, i.e., every node receives all transmitted network frames. The broadcast model was broken when Ethernet switches were introduced. Switches avoid broadcasts by learning which devices, each identified by a Media Access Control (MAC) address, are connected to each port. Switches inspect the Ethernet headers of transmitted frames and, based on the destination MAC address, decide to which port(s) a frame should be forwarded. Broadcasts are still used during the learning process. The information about the devices connected to each port of a switch expires periodically in order to accommodate changes in the network, e.g., a new device being connected or removed (a sketch of this learning behavior is given below). To allow the coexistence of multiple broadcast domains, new technologies, such as VLAN, capable of restricting broadcast traffic to a selected group of ports, were embedded into switches. A VLAN is essentially an Ethernet broadcast domain restricted to a selected group of switch ports. To avoid dedicating multiple inter-switch ports, as many as the number of deployed VLANs, network equipment manufacturers developed proprietary solutions for inter-switch communication: Cisco's Inter-Switch Link (ISL) protocol [28], 3Com's Virtual LAN Tagging (VLT) [29] and Cabletron's SecureFast [30] are the most representative examples. A standard to manage multiple broadcast domains was necessary, and the one adopted was IEEE 802.1Q [31]. The standard extended the Ethernet frame headers by 4 bytes in which VLAN information is embedded (see Figure 2-1).
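The learning behavior described above can be captured in a few lines of code. The following Java sketch is a simplified illustration (the port numbering, the 5-minute timeout and all names are assumptions, not taken from any switch implementation): the switch records the source MAC address of every frame it receives, and forwards a frame to a single port only when the destination has already been learned and the entry has not expired; otherwise the frame is flooded.

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative sketch of MAC-address learning in an Ethernet switch. */
    public class LearningSwitchSketch {
        private static final long TTL_MS = 300_000;                    // entries expire (assumed 5 min)
        private final Map<String, long[]> macToPort = new HashMap<>(); // MAC -> {port, timestamp}

        /** Called for every received frame: learn the source, then decide the output port(s). */
        public int forward(String srcMac, String dstMac, int inPort) {
            long now = System.currentTimeMillis();
            macToPort.put(srcMac, new long[] {inPort, now});           // learn/refresh source location
            long[] entry = macToPort.get(dstMac);
            if (entry == null || now - entry[1] > TTL_MS) {
                return -1;                                             // unknown/expired: flood all ports
            }
            return (int) entry[0];                                     // known: forward to one port only
        }

        public static void main(String[] args) {
            LearningSwitchSketch sw = new LearningSwitchSketch();
            System.out.println(sw.forward("aa:aa", "bb:bb", 1));       // -1: destination unknown, flood
            sw.forward("bb:bb", "aa:aa", 2);                           // bb:bb learned on port 2
            System.out.println(sw.forward("aa:aa", "bb:bb", 1));       // 2: no broadcast needed
        }
    }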

Figure 2-1. IEEE 802.1Q tag in Ethernet frame. To form 802.1Q frames, a 4-byte tag field is inserted into Ethernet frames. Note the 12-bit VLAN ID, which indicates the VLAN to which a frame belongs.

Figure 2-2. VLAN-aware network. Only machines that belong to the same VLAN can communicate, independently of the switch to which they are connected.

VLAN tags are inserted and removed automatically, as needed, by switches for ports configured in access mode. Ports configured in trunk mode carry tagged frames of multiple VLANs, allowing a VLAN to span multiple switches. Multiple isolated VLANs, each connecting a set of machines, can coexist in a physical LAN. IEEE 802.1Q is a layer-2 protocol, hence compatible with any layer-3 protocol, including IP. VLANs are configured through software that accesses VLAN-aware network devices (Figure 2-2). Unfortunately, the protocols used for device configuration are manufacturer dependent.
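As a concrete illustration of the frame format in Figure 2-1, the sketch below reads the VLAN ID out of a raw Ethernet frame. The constants come from the IEEE 802.1Q standard (the tag follows the two 6-byte MAC addresses, its Tag Protocol Identifier is 0x8100, and the low 12 bits of the following Tag Control Information field hold the VLAN ID); the class name and the dummy frame are illustrative.

    /** Sketch of reading the IEEE 802.1Q tag of Figure 2-1 from a raw Ethernet frame. */
    public class Dot1qSketch {
        /** Returns the VLAN ID, or -1 if the frame carries no 802.1Q tag. */
        public static int vlanId(byte[] frame) {
            if (frame.length < 18) return -1;                  // too short for a tagged frame
            int tpid = ((frame[12] & 0xFF) << 8) | (frame[13] & 0xFF);
            if (tpid != 0x8100) return -1;                     // EtherType field is not 802.1Q
            int tci = ((frame[14] & 0xFF) << 8) | (frame[15] & 0xFF);
            return tci & 0x0FFF;                               // 12-bit VLAN ID (0..4095)
        }

        public static void main(String[] args) {
            byte[] frame = new byte[64];                       // zeroed dummy frame
            frame[12] = (byte) 0x81; frame[13] = 0x00;         // TPID = 0x8100
            frame[14] = 0x00; frame[15] = 0x64;                // TCI: priority 0, VLAN ID 100
            System.out.println(vlanId(frame));                 // prints 100
        }
    }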

VLAN membership is port based: the switch port to which a machine is connected determines its membership. VLAN membership based on MAC addresses is also possible, but interoperability is an issue with the proprietary implementations of this approach.

Table 2-1. VLAN characteristics.

Full Connectivity: VLAN technology operates at layer-2 and does not deal with network packet routing. The technology was not developed to address connectivity issues. However, the inter-switch/router communication enabled by the standard allows a VLAN to cover a significantly large area (e.g., campus networks). For small grid deployments, 802.1Q is a very stable, established and mature technology that offers full connectivity among the members of a VLAN.

Network Programming Model: VLAN technology is application transparent, since applications, in general, interact with layer-3 and higher protocols and APIs. Since IEEE 802.1Q is implemented and embedded in network equipment and devices, applications run unmodified whether VLANs are deployed or not.

Support for Multiple Networks: VLAN technology was developed to allow multiple broadcast domains to share a single physical infrastructure. The adoption of a standard (IEEE 802.1Q) allowed inter-switch/router communication, and VLANs can be deployed over large areas. However, it is impractical to use the technology across LAN or campus network boundaries, especially when multiple organizations, each connected to the Internet through a different ISP, are involved.

Platform Independence: VLAN technology is embedded in network equipment and even in network interface cards. Any platform networked using Ethernet, independently of processor and OS, can participate in VLANs.

Management: Network equipment and devices hold the information about deployed VLANs, and management is performed by accessing their configuration interfaces. As the standard does not define how VLANs are managed, management interfaces are manufacturer dependent. When multiple organizations are involved, it is not realistic to expect all equipment to be from a single vendor. Added to the fact that each organization has its own network management policies, VLAN management in a grid becomes an almost impossible task.

Security: In the most common form, the membership of a device in a VLAN is determined by the switch port to which the device is connected. Each port of a switch is configured to give access to one of the deployed VLANs. In order to prevent malicious users from gaining access to a VLAN, access to the physical locations of network cable endpoints needs to be limited, which is in many cases unrealistic, since unplugging the network cable of a workstation is easy.

Virtual Private Network (VPN)

VPN [32] is a technology used to securely connect remote network sites (site-to-site or LAN-to-LAN configuration) or a machine to a network site (user-to-LAN configuration) using a

public and shared network (e.g., the Internet) as transport. VPN technology was developed with the goal of connecting private networks without the need for expensive dedicated connections. The basic concepts behind VPN technology are encapsulation, tunneling and encryption. VPN works at the IP layer, and packets that need to cross LAN boundaries are routed by VPN firewalls through encrypted tunnels. In the case of a machine accessing a LAN, the VPN client software running on the machine opens an encrypted tunnel to the target network's VPN firewall (Figure 2-3).

Figure 2-3. Site-to-site VPN and user-to-LAN VPN. In the site-to-site configuration, VPN firewalls establish an encrypted tunnel through which packets between machines in different private networks flow. In user-to-LAN VPN, a machine establishes an encrypted tunnel to a VPN firewall to gain access to a private network.

In site-to-site operation, VPN does not require network-related software to be installed on participating nodes. All VPN-related configuration is performed in the firewalls. In user-to-LAN operation, VPN client software is required, and users need to authenticate themselves to a VPN firewall. In both cases, IP-based applications are transparently supported.

VPN technology is well established and implemented in many network devices. With the help of management tools (e.g., McAfee's Secure Firewall CommandCenter [33]), VPN can efficiently manage (virtual) networks in an enterprise network. However, the need for

coordination among the network administrators of different sites limits the applicability of VPN technology in grids. VPN technology characteristics are summarized in Table 2-2.

Table 2-2. VPN characteristics.

Full Connectivity: VPN can be used to connect networks (including VLANs) that are not routed in the Internet. Machines on those networks are offered full connectivity with each other. However, the VPN-capable equipment itself requires full connectivity, and public and static IP addresses need to be used on that equipment. Some sites willing to participate in a grid might not be able to satisfy this condition, due to lack of support from ISPs.

Network Programming Model: In the site-to-site configuration, VPN-related processing is completely done in VPN firewalls (in general, the default gateway of the nodes). VPN operates at the IP layer, and applications are unaware of the presence, or not, of VPN tunnels. Applications work unmodified on VPN deployments. In the user-to-LAN configuration, a VPN client program is required to be installed on the machine connecting to a LAN. Machines on the target LAN do not need the software. Several VPN client software programs have been developed and implemented for most modern OS.

Support for Multiple Networks: VPN technology itself does not offer support for multiple networks, but it is possible to define multiple isolated networks by carefully configuring the involved VPN firewalls. The configuration process can be complex, lengthy, error prone and certainly impractical in grid deployments, since perfect synchronization among the network administrators of several sites is necessary.

Platform Independence: In the site-to-site VPN configuration, VPN firewalls handle the necessary network processing and any machine, independently of OS and processor, can participate in the VPN. In the user-to-LAN configuration, VPN software is required to be installed on client machines. In general, VPN software programs are available for most modern OS.

Management: VPN management is not simple; it is controllable in small networks or in well-defined, static (or rarely changing) enterprise networks. Global network management tools are only useful if all managed equipment is compatible, a scenario very unlikely in grid deployments.

Security: Security was always carefully designed and implemented in VPN projects. The use of encrypted tunnels and strong authentication mechanisms among the involved parties makes the technology resilient to many forms of network-related attacks, but misconfigured firewalls can easily compromise VPN security.

VNET - Northwestern University

VNET [34][35] is a network virtualization technology developed as part of the Virtuoso [36] Virtual Machine (VM) grid computing system, and it is used to provide connectivity between Virtuoso-deployed VMs and their respective clients. VMs requested by Virtuoso clients are started on Virtuoso

servers and bridged to the client's LAN through VNET. VNET provides a virtual network wire that connects VMs on Virtuoso servers to the client's LAN. VNET user-level proxies (VNET daemons) are started on the VM host servers and on the Virtuoso client's network. VNET proxies monitor Ethernet traffic in order to capture VM-related frames and transfer them from the VM's host to the client's LAN and vice-versa. Instead of connecting VMs to the network environment of the VM host servers, they are attached to the client's network through a long virtual network cable provided by VNET.

Table 2-3. VNET characteristics.

Full Connectivity: VNET attaches VMs hosted at provider sites to the client's network. Machines on the client's network are fully connected to VMs running at the provider's site. Connecting VMs from several providers geographically dispersed in the Internet, although possible, is not appealing in terms of performance. Nodes running VNET proxies are required to have full connectivity with each other; VNET does not deal with connectivity recovery. Thus, in a way similar to VPN, VM providers and clients might not be able to use VNET due to service limitations from ISPs.

Network Programming Model: VNET proxies need to be running on a machine connected to the client's network and on the VM host machines. No additional software is necessary on other machines, and applications run unmodified. At work, VNET looks very similar to user-to-LAN VPN, and there is no apparent advantage of VNET over the well-established VPN technology. One difference is that VNET operates at the data-link layer, and it can support the very special but rare applications that access the data-link layer directly.

Support for Multiple Networks: Support for multiple networks was not one of the design goals of VNET. For a given set of sites and VNET proxies, only one virtual network can be active.

Platform Independence: VNET proxies handle the necessary virtual network processing and offer platform independence to participating nodes.

Management: VNET configuration and management complexity is comparable to that of VPN. One complication, in the case of VNET, is that more network sites (VM providers and client networks) are involved. The use of VNET in large grid deployments is challenging due to the need for coordination among site administrators.

Security: In its most complete form, VNET tunnels use Transport Layer Security (TLS) [37], and VNET is protected against many types of malicious attacks. From the perspective of site network administration, VNET deployment may raise many security concerns due to its user-level tunneling, which can potentially bypass security policies enforced on physical network equipment.

VNET operates at layer-2, intercepting VM-related Ethernet frames on the VM host servers and transferring them to the target network. The software agents performing the task of transferring Ethernet frames to the appropriate networks are called VNET proxies (Figure 2-4). VNET characteristics are summarized in Table 2-3.

Figure 2-4. VMs connected to their owners' (Virtuoso client) network. One VNET daemon (proxy) is required to run on all VM host servers, and one VNET daemon is required in each client LAN (i.e., the LANs where the owners of VMs are connected). VNET daemons on VM hosts capture L2 frames relevant to VM communication, transferring them to the VNET daemon in the appropriate client LAN.

SoftUDC vnet - HP Labs

The term vnet¹ in the context of the project on a software-based Data Center for utility computing (SoftUDC) [38] refers to a virtual private local area network that connects VMs. The SoftUDC Virtual Machine Monitor (VMM) controls VM network traffic via a kernel module present in a special virtual machine (the domain-0 VM, in Xen VMM terminology [39]). This kernel module encapsulates Ethernet frames generated on VMs and transfers them to another kernel module running on the server that hosts the destination VM. This technology creates multiple isolated networks of VMs, distributed over VM host servers.

¹ In order to differentiate it from the VNET project developed at Northwestern University, SoftUDC vnet is written in lower case, although both projects use capital letters. The choice of lower case for SoftUDC vnet is due to its special daemon process (vnetd) being written in lower case.

vnet operates at layer-2, intercepting Ethernet frames generated in VMs and transferring them to the destination. Since domain-0 takes care of all frames to and from the user domains (domain-N), having a kernel module that tunnels and transports frames destined to VMs hosted on remote servers is an elegant solution for the virtual networking of Xen VMs. The vnet kernel module maintains a data structure that maps virtual network interface (VIF) MAC addresses to the domain-0 hosting each VIF (called the care-of address). The data structure is automatically maintained via the Virtual Address Resolution Protocol (VARP), designed for vnet in order to discover VIF care-of addresses (Figure 2-5).

Figure 2-5. Multiple isolated private LANs of Xen-based VMs enabled by vnet. VMs can be hosted on different servers. Ethernet frames generated by VMs are captured by the vnet kernel module running in domain-0. Captured frames are encapsulated into the vnet format, receiving a vnet tag (ID) in the header, and transmitted to the server hosting the destination VM via regular TCP/IP communication.

vnet transport and VARP both use multicast. To be independent of multicast routing in wide-area deployments, a special-purpose daemon called vnetd was developed. Each LAN hosting VIFs needs to have one vnetd running, which forwards local vnet multicasts to its peers and also resends received multicasts locally.
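A simplified sketch of the VIF-to-care-of-address table that VARP maintains is shown below. It is an illustration under assumed names and types, not vnet's actual code, and the multicast resolution step is reduced to a comment.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Sketch of a VARP-style table mapping VIF MAC addresses to care-of addresses. */
    public class CareOfTableSketch {
        private final Map<String, String> vifToCareOf = new ConcurrentHashMap<>();

        /** Learn (or refresh) a mapping, e.g., from a VARP reply. */
        public void learn(String vifMac, String careOfIp) {
            vifToCareOf.put(vifMac, careOfIp);
        }

        /** To which server should a frame destined for this VIF be tunneled? */
        public String resolve(String vifMac) {
            String careOf = vifToCareOf.get(vifMac);
            if (careOf == null) {
                // Unknown VIF: a real implementation would multicast a VARP query
                // here (forwarded across LANs by vnetd) and queue the frame.
            }
            return careOf;
        }

        public static void main(String[] args) {
            CareOfTableSketch varp = new CareOfTableSketch();
            varp.learn("00:16:3e:00:00:01", "10.0.5.20");          // VIF hosted by that domain-0
            System.out.println(varp.resolve("00:16:3e:00:00:01")); // 10.0.5.20
        }
    }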

vnet characteristics are summarized in Table 2-4.

Table 2-4. vnet characteristics.

Full Connectivity: vnet provides full connectivity among Xen-based VMs that belong to the same virtual network. However, in order to provide such functionality, all involved domain-0 VMs and vnet daemons (vnetd) are required to be connected to one another. The requirement is easily satisfied in LAN deployments, but in grid deployments it is a big problem, as previously discussed.

Network Programming Model: No modification or additional software is required in participating Xen VMs, and applications should run unmodified, but vnet is only supported in SoftUDC infrastructures.

Support for Multiple Networks: vnet is similar in concept to VLAN technology and supports multiple isolated networks. VMs can be isolated from each other (in terms of network connectivity) even if they are hosted on the same server.

Platform Independence: vnet supports only Xen-based VMs.

Management: VARP automatically keeps track of deployed vnets and of the necessary mappings between VIFs and care-of addresses. The deployment of VIFs (and of the Xen VMs using them) still needs to be defined manually by the SoftUDC administrator.

Security: vnet communication (transport and control) makes use of the IP Security (IPSec) protocol for message authentication and confidentiality. As with the Northwestern University VNET, vnet can bypass network security policies that are enforced in physical network equipment. In addition, vnet traffic is not visible to network monitoring devices.

VIOLIN - Purdue University

VIOLIN, which stands for Virtual Infrastructure on OverLay INfrastructure [40][41], creates isolated virtual networks of VMs on top of an overlay infrastructure. The overlay infrastructure is assumed to give connectivity to the VM host servers. The main component of VIOLIN is the user-level virtual network switch, which runs on every VM host server and is responsible for capturing and transferring Ethernet frames to and from VMs. VIOLIN supports User Mode Linux (UML [42]) and Xen VMMs. It is closely related to SoftUDC vnet, in the sense that a piece of software in charge of controlling VMs runs on the host system (the domain-0 VM) and routes packets generated by the VMs (Figure 2-6). The main difference between the two technologies is that vnet operates in kernel mode while VIOLIN operates in user mode.

Figure 2-6. Private LANs of VMs created by VIOLIN. Ethernet frames generated in VMs are captured in domain-0 by the VIOLIN daemon (virtual switch). Captured frames are transferred to the destination VM host through UDP tunnels established among the VIOLIN switches.

VIOLIN instantiates user-mode VN switches responsible for routing VM-generated Ethernet frames. Communication between virtual switches is established through UDP tunnels, as sketched below. VIOLIN characteristics are summarized in Table 2-5.

Table 2-5. VIOLIN characteristics.

Full Connectivity: VMs within VIOLINs are offered full connectivity with respect to each other. Connectivity to and from the Internet is not available to the VMs. VIOLIN virtual switches require full connectivity to one another, which may become a blocking factor for grid deployments.

Network Programming Model: No modification or additional software is required in participating VMs, and applications should run unmodified, but VIOLIN is only reported to work with User Mode Linux (UML) and Xen VMMs.

Support for Multiple Networks: VIOLIN virtual switches are not implemented to handle multiple networks. Instead, a set of connected VIOLIN switches defines a virtual network. A VM host server must instantiate multiple VIOLIN switches in order to support VMs connected to different VIOLINs.

Platform Independence: VIOLIN supports UML- and Xen-based VMs.

Management: VIOLIN virtual switches need to be manually started and configured on VM host servers, and VMs need to be manually configured to bridge to the appropriate switch. No service to facilitate the configuration process is currently provided.

Security: VIOLIN membership depends on the switch to which a VM is bridged, which is determined at VM configuration time. No mechanisms are provided to prevent VMs from being connected to a wrong switch or to prevent malicious switches from being integrated into a VIOLIN. The VIOLIN transport also does not provide message authentication and confidentiality.
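The UDP tunneling between user-level virtual switches can be illustrated with a short sketch: a captured Ethernet frame is prefixed with a virtual-network identifier and sent inside a UDP datagram to the peer switch. The 4-byte header below is an invented layout, not VIOLIN's actual wire format; the point is only to show the encapsulation step.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    /** Sketch of encapsulating a captured Ethernet frame into a UDP tunnel. */
    public class UdpTunnelSketch {
        public static void sendFrame(DatagramSocket sock, InetAddress peer, int peerPort,
                                     int vnId, byte[] frame) throws Exception {
            byte[] buf = new byte[4 + frame.length];
            buf[0] = (byte) (vnId >>> 24);                     // encapsulate: VN ID header...
            buf[1] = (byte) (vnId >>> 16);
            buf[2] = (byte) (vnId >>> 8);
            buf[3] = (byte) vnId;
            System.arraycopy(frame, 0, buf, 4, frame.length);  // ...followed by the raw frame
            sock.send(new DatagramPacket(buf, buf.length, peer, peerPort));
        }

        public static void main(String[] args) throws Exception {
            try (DatagramSocket sock = new DatagramSocket()) {
                byte[] fakeFrame = new byte[60];               // stand-in for a captured frame
                sendFrame(sock, InetAddress.getByName("127.0.0.1"), 9999, 7, fakeFrame);
            }
        }
    }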

X-Bone - University of Southern California

X-Bone [43][44] is a distributed system that automates the creation, deployment and management of IP overlay networks. X-Bone is based on two main components: Overlay Managers (OMs), which are responsible for resource discovery and overlay network deployment, and Resource Daemons (RDs), which configure and monitor hosts and routers (Figure 2-7).

Figure 2-7. Managed overlay networks in X-Bone. X-Bone configures and manages end-to-end overlay networks with different topologies on top of the physical network. All resources participate in overlay routing (thus the name end-to-end overlays), establishing the necessary tunnels amongst themselves. Tunnel establishment and the configuration of overlays are controlled by two processes that run on all resources. Overlay Managers (OMs) are responsible for overlay network deployment and resource discovery, while Resource Daemons (RDs) are responsible for configuring and monitoring resources.

X-Bone creates end-to-end overlays, and all resources need to run an OM and an RD. The current implementation supports Linux and FreeBSD systems. The OM starts the creation of an overlay when it receives a request from a client, i.e., an end user acting through a graphical interface or through an API call. A multicast discovery message is sent in the form of an invitation to RDs, which respond with unicast messages indicating their willingness to participate when their capabilities, available resources and permissions match the request. The OM selects adequate resources and sends specific configuration information (e.g., tunnel end point addresses, and

routing tables) to each selected RD. Finally, the RDs apply the received configurations to the resources, and the overlay network is deployed. X-Bone characteristics are summarized in Table 2-6.

Table 2-6. X-Bone characteristics.
- Full Connectivity: X-Bone is an overlay network management infrastructure and does not address connectivity issues. Resources in deployed overlays are offered full connectivity. X-Bone depends on all resources, including OMs and RDs, being fully connected in the base network.
- Network Programming Model: IP-based applications run unmodified on X-Bone-deployed overlay networks. X-Bone resources can be Linux or FreeBSD (with some kernel patches or modifications).
- Support for Multiple Networks: X-Bone can deploy multiple overlay networks, and X-Bone resources can participate in more than one overlay at a given time.
- Platform Independence: Linux and FreeBSD are supported.
- Management: X-Bone is an overlay network management infrastructure specifically designed to dynamically define, deploy and monitor multiple overlay networks.
- Security: X-Bone messages are protected using X.509 certificates [45] and TLS. In order to reconfigure resources, X-Bone daemons (RDs) require administrative privileges on the resources, which can be a negative factor in some grid deployments.

RON - Massachusetts Institute of Technology

Resilient Overlay Network (RON) [46] is an application-layer overlay on top of the Internet. Its main focus is on resilience: overcoming path outages and performance failures without introducing excessive overhead or new failure modes (Figure 2-8). All nodes participating in a RON need the RON libraries/router, and the RON infrastructure aggressively monitors the quality of the paths among peers. Each path is classified according to its performance, and the decision to use the Internet path or an alternative path is made depending on the conditions of the paths at packet send time.

Figure 2-8. RON improves the robustness of Internet paths. RON can detect path failures by constantly monitoring the quality of paths. When an outage is detected (e.g., A to C), RON provides alternative paths (e.g., A to B to C).

Table 2-7. RON characteristics.
- Full Connectivity: RON operation is negatively affected by the presence of firewalls: RON network-monitoring probe packets may be blocked by firewalls, which would be misinterpreted as a network outage. In such cases, RON may route traffic through alternate paths that are not optimal. RON does not establish communication between two nodes behind firewalls.
- Network Programming Model: RON exports an API so that new applications can benefit from the added resilience. Existing applications are required to use the RON IP forwarder implementation.
- Support for Multiple Networks: Although RON was not designed to support multiple networks, distinct sets of machines can define independent RONs; however, a given machine cannot participate in more than one RON.
- Platform Independence: The RON libraries, written in C++, need to be built.
- Management: RON membership can be statically or dynamically managed. In the dynamic membership protocol, each RON node periodically floods its list of peers to all other nodes. To bootstrap the process, each RON node needs to know at least one peer. RON membership is not controlled, in the sense that any node willing to join a RON just needs to know one node that is already a member.
- Security: RON packets are not authenticated, and RON is vulnerable to many types of network attacks. No mechanism is provided to authenticate a node or to block it from joining a RON.

Using the RON API, application writers can develop resilient network programs. RON is based on a C++ library with potential portability issues. IP-based network applications are supported through the RON implementation of an IP forwarder, which uses FreeBSD divert sockets to send IP traffic over RON. Each machine needs to be carefully configured, and

the addition or departure of a host affects the operation of all participants, which adds management complexity. RON does not scale well: it is reported to support up to 50 nodes. RON characteristics are summarized in Table 2-7.

Peer-to-peer (P2P) overlay networks

In P2P systems, all clients provide resources, including storage, computation and bandwidth. In the ideal case, the addition of nodes increases the overall performance. P2P overlay networks provide all the necessary network services for participating nodes: every participating peer acts as a network node and exercises routing functions. P2P networks are classified as unstructured or structured, depending on how overlay links are established among peers. When overlay links are established arbitrarily, a P2P network is called unstructured; many file-sharing systems (e.g., Gnutella [47] and FastTrack [48]) fall into this category. When overlay links are established following a pattern (e.g., a distributed hash table, or DHT), a P2P network is called structured. Examples of structured P2P networks are Chord [49], Content Addressable Network (CAN) [50] and Tapestry [51]. Hybrid implementations that use both structured and unstructured overlay links, depending on the network service, also exist (e.g., Brunet [52]).

An interesting application of Chord is the Internet Indirection Infrastructure (i3) [53], an overlay and rendezvous-based communication abstraction. In i3, each packet is associated with identification information before entering i3, and receivers use the identification to obtain delivery of the packet. By decoupling the act of sending from the act of receiving packets, i3 supports services such as multicast, anycast and mobility.

There are many implementations of P2P networks, and inevitably developers solve similar problems and duplicate core network service implementations. Project JXTA [54] is an open network computing platform that tries to develop standardized basic building blocks and services

for P2P computing. JXTA standardizes protocols that provide peer discovery, network service discovery, self-organization, peer communication, and monitoring.

Most P2P networks address the connectivity issues on the Internet, but only new applications can benefit from the services offered: existing applications need to be modified to use P2P APIs, and each P2P network has its own set of APIs. Solutions that seek to support legacy applications on P2P networks include the Overlay Convergence Architecture for Legacy Applications (OCALA) [55] and IP-over-P2P (IPOP) [56] research projects, and the commercially available LogMeIn Hamachi [57].

One drawback of P2P networks is that every participating node is required to run P2P routing software. This raises portability issues for P2P software on grids, whose resources are highly heterogeneous, and it also puts additional load on machines for network processing. Characteristics of P2P networks are summarized in Table 2-8.

Table 2-8. P2P networks characteristics.
- Full Connectivity: P2P networks, in general, are designed to overcome connectivity limitations on the Internet, so that peers behind firewalls can also participate.
- Network Programming Model: Several P2P system projects exist, and each system exports a different set of APIs. Existing programs, based on sockets, are not supported.
- Support for Multiple Networks: P2P systems seek the establishment of large networks, and support for multiple networks is not one of their design goals.
- Platform Independence: P2P routing software is required to run on all resources. As such, portability issues always exist in P2P systems.
- Management: P2P systems are, in general, unmanaged networks where no control or restrictions are enforced on the joining or leaving of peers.
- Security: The unmanaged nature of P2P networks raises several security issues when they are applied to grid computing.

OCALA - University of California, Berkeley

OCALA bridges legacy applications to overlay networks by presenting an IP interface to applications and tunneling the traffic over overlay networks. OCALA supports simultaneous

access to different overlays: for example, a host may browse the Internet, chat through i3, and SSH through RON. OCALA [55] consists of an Overlay Convergence (OC) layer (a network layer that replaces IP) positioned below the transport layer, which exports an IP-like interface to applications. OC is composed of two sub-layers: an overlay-dependent sub-layer (OC-D) and an overlay-independent sub-layer (OC-I). OC-I interacts with applications, while OC-D interacts with overlay networks. In order to support a new overlay network, a corresponding OC-D module needs to be developed. Many OC-D modules, including ones for i3 and RON, are available (Figure 2-9).

Figure 2-9. Legacy applications are bridged to multiple overlay networks using OCALA. A secure file transfer can be done using i3 and a web server accessed through RON, while e-mails are checked via the regular Internet.

OCALA enables legacy applications to use newly developed overlay networks; it does not add services to the underlying networks. OCALA characteristics are summarized in Table 2-9.

Table 2-9. OCALA characteristics.
- Full Connectivity: OCALA does not address connectivity issues. It depends on the network it is using for transport.
- Network Programming Model: OCALA was specifically developed to bridge existing applications to overlay networks without modification.
- Support for Multiple Networks: OCALA allows a machine to participate in multiple different overlay networks at the same time.
- Platform Independence: Windows, Linux and Mac OS X are supported.
- Management: Management needs to be handled by the underlying overlay networks.
- Security: Security depends on the underlying overlay networks.

IPOP - University of Florida

Table 2-10. IPOP characteristics.
- Full Connectivity: Network connectivity between peers is addressed by the Brunet P2P system.
- Network Programming Model: IPOP supports existing applications without modification. IPOP bridges legacy applications to Brunet.
- Support for Multiple Networks: The concept of an IPOP namespace, introduced in [60], allows mutually isolated virtual networks over a single Brunet deployment.
- Platform Independence: Linux on x86 is the current development platform.
- Management: The IPOP management design goal is to have decentralized and automated mechanisms. Pre-configured system-VM-based appliances are used in IPOP deployments to avoid manual configuration. Brunet is used as a distributed data repository to store configuration and management information.
- Security: Security aspects of IPOP and Brunet are work in progress. At the current stage of the IPOP project, mechanisms to securely join Brunet or an IPOP namespace are not provided.

IPOP [56] is closely related to OCALA, as both enable legacy IP-based applications to transparently use P2P networks. While OCALA's design and implementation are modular, to support a number of P2P implementations, only the Brunet P2P network is supported by IPOP. IPOP and Brunet are ongoing research projects. Currently, development concentrates on adding services to support high-throughput computing based on VMs [58][59]. Important services such

as the creation of multiple namespaces (to support multiple isolated networks) and the authenticated joining of nodes to networks (based on IPSec and X.509 certificates) are under development. Characteristics of IPOP are summarized in Table 2-10.

LogMeIn Hamachi

LogMeIn Hamachi [57] is a UDP-based P2P system that establishes links between peers by means of a third node called the mediation server. Connections are bootstrapped based on information collected by the mediation server, which gets out of the way once a link is established. Connected peers form virtual private networks. LogMeIn Hamachi authenticates peers in a process based on public-key cryptography; the symmetric keys used to encrypt tunnels are also established during the authentication process. A graphical user interface (GUI) facilitates the creation and management of networks. Characteristics of LogMeIn Hamachi are summarized in Table 2-11.

Table 2-11. LogMeIn Hamachi characteristics.
- Full Connectivity: All peers are offered full connectivity (although, reportedly, Hamachi cannot traverse some firewalls).
- Network Programming Model: Hamachi supports existing applications without modification.
- Support for Multiple Networks: Users are allowed to create independent networks.
- Platform Independence: A GUI is offered for Windows. Console versions are available for Linux and Mac OS X.
- Management: Management, in terms of the creation of networks, is under the full control of users. Membership control is the responsibility of the user who created the network.
- Security: Peers are authenticated to Hamachi using a public-key infrastructure. All communication in Hamachi is encrypted.

Summary

Adding features to, or replacing, the existing network services offered by the Internet in order to support grid computing is impractical, given the amount of network equipment that needs to

be updated. Applying virtualization techniques to networks, leaving the existing Internet infrastructure intact, is an attractive solution for grid computing. A good virtual network solution for grid computing would offer full network connectivity between participants, as in P2P networks, and support the coexistence of multiple VLAN-like networks that can cross LAN boundaries via VPN-like tunnels. Virtual networks need to be managed securely, via X-Bone-like or Hamachi-like management interfaces. Past projects did not specifically target grid environments, and no comprehensive solution adequate for grid computing exists.

To design a virtual network for grid computing, the design space needs to be carefully explored. The next chapter studies aspects of virtual network architecture design, used as a guideline for the ViNe architecture.

CHAPTER 3
ON THE DESIGN OF VIRTUAL NETWORKS

Previous chapters presented the motivation for applying virtualization techniques to provide the necessary network services, and the desirable features of a network infrastructure for grid computing. This chapter explores the architectural design space of such a network infrastructure to incorporate essential features for grid computing, and presents a virtual network architecture called ViNe [21], justified by the design study. The ViNe architecture differs from existing solutions in its design goal: to augment the services offered by grid fabric network resources and to improve their manageability, with minimal changes in infrastructure components. Vital and desirable features of a grid network infrastructure are identified, and existing implementation approaches for each feature are studied. This study forms the basis for the ViNe architecture design. Prototype implementation details and preliminary performance are reported at the end of the chapter.

Virtual Networks for Grid Computing

In order to design a general-purpose network infrastructure that can be used in different grid deployments, several aspects need to be considered: the network address space dictates how applications identify nodes and how the OS and applications interface to the network; the network interface implementation has an impact on the platform independence of a solution; routing influences the performance of the system; and firewall issues need to be addressed when using the Internet as a transport.

Network Address Space

IP addresses, in particular IPv4 addresses, are used to identify nodes in computational grids. Applications are designed and implemented to use IP addresses as communication end-point identifiers. The IPv4 address space is a 32-bit space, possibly insufficient to identify all existing network-capable devices. Private sub-spaces and network address translation (NAT)

techniques allow multiple devices to share the same IP addresses on the Internet by dynamically mapping private addresses to public addresses (and vice-versa). NAT allows the Internet to scale beyond 2^32 devices.

The Internet routing infrastructure currently runs optimized for IPv4 traffic. Replacing IPv4 is a very slow process, as changes to infrastructure equipment and to application software are required. IPv6 [61], a carefully designed network protocol specified in 1995 that addresses many limitations of IPv4, is still not widely deployed. Like IPv6, P2P systems define larger address spaces (in general, 128- to 256-bit spaces) compared to IPv4, but existing applications need to be modified to take advantage of P2P systems. IPv4 is essential to support legacy applications; thus, the natural choice of address space for a virtual network for grids is the IPv4 space.

Virtualizing the entire IPv4 space is possible by making independent virtual 32-bit address spaces available to each application, much the way an OS handles memory virtualization. However, offering independent network address spaces on a per-application or per-process basis would require changes to the OS kernel, which is undesirable. If network address space virtualization is handled at the routing level (network layer), independent virtual address spaces can be defined on a per-node basis. In general, grid nodes utilize existing Internet services, and whole-space virtualization can deny nodes access to these services. One option is to use sub-spaces not routed in the Internet, which include the private IP address spaces [62] (see Table 3-1) and unassigned spaces. Note that private spaces are not routed, but they are certainly routable in the Internet: those spaces are in fact routed inside corporate and campus networks, while Internet routers are configured not to route traffic that belongs to private spaces.

The use of private spaces as virtual identities of nodes can be a problem, since many private networks are already deployed, and resources joining a grid typically belong to private networks.

The ideal solution would be to have one of the currently unassigned spaces assigned to grid virtual network traffic, in a similar way that the private spaces have been reserved by the Internet Assigned Numbers Authority (IANA) [63].

Table 3-1. Private IP address space.
- 10.0.0.0 to 10.255.255.255 (network mask 255.0.0.0): 24-bit block
- 172.16.0.0 to 172.31.255.255 (network mask 255.240.0.0): 20-bit block
- 192.168.0.0 to 192.168.255.255 (network mask 255.255.0.0): 16-bit block

Network Interface

Hosts may have their connectivity limited by firewalls, and the use of real (physical) network identifiers is not possible when symmetric communication or unique addresses are required. In order to participate in virtual networks, new network identifiers, independent of the physical ones, are necessary. As previously discussed, defining new kinds of network identifiers is not attractive, as many programs are implemented using IPv4 APIs. Reconfiguring the existing IP address so that a host can participate in virtual networks can block the host from accessing Internet services. Additional IP address(es) thus become necessary for hosts to participate in virtual networks while keeping their original Internet connectivity intact. When Internet services are not needed, hosts can be configured only with virtual network addresses.

The most direct way to add an additional IP address to a host is to plug in a physical network interface card (NIC) and configure it with the desired IP address. Many computers, especially server products, are configured by manufacturers with two NICs, and often only one is used, leaving an extra NIC for virtual network use. In terms of cost, this is the most expensive option: even if a free NIC is found, the costs of cabling and switches need to be considered. Software solutions, in general, are low cost and do not involve hardware expenses.

There are several ways to virtualize a network interface by software means. One method is to intercept packets before the physical medium is reached; intercepted packets can then be modified and routed according to virtual networking needs. The best place to intercept packets is in the OS kernel network stack. The disadvantages of doing so are the difficulties of kernel programming, added to the portability cost, i.e., the need to implement and maintain software for a variety of OSs. Another option is to intercept OS networking calls and modify their behavior to implement the desired functions. This approach can potentially degrade the performance of the machine and also suffers from portability costs. A complete virtual network interface card that mimics the functionality of a physical NIC can also be implemented. Virtual NICs are software components, in general implemented as OS kernel modules, that emulate hardware NICs. The use of universal devices such as the TUN and TAP drivers [64] is a possibility, but this requires the installation of TUN/TAP packages on all hosts, which is not supported by all OSs. In the case of system-level virtual machines, support for multiple (virtual) NICs is offered by VMMs. A very simple, inexpensive, and efficient solution is the use of IP aliasing: the capability of binding multiple IP addresses to one physical NIC. Using IP aliasing, hosts do not require any additional software installation in order to be ready for virtual networking.

Routing

Informally, routing in computer networks is the act of finding a path, from a source node to a destination node, on which IP packets can travel. Routing of IP packets starts on the nodes that transmit packets. Based on the destination IP address of the datagram to be transmitted, the OS checks its routing table, which contains entries indicating subnets that can be directly reached (i.e., LAN connections) and other entries specifying an intermediate node that can deliver packets destined to particular subnets. A special entry is the default gateway: all packets

whose destination does not match the entries in the routing table are forwarded to the default gateway (see Figure 3-1).

Figure 3-1. OS routing tables. The route command lists the contents of the OS routing tables (Windows on top, Linux on bottom). The list is ordered by network size: single nodes are listed first and the default gateway is last. The table is inspected sequentially until a match is found.

Default gateways, or routers, are network equipment specialized in routing. The Internet consists of a large number of routers, organized into interconnected autonomous systems (ASs) [65]. An autonomous system, also referred to as a routing domain, is a collection of networks and routers administered by one organization that presents a common routing policy. Routers need to collaborate with each other in order to accomplish their mission of delivering packets. Simply put, packets are forwarded from one router to another through a path determined to be the best, getting closer to the destination until they are delivered.
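As an illustration of the sequential inspection described in Figure 3-1, the toy sketch below (with hypothetical entries; real OS tables also track interfaces and metrics) scans an ordered table and falls through to the default route listed last:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of sequential routing-table inspection: entries are
// ordered most-specific first, and the default route (mask 0) matches
// anything that falls through to the end of the table.
public class OsRouteLookup {
    record Route(int network, int mask, String nextHop) {}

    static List<Route> table = new ArrayList<>();

    static String nextHop(int dstAddr) {
        for (Route r : table) {
            if ((dstAddr & r.mask()) == r.network()) return r.nextHop();
        }
        return null; // unreachable when a default route is present
    }

    // Packs four octets into a 32-bit IPv4 address.
    static int ip(int a, int b, int c, int d) {
        return (a << 24) | (b << 16) | (c << 8) | d;
    }

    public static void main(String[] args) {
        table.add(new Route(ip(192, 168, 1, 0), ip(255, 255, 255, 0), "direct (LAN)"));
        table.add(new Route(0, 0, "gateway 192.168.1.1")); // default, listed last
        System.out.println(nextHop(ip(192, 168, 1, 7))); // direct (LAN)
        System.out.println(nextHop(ip(8, 8, 8, 8)));     // gateway 192.168.1.1
    }
}
```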

Routing algorithms are used to determine the paths, while routing protocols are used to gather and distribute the relevant information for routing processing. Inside ASs, Interior Gateway Protocols (IGPs) are used; Routing Information Protocol (RIP) [66] and Open Shortest Path First (OSPF) [67] are the most widely used IGPs. Exterior Gateway Protocols (EGPs) handle inter-domain routing between ASs, with the Border Gateway Protocol (BGP) [68] being the standard protocol currently in use. These protocols define the information to be exchanged between routers and how the exchanges occur.

Figure 3-2. Internet routing versus virtual network routing. In the Internet, routers constantly exchange information so that packet-forwarding decisions can be made. The routing path for a given source and destination can change over time, depending on the conditions of the links between routers. A message sent from host A to host B needs to traverse several routers. In the case of virtual network routing, VRs see direct links to all other VRs. A message from host C to host D will appear as being routed by only two VRs, but in reality several Internet routers are traversed.

Modifying this complex Internet routing infrastructure to make it handle virtual network traffic is not practical. Still, the use of the Internet infrastructure as a means of transport is very attractive, since the Internet is practically ubiquitous. Instead of changing infrastructure equipment, a virtual routing infrastructure, with additional routers specialized in routing virtual network traffic, needs to be implemented. These additional routers, called virtual routers (VRs),

should encapsulate packets that belong to virtual networks and route them using the Internet infrastructure. A fundamental difference between Internet routers and VRs is that each VR can potentially have links to all other VRs (depending on the connectivity provided by the Internet). This allows VRs to operate with simple routing tables, without requiring complex routing algorithms and protocols. Figure 3-2 compares Internet routing and virtual network routing.

The majority of network virtualization technologies, if not all, use the Internet to transport encapsulated datagrams. Encapsulation is the technique that enables the movement of virtual network datagrams across the Internet. Encapsulation is not particular to virtual networks; it is applied in all network stack designs (see Figure 3-3 and Figure 3-4).

Figure 3-3. Encapsulation used in TCP/IP over Ethernet. In Ethernet-based TCP/IP networks, messages from applications are encapsulated into TCP (or UDP) packets, which in turn are encapsulated into IP datagrams. Finally, IP datagrams are encapsulated into Ethernet frames for transmission.

Figure 3-4. Virtual network datagram. Virtual network IP headers have addresses that are not routed in the Internet. Network virtualization works by intercepting those packets before they enter the Internet. Intercepted packets are encapsulated into VN datagrams, which in turn are transmitted as regular messages on the Internet (between VRs).
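To make the encapsulation concrete, the minimal sketch below wraps an intercepted IP datagram in a toy ViNe header consisting only of the 32-bit ViNe ID and ships it to a destination-VR as a regular UDP message. This is an illustration only: the actual ViNe header carries additional routing information, and the addresses and port are hypothetical.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;

// Sketch of VN-datagram encapsulation: the original IP packet is carried
// unmodified as the payload of a regular UDP message between VRs.
public class Encapsulator {
    static void sendToVR(byte[] ipPacket, int vineId,
                         InetAddress destVR, int port) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(4 + ipPacket.length);
        buf.putInt(vineId); // toy ViNe header: just the 32-bit ViNe ID
        buf.put(ipPacket);  // intercepted IP datagram, untouched
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.send(new DatagramPacket(buf.array(), buf.position(), destVR, port));
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] fakePacket = new byte[20]; // placeholder for an intercepted packet
        sendToVR(fakePacket, 8000, InetAddress.getByName("127.0.0.1"), 5000);
    }
}
```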

In the Internet, hosts do not participate in the routing process, in the sense of executing routing algorithms and protocols. Packet delivery in LANs is handled by L2 switches or by broadcast mechanisms at the data-link layer. When a LAN boundary needs to be crossed, hosts simply forward packets to the appropriate router, configured in the OS routing tables. Tunneling-based technologies configure additional route entries in participating hosts; for example, hosts using a VPN are configured to forward VPN-related traffic to VPN routers. In P2P systems, all nodes act as routers: routing information is distributed among participating hosts, which need to cooperate with each other to establish communication flows. In this work, infrastructures that follow the Internet model for routing are referred to as router-based systems. When routing requires the participation of all nodes, the infrastructure is referred to as a Fully-Decentralized Routing (FDR) system.

When comparing router-based systems with FDR systems, the obvious difference is that in systems with specialized routers, hosts can concentrate on computing without having to process network routing. Intuitively, the overhead of additional network processing in all FDR system nodes degrades communication performance, even among nodes with direct LAN connectivity. This routing processing can also have an impact on application performance. In the case of router-based systems, the need for dedicated routing resources can increase the overall cost. One advantage of FDR systems is that they easily support host mobility: since all hosts participate in routing, there is no notion of a LAN, and network identification can be kept independent of the physical location where a host is connected.

Firewall traversal

Full connectivity among the nodes participating in the routing process is essential in overlay network systems. In FDR systems, full connectivity among all hosts is a necessity, while in router-based systems, connectivity among the resources allocated for routing activity is imperative.

In both cases, the use of firewall traversal techniques becomes inevitable. In many cases, firewalls and NAT gateways are configured to enforce network security policies; sometimes, the introduction of NAT gateways is necessary due to technological limitations (e.g., lack of public IP addresses). Nonetheless, the term firewall traversal will always sound threatening to network and system administrators, and these techniques should be used cautiously.

Many traversal techniques have been proposed and implemented. One common characteristic, shared by all techniques, is the need for at least one well-known, globally reachable node (i.e., one with a public and static IP address). The node in question is used as a synchronization point to bootstrap the communication between two firewalled nodes (e.g., Simple Traversal of UDP through NATs - STUN [69], Traversal Using Relay NAT - TURN [70]) or as an intermediate router (e.g., tunneling techniques in general [71][72]). Some techniques require the dynamic configuration of firewalls (e.g., Generic Connection Brokering - GCB [73] and Cooperative On-Demand Opening - CODO [74]), or support from Internet infrastructure services (e.g., IP Next Layer - IPNL [75], Address Virtualization Enabling Service - AVES [76], Realm Specific IP - RSIP [77]). Dynamically configuring firewalls creates deployment difficulties, due to the understandable skepticism about releasing control of firewalls. Techniques implemented to target specific firewalls also suffer from deployment difficulties, as many firewall implementations exist. Approaches that change Internet equipment require a long time for adoption and widespread deployment, as exemplified by IPv6.

Design of a Virtual Network (ViNe) Architecture for Grid Computing

The previous sections described the design space of virtual networks: the components to be virtualized, the virtualization techniques available, the issues to overcome, how to address these issues, and the strengths and weaknesses of each idea. In the following sections, the design of ViNe is described and qualitatively analyzed. Design decisions are supported by the study presented in

the previous sections. Quantitative analysis and in-depth characterization are presented in Chapter 4. Table 3-2 lists the terminology used in the ViNe description.

Table 3-2. ViNe terminology.
- ViNe: Refers to the overall ViNe architecture or ViNe system. For clarity, "ViNe architecture" or "ViNe system" are used when necessary.
- VN: A virtual network.
- ViNe space: IPv4 sub-space used to uniquely identify ViNe nodes.
- ViNe domain: LAN where a VR is deployed.
- ViNe node: A machine configured in a ViNe space.
- ViNe address: IPv4 address configured in a ViNe node.
- ViNe packet: IP packet generated by ViNe nodes, encapsulated with a ViNe header.
- ViNe header: Information added to IP packets generated by ViNe nodes, used during the routing process.
- VR: ViNe router or virtual router.
- Regular-VR: VR without connectivity limitations imposed by ISPs or firewalls. In general, regular-VRs have static public IP addresses.
- Limited-VR: VR under connectivity limitations imposed by ISPs or firewalls.
- Queue-VR: Regular-VR serving as an intermediate node where packets destined for networks under limited-VR control are queued. Packets destined to limited-VRs are routed to an associated queue-VR, since direct delivery is not possible.
- Destination-VR: The VR to which a ViNe packet needs to be forwarded. It can be a regular-VR or a queue-VR.
- VR-software: ViNe routing software running in VRs.
- VR-host: Machine where VR-software runs. A VR-host running VR-software is a VR.
- ViNe-I: ViNe Infrastructure, the routing foundation of ViNe. VRs collaboratively establish all-to-all virtual links so that communication between any pair of ViNe nodes can be established. ViNe-I consists of VRs and ViNe hosts.
- ViNe-M: ViNe Management, which monitors and controls ViNe-I. ViNe-M consists of middleware that drives VRs by invoking the management interfaces exposed by VRs.
- ViNe ID: A 32-bit integer number that uniquely identifies a VN.

The ViNe architecture needs to support the coexistence of multiple isolated VNs on top of the Internet. This means that it should be possible to define a VN over a collection of resources on demand, tear it down when no longer necessary, and, while resources are part of a VN, offer them full connectivity with each other, independently of the geographical locations and connectivity configurations (e.g., limited or not by firewalls) of each resource.

Figure 3-5. ViNe architecture. ViNe nodes are configured with an address in a ViNe space and use VRs to reach each other. VRs have all-to-all (virtual) links and collectively compose the ViNe Infrastructure. The ViNe Infrastructure is controlled by ViNe Management, a collection of services that monitors and reconfigures VRs whenever necessary.

The ViNe architecture is best described by subdividing it into three main components, as illustrated in Figure 3-5:
- ViNe nodes: nodes configured to participate in the ViNe system.
- ViNe Infrastructure (ViNe-I): responsible for routing ViNe traffic and maintaining full connectivity among ViNe nodes. It consists of VRs and ViNe nodes.
- ViNe Management (ViNe-M): controls and manages the deployment of independent VNs. It consists of middleware that drives VRs by invoking the management interfaces exposed by VRs.

For the benefit of platform independence, ViNe is designed as a router-based system, and no additional software should be required on ViNe nodes, which rely on VRs to

establish communication with each other. The configuration of ViNe nodes and the design of the ViNe Infrastructure are described in the following sub-sections; ViNe-M is described in Chapter 5.

ViNe Address Space and ViNe addresses

The use of IPv4 addresses is convenient to support both existing and newly developed applications, since the Sockets API is well established and used in the majority of network applications. The IPv4 addresses configured in ViNe nodes are called ViNe addresses. There is no restriction on which region of the IPv4 space to use, and the selected region is called the ViNe address space. In fact, it is possible to use the entire 32-bit address space; however, use of the entire space should be limited to deployments where hosts do not need services on the Internet, as machines connected to the Internet would not be reachable from ViNe nodes.

ViNe Node Configuration

Complete software implementations of network interfaces (e.g., TUN/TAP devices) are avoided for the benefit of platform independence. Since ViNe addresses are, in essence, IPv4 addresses, and resources (physical or virtual machines) are assumed to have some network connectivity, configuring a ViNe address on an existing NIC is sufficient for a node to be ViNe-ready; no ViNe-specific network interface is required. When an unused NIC is available, that interface can be configured for ViNe activity. In the case of VMs, most VMMs offer the capability of adding virtual NICs that the guest OS believes to be physical NICs. If the available NIC is already taken (e.g., for Internet activity), IP aliasing can be used (Figure 3-6). IP aliasing allows multiple IP addresses to be bound to one physical NIC, and it is available in most modern OSs.
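As an illustration of how little node-side machinery is needed, the sketch below checks whether a given ViNe address (the address shown is hypothetical) is already bound to some local interface, e.g., after an IP-aliasing step performed by a local administrator or by a ViNe-M setup script:

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

// Minimal sketch: verify that a ViNe address has been configured on this
// node (for example, as an alias on an existing physical NIC).
public class ViNeAddressCheck {
    static boolean hasViNeAddress(String vineAddr) throws Exception {
        InetAddress target = InetAddress.getByName(vineAddr);
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (addr.equals(target)) return true; // alias already bound
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasViNeAddress("172.16.0.10")); // hypothetical ViNe address
    }
}
```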

Figure 3-6. IP aliasing configuration. Windows (top) provides a graphical interface to configure additional IP addresses on a NIC. On Linux (bottom), and in many UNIX flavors, the ifconfig command provides the means for IP aliasing configuration.

Once a node is configured with a ViNe address, its OS routing table needs to be configured with an entry that forces ViNe traffic to be handled by ViNe-I (see Figure 3-7).

Figure 3-7. OS routing table manipulation. The route command allows users with administrator privileges to manipulate the OS routing table. In the example, the OS is instructed to use a designated machine as the gateway to a /24 network in the ViNe space.

ViNe Infrastructure

ViNe-I is the routing substrate of the ViNe architecture. It is composed of virtual (or ViNe) routers (VRs) that cooperate with each other in order to offer full connectivity between nodes that are members of a VN. For simplicity, "VR" refers to a machine running the ViNe routing software, which is called VR-software; to avoid ambiguity, "VR-host" refers to the machine where the VR-software runs.

ViNe nodes are geographically dispersed, belong to distinct administrative domains, and are possibly connected to physical private networks with connectivity limitations imposed by

firewalls. By design, ViNe nodes do not require additional software installation, and ViNe-I is fully responsible for firewall traversal. Nodes within the same LAN (or VLAN) do not have connectivity limitations (see footnote 2), and only ViNe traffic that needs to cross LAN boundaries is handed to ViNe-I. The LAN concept is not lost, and communication between ViNe nodes within a LAN occurs at physical network performance. This is an important advantage over P2P systems, where all nodes participate in routing, with communication overheads even between nodes connected to the same physical switch.

Footnote 2: Connectivity limitations can be imposed by OS firewalls (or personal firewalls), but users or local system administrators have full control over those firewall rules. The necessary configuration should be handled locally at each site to allow nodes to participate in distributed computing.

Each ViNe node needs to be able to directly reach at least one VR, in order to use it as a gateway to ViNe and consequently to VNs. Forwarding packets to a VR (or to routers and gateways in general) is handled through data-link layer (L2) mechanisms, which implies that at least one VR needs to be present in each LAN with ViNe nodes. A ViNe node cannot directly forward packets to VRs in other LANs, as crossing LAN boundaries involves routing (L3, or the IP layer) in the Internet.

Firewall traversal

VRs are, in general, under the same connectivity limitation rules imposed by the physical network infrastructure. The connectivity limitations of VRs impede the application of existing routing algorithms and protocols to ViNe, since these were designed under the assumption that there is at least one path between any given pair of routers: a path can involve multiple routers, but the presence of VRs with connectivity limitations makes a VR graph a non-connected one.

Firewall traversal mechanisms are optimized for different scenarios and setups, and, depending on the firewall types, the solutions may not work. For example, GCB and CODO are

designed for Condor deployments, while STUN works only for UDP traffic. Assuming that all VRs have outgoing access to the Internet (e.g., HyperText Transfer Protocol, or HTTP, access to the web), a solution that uses at least one VR with a well-known public IP address can make communication among all VRs possible. Such a VR is used as an intermediary that queues messages (hereafter, a queue-VR) for VRs suffering from connectivity limitations (hereafter, limited-VRs). When a message needs to be transmitted to a limited-VR, the transmitter sends the message to a queue-VR. The limited-VR can retrieve the message, at its convenience, by opening a connection to the queue-VR, in a similar way that web pages are retrieved by browsers from private networks (see Figure 3-8).

A simple optimization is to maintain an always-open communication channel from a limited-VR to its queue-VR: the limited-VR can open a TCP connection to the assigned queue-VR and keep the connection alive. When a message destined to the limited-VR arrives at the queue-VR, it can be immediately forwarded to the target through the TCP connection, without the need for queuing.

Figure 3-8. Firewall traversal in ViNe. Web browsers connected to private networks can retrieve documents from web servers on the public network by initiating a communication (1). Communication establishment requests create the necessary state in firewalls to allow response messages to reach the requester. In a similar way, limited-VRs open communication channels to queue-VRs to retrieve packets. VRs need to send messages destined to limited-VRs to the respective queue-VRs, as direct delivery is not possible.
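A minimal sketch of the always-open channel optimization follows, assuming a simple length-prefixed framing between the limited-VR and its queue-VR; the actual VR-to-VR wire format is not specified here, and the host name and port are hypothetical:

```java
import java.io.DataInputStream;
import java.net.Socket;

// Sketch of a limited-VR keeping a persistent TCP connection to its
// queue-VR, so queued packets can be pushed immediately instead of polled.
public class LimitedVRChannel {
    public static void main(String[] args) throws Exception {
        String queueVR = "queue-vr.example.org"; // hypothetical queue-VR address
        try (Socket channel = new Socket(queueVR, 5000)) {
            channel.setKeepAlive(true); // keep the connection alive
            DataInputStream in = new DataInputStream(channel.getInputStream());
            while (true) {
                int len = in.readInt();    // assumed length prefix
                byte[] packet = new byte[len];
                in.readFully(packet);      // encapsulated ViNe packet
                // hand the packet to the routing module for local delivery
            }
        }
    }
}
```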

Another optimization is to dynamically assign appropriate queue-VRs according to physical network conditions (i.e., available bandwidth or congestion).

ViNe routing

The firewall traversal technique described previously guarantees that any given VR can send messages to any other VR. In other words, from ViNe-I's perspective, each VR has a direct connection to all VRs (a limited-VR is reached through its queue-VR). From the Internet's perspective, each VR-to-VR link is a set of possible paths, from which one is chosen in the process of Internet routing. Improving Internet routing would require administrative access to Internet routers and/or the development of new routing algorithms and protocols, which is not a goal of ViNe-I. Instead, ViNe-I relies completely on Internet routing for inter-VR communication.

A fully connected network of VRs allows ViNe-I to implement a very simple routing algorithm. There is at least one VR with a direct connection to any given ViNe node (or address). So, given a message generated by a ViNe node, a VR just needs to forward it to a VR capable of delivering it to the destination. This can be accomplished with a routing table that lists destination subnets along with the VRs responsible for each subnet. For subnets under limited-VR control, the routing table should list those subnets along with the corresponding queue-VRs. One possible drawback of this algorithm is that all VRs need to hold the full routing table; however, the 2-hop routing (or 3-hop routing when limited-VRs are involved; see footnote 3) offers low routing overhead. A more detailed performance study is presented in Chapter 4.

Footnote 3: Hop count is defined as the total number of VRs a packet traverses, not the Internet router count. The source node and destination node are not considered in the hop count.

The process of joining ViNe starts by connecting and configuring one VR in a LAN (or VLAN). The assigned ViNe subnet and the VR's IP address are broadcast to all existing VRs, while

the new VR receives the current (and possibly updated) routing table. At this stage, it suffices to understand that VRs can be dynamically configured, and any broadcast or data distribution mechanism can be used to update VR routing tables. For details, see Chapter 5.

Multiple Isolated Virtual Networks

The routing algorithm described previously would allow any ViNe node to reach all others. However, ViNe-I needs to enable the coexistence of multiple isolated VNs, aggregating ViNe nodes into independent groups. This is accomplished by identifying VNs with unique ViNe IDs and by defining ViNe node membership in VNs. Each deployed VN receives an identification: a 32-bit integer number called the ViNe ID. A ViNe node can be configured to participate in one or more VNs; ViNe node membership determines in which VNs a node participates. ViNe node membership information is held on the VRs responsible for the LAN (i.e., the VRs capable of delivering messages to the ViNe nodes in question).

The routing tables maintained by VRs are organized into two types: Local Network Description Tables (LNDTs) and Global Network Description Tables (GNDTs). Each VR maintains the list of ViNe nodes it is responsible for in an LNDT, where an entry describes the ViNe membership of a node. A GNDT stores information about the structure of a ViNe: an entry indicates the physical IP address of the destination-VR to which a packet needs to be forwarded when the destination address falls in the listed range (see the example in Figure 3-9). The information in the LNDT enables a VR to support the participation of ViNe nodes in different virtual networks; GNDTs define the structures of the independent VNs.

When a VR receives a packet for routing, the source host address is verified: if it is not listed in the LNDT, the packet is immediately dropped. The corresponding LNDT entry points to the GNDT that should be used for routing. The packet is then forwarded to the VR described in

the GNDT. The destination-VR verifies whether the destination host of the packet is in its LNDT, and whether the destination host is part of the correct VN.

Figure 3-9. LNDT and GNDT examples. When a host-generated packet destined to a particular network reaches the VR, the virtual network ID is verified in the LNDT. Using the corresponding GNDT, the packet is tunneled to the destination-VR, where it will be delivered. Consider a node sending a packet to a node in another domain: the LNDT indicates the source node as a member of VN 8000. Consulting GNDT 8000, the VR concludes that the packet needs to be sent to the VR with public IP A using the TCP protocol.

Putting it all together

Figure 3-10 illustrates an example where two VNs are defined. VN1 (ID = 1) connects participating hosts (represented by circles) in domains A and B, while VN2 connects participating hosts in domains A and C. All VRs (shown as diamond shapes) receive a copy of the GNDTs of every deployed VN (GNDT-VN1 and GNDT-VN2 in this example). When a packet is sent between VN1 hosts, LNDT-Domain A is checked to find out that the source host is a member of VN1. Then, the VR handling the packet consults GNDT-VN1 and forwards the packet to pub.a.110. Finally, pub.a.110 (the VR serving the destination node) delivers the packet. Packets destined to a /24 subnet whose VR sits behind a NAT gateway in physical space are forwarded to pub.a.110 acting as a queue-VR: the VR of Domain A opens a TCP channel to pub.a.110, from where it receives all messages destined to the ViNe nodes of Domain A.
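The two-table lookup just described can be sketched as follows; this is a simplified illustration with hypothetical addresses and table contents, not the prototype's actual data structures (those are described later, in the Configuration Module subsection):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the LNDT/GNDT lookup a VR performs: the LNDT maps a local ViNe
// node to its VN; the per-VN GNDT maps a destination subnet to the physical
// IP of the destination-VR (or queue-VR) able to deliver the packet.
public class ViNeLookup {
    record Subnet(int network, int mask) {}

    static Map<Integer, Integer> lndt = new HashMap<>();              // node -> ViNe ID
    static Map<Integer, Map<Subnet, String>> gndts = new HashMap<>(); // ViNe ID -> GNDT

    static String destinationVR(int src, int dst) {
        Integer vineId = lndt.get(src);
        if (vineId == null) return null; // source not a member: drop packet
        for (Map.Entry<Subnet, String> e : gndts.get(vineId).entrySet()) {
            if ((dst & e.getKey().mask()) == e.getKey().network()) {
                return e.getValue(); // physical IP of the destination-VR
            }
        }
        return null; // no matching GNDT entry: drop packet
    }

    public static void main(String[] args) {
        int src = (172 << 24) | (16 << 16) | 5;              // 172.16.0.5 (hypothetical)
        int dst = (172 << 24) | (16 << 16) | (1 << 8) | 20;  // 172.16.1.20 (hypothetical)
        lndt.put(src, 8000);
        gndts.put(8000, Map.of(new Subnet((172 << 24) | (16 << 16) | (1 << 8),
                                          0xFFFFFF00), "pub.a.110"));
        System.out.println(destinationVR(src, dst)); // prints pub.a.110
    }
}
```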

Figure 3-10. ViNe at work example. For each physical network ("Domain A", "Domain B" and "Domain C"), a partition of the virtual network space is allocated. All virtual routers receive copies of the GNDTs. Each virtual router maintains the LNDT corresponding to its subspace. If a host in Domain B tries to communicate with a host in Domain C, the packets will be dropped by the VRs, as there is no matching entry in GNDT-VN1 or GNDT-VN2.

Discussion

The presented routing infrastructure, based on VRs, offers end-to-end connectivity between ViNe nodes by defining queue-VRs capable of queuing packets when destination-VRs cannot be reached directly. The implication is that at least one VR needs to be placed on a public network (i.e., at least one queue-VR is necessary), and that hosts that depend on limited-VRs will potentially experience lower network performance than those connected to public VRs.

In order not to disrupt the original Internet services of hosts joining ViNe, private address space (or any non-routed IPv4 address space) is used to provide global virtual identities. Packets destined to the virtual range are redirected to VRs for routing in the virtual space. Since redirection is achieved by defining static routes in hosts, there is no performance penalty for regular Internet traffic. Furthermore, there is no overhead in local communication between hosts.

Configuring a network to be part of ViNe is equivalent to configuring a VR. A VR is a machine (physical or virtual, dedicated or non-dedicated) placed in each participating network with grid resources. The configuration of the routing software can be made very simple, as the creation and management of ViNe are automatically handled through communication between VRs. Virtual machine technology can further help with the deployment of VRs, allowing the transfer and instantiation of pre-configured VRs into target networks. The configuration of hosts is even simpler: hosts need a one-time, very simple system administrator intervention, namely the configuration of a virtual interface (IP aliasing) and the definition of a static route to the ViNe address space.

The ViNe architecture leaves Internet traffic untouched; there is no need to change security policies already in place. However, since the architecture adds a new network interface (a private IP address for ViNe) to each participating host, network firewalls and/or host firewalls need to allow traffic from/to the ViNe private address space. The policies defined for the original LAN can be applied to the ViNe address space. ViNe traffic is only visible in the ViNe space, as the Internet does not route private addresses.

The ViNe architecture supports multiple independent and isolated virtual networks, and the definition, deployment and maintenance of a ViNe can be fully automated. Resource allocation is

under grid middleware control, allowing ViNe creation to be triggered by the middleware, possibly (but not necessarily) in response to requests from hosts to join a virtual network.

From the applications' perspective, ViNe and physical networks are indistinguishable. It is even possible to aggregate machines in different private networks into a cluster and run parallel applications, for example based on MPI, without recompilation or reengineering of software. Host configuration for ViNe does not involve any additional software installation.

The presented approach does not require any change to Internet services. The unmodified Internet routing infrastructure is used to tunnel ViNe traffic between VRs, without requiring any other Internet service. When an Internet service (e.g., DNS) is needed in a ViNe, it needs to be configured in the virtual space, and it can be easily integrated, as no software modification is required.

The ViNe approach does not require software installation on participating hosts. The running OSs are only required to support the definition of multiple IP addresses per NIC and the configuration of static routes, both of which are supported by most modern OSs. This makes the ViNe approach compatible with virtually any platform.

The maximum number of hosts simultaneously connected to a ViNe is limited by the selected IP space (e.g., a 20-bit private address space); it is not an architectural limit. The real limit is much larger, considering that two networks can share the same virtual address partition if it can be established that they are not active at the same time. It is even possible for networks sharing addresses to be active at the same time if they participate in different virtual networks.

Looking forward to the possible adoption of IPv6, firewalls are likely to remain a problem with regard to asymmetric connectivity. Since the ViNe architecture is not tied to IPv4,

the implementation described in Section 6 can be easily modified to support IPv6, in which case a much larger number of hosts can be supported.

Security Considerations

As ViNe creates connectivity between hosts without links in the physical space, many security issues can be raised, especially since connections might have been limited partially due to security concerns. It is important to note that ViNe does not try to increase the security of the Internet, as this is not its objective. Some security aspects may be improved by the way ViNe is architected or implemented but, most importantly, ViNe cannot exacerbate Internet security problems. The following are the main aspects to be considered:

- Isolation between ViNe and the Internet: Communication related to a distributed computation goes through ViNe, while regular Internet services (e.g., web, e-mail) use the Internet. ViNe nodes are connected to the Internet and possibly to multiple VNs. The usual techniques to protect nodes (e.g., network and personal firewalls, anti-virus software) need to be applied to ViNe nodes. VRs do not route packets to or from the Internet, keeping VN and Internet traffic isolated. VRs have direct connectivity to VNs and the Internet, and their implementation needs to follow top quality standards. Connecting a node to one or more VNs opens connectivity to new hosts in the virtual network space, but that was the intent when an organization decided to have some of its resources join ViNe and a grid infrastructure.

- Physical network security reconfiguration: The ViNe approach requires one host in the joining network to work as a VR. This host does not need to have a public IP address, and firewall rules need not be changed to allow incoming connections. All original security policies remain in effect after enabling virtual networking. ViNe traffic is handled exclusively by VRs and, due to the encapsulation of ViNe packets, most, if not all, security policies defined in physical network equipment are bypassed. However, packet encapsulation and unpacking are fully processed by VRs, and unmodified IP packets are delivered to the end ViNe nodes, so security policies, if necessary, can be enforced on the ViNe nodes. The VR module can also be implemented so as to be integrated into the joining network's core equipment, or a VR can be configured to work closely with policy- and security-enforcing equipment, delegating ViNe packet delivery and accepting only outgoing packets that have passed security checks.

- Attacks from the Internet: When VNs are configured to use IPv4 address spaces not routed in the Internet, ViNe nodes are not reachable from Internet nodes unless an Internet router is compromised. If IPv4 spaces routed in the Internet are used for VNs, ViNe nodes are required to be physically isolated from the Internet. In both cases, a compromised VR can open access from the Internet to ViNe nodes, so the VR software needs to be carefully implemented and the resource hosting the VR software needs to be well protected.

- Attacks to the Internet: ViNe-enabling a node neither adds to nor limits the original Internet connectivity of the node. If a compromised node can attack the Internet, it is because of its Internet connectivity, not because it was ViNe-enabled.

- VR protection: VR software should be implemented so that all messages are authenticated and a compromised VR can harm only the ViNe nodes under its control. VR-hosting resources are exposed to the same level of security as any host connected to the same physical network. VR-hosts are assumed to be protected by physical firewalls. While compromising a VR-host might lead to a compromised VR, an attacker does not gain access to VNs by controlling a resource acting as a VR.

- ViNe node protection: The regular means of protection against Internet attacks should be applied to ViNe nodes if Internet services are required. The ViNe architecture does not provide means to enforce security policies among ViNe nodes. However, network security equipment can be placed between ViNe nodes and a VR in order to enforce site security policies on ViNe traffic. On a higher level, VNs can be defined and deployed based on the security and connectivity policies determined by the participating sites. ViNe-M is then responsible for letting site administrators define policies, and for deciding on ViNe creation based on all the involved site-specific policies.

ViNe Prototype Implementation

The ViNe architecture does not require additional software on ViNe nodes for routing purposes. However, for ViNe node configuration and setup purposes, ViNe-M may require simple scripts to run on the nodes. These scripts are completely unnecessary if nodes are manually configured by local system administrators.

The main, and only, component of ViNe-I is the VR. There are many ways of implementing a VR that provides the functionality described in the previous sections. The prototype VR implementation is detailed in the next subsections, where the options and techniques available when implementing the parts of a VR are also discussed. The characterization of the implementation choices is done in the next chapter.

VR-software components

A VR is the gateway of ViNe packets to VNs, and it is responsible for analyzing ViNe packets forwarded by ViNe nodes and transferring them, using the Internet, to another VR that is capable of delivering each packet to its destination node. Three main modules can be identified in

a VR: one that processes incoming packets, one that is responsible for routing, and one that is responsible for delivering packets (Figure 3-11). A fourth component, not directly related to ViNe packet processing, is the one enabling the dynamic reconfiguration of VRs. This component interacts with ViNe-M, and its design is presented in Chapter 5.

Figure 3-11. VR components. The packet interception module intercepts VN-related packets from the OS network stack. The packet injection module injects packets into the LAN where the destination node is connected. The routing module decides where to forward packets received locally (from the packet interception module) or remotely (from other VRs).

Configuration Module

This module is responsible for configuring the VR-software: updating the routing tables (LNDT and GNDTs), setting the TCP and UDP ports used in VR-to-VR communication, and several other parameters, such as the number of worker threads. The first prototype statically reads all the necessary parameters from a file at VR boot time; dynamic reconfiguration requires changes to the configuration files and a reboot of the VR. Improvements to this module are incorporated along with the ViNe-M design and implementation.

The data structures maintained by the Configuration Module include the VR configuration parameters, the LNDT, the GNDTs and the Network Security Key Table (NSKT). All of this data can be dynamically changed by ViNe-M. Several VR configuration parameters, described in Table 3-3, control VR behavior.

Table 3-3. VR configuration parameters.
- MyVSubNet: ViNe subnet for which the VR is responsible. ViNe nodes served by this VR should be configured with IP addresses in this subnet.
- MyVSubNetMask: Network mask of the ViNe subnet.
- MyID: VR identification, a 32-bit number. Typically the physical IP address of the VR.
- MsgSize: Maximum message size. Used for buffer allocation.
- PktSize: Maximum packet size. Used for buffer allocation.
- QueueSize: Size of the packet queue, where intercepted packets are buffered.
- TCPServerPort: TCP port used by a server process listening for communication from other VRs.
- UDPServerPort: UDP port used by a server process listening for communication from other VRs.
- MACEnabled: Flag to enable or disable the authentication of VR messages.
- Limited: Flag indicating whether the VR acts as a limited-VR.
- QueueVR: Physical IP address of the queue-VR (used only by limited-VRs).
- PollProtocol: Protocol used for polling the queue-VR (used only by limited-VRs). 0 = use UDP to poll the queue-VR at regular intervals; 1 = use TCP to poll the queue-VR at regular intervals; 2 = use a TCP channel, i.e., open a TCP connection to the queue-VR and keep the connection alive.
- PollInterval: Poll interval in milliseconds (used only when PollProtocol is 0 or 1).

Table 3-4 lists the information found in an LNDT entry. The LNDT is implemented as a hash table, with ViNe host addresses used as keys and the corresponding ViNe IDs as values.

Table 3-4. LNDT entry information.
- Addr: ViNe host address.
- ViNeID: ID (32-bit integer) of the ViNe of which the host is a member.

Table 3-5 lists the information found in a GNDT entry. The GNDT is implemented as a hash table, with network addresses used as keys and structures carrying the information below as values. Table 3-6 lists the information found in an NSKT entry. The NSKT is implemented as a hash table, with VR physical addresses as keys and the corresponding cryptographic keys as values.

Table 3-5. GNDT entry information.

Parameter  Description
Net        Network address.
Mask       Network mask.
VR         Physical IP of the destination-VR (the VR directly connected to the subnet) or of the queue-VR serving the limited-VR connected to the subnet.
VRProt     Protocol used when communicating with this VR: 1 = UDP; 2 = TCP.
VRPort     TCP or UDP port used when communicating with this VR.
VRType     Type of VR: 0 = Disabled; 1 = Regular; 2 = Queue.

Table 3-6. NSKT entry information.

Parameter  Description
VR         VR physical IP address.
Key        Cryptographic key used to authenticate messages from the VR.

Packet Interception Module

ViNe nodes use regular Internet protocols to forward IP packets to a VR; i.e., an entry in the ViNe node's OS routing table instructs packets with destinations in the ViNe space to be sent to the configured VR. ViNe packets are regular IP packets, with regular IPv4 source and destination addresses in the header. Since an L2 mechanism is used to transfer ViNe packets to a VR, they reach the VR host unmodified, and since the destination address is not that of the VR host's physical interface, packets need to be intercepted and copied to the VR-software's user-level memory space. Packet capturing tools, such as libpcap [78], would allow the ViNe-software to get copies of IP datagrams for processing, but these tools offer no means of stopping the packets of interest from propagating up the VR-host OS stack.

TUN/TAP devices (a virtual point-to-point and a virtual Ethernet network device, respectively) allow user-level applications to read and write IP packets or Ethernet frames from and to the OS kernel network stack. Combined with software bridging, a TAP device would allow the VR-software to obtain IP datagrams for processing. Intercepting packets directly from the OS network stack, with the VR-software implemented as an OS kernel module, would allow an efficient implementation: processing packets at kernel level avoids many unnecessary memory copies. In the case of Linux, the Netfilter infrastructure [80] offers several hooks and APIs for writing network processing modules. One of the capabilities offered by Netfilter is to let user-level programs intercept packets. This capability was used in the reference implementation of the VR-software for initial research and assessment of the ViNe approach. A performance comparison of the options above is presented in Chapter 4. Packet interception and interfacing with the Netfilter APIs are handled by a program written in C. Intercepted packets are handed to the Routing Module (Figure 3-12).

Figure 3-12. Intercepting packets for VR-software processing. Packets sent by ViNe nodes to a VR enter the Linux Netfilter infrastructure in the VR. Since the destination address of the packets does not match the VR's address, the packets traverse the forward chain of Netfilter. The forward chain is configured by the VR-software with a rule that copies packets to a queue for user-level processing (ip_queue module). The VR packet interception module interfaces with the Netfilter ip_queue module to retrieve ViNe packets.
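To make the two sides of this setup concrete, the fragment below sketches the kind of configuration involved; the ViNe subnet (172.16.0.0/16) and VR address (10.0.1.254) are hypothetical, chosen only for illustration.

    # On a ViNe node: route the (hypothetical) ViNe address space
    # through the VR attached to the local LAN.
    ip route add 172.16.0.0/16 via 10.0.1.254

    # On the VR host: load the ip_queue module and have the Netfilter
    # forward chain copy ViNe packets to user level, as described above.
    modprobe ip_queue
    iptables -A FORWARD -d 172.16.0.0/16 -j QUEUE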

Packet Injection Module

Once a ViNe packet is routed and reaches the final VR, it needs to be delivered to the destination ViNe node. This is exactly the opposite of the previous scenario: now a user-level program (the VR-software, in particular the Routing Module) holds a packet that needs to be written to the VR host's physical network interface for final delivery. The Packet Injection Module is responsible for getting packets from the Routing Module and transmitting them through the physical medium (copper, fiber or air). Packet injection is not particularly difficult, as many libraries offer the necessary functionality; in this case, the libnet library [79] has been used (Figure 3-13).

Figure 3-13. Packet injection using libnet. IP packets generated at ViNe nodes are recovered (unpacked) and handed to libnet to be transmitted over the physical network medium.

Routing Module

The Routing Module is the most complex component of the VR-software. This module holds the data structures needed to properly and securely route ViNe packets. The Java language is used to implement this module due to its rich set of libraries and support for convenient data structures. Links to the packet injection and interception modules are established through Java Native Interface (JNI) [81] programming. There are three entry points to the Routing Module (Figure 3-14). The first receives packets sent by ViNe nodes through the Packet Interception Module. The second receives encapsulated packets through VR-to-VR communication. Encapsulated packets can be

queued for later retrieval by limited-VRs or handed over to the Packet Injection Module. The third entry point receives configuration instructions from the Configuration Module.

Figure 3-14. Routing Module. Intercepted packets are encapsulated and transmitted to the appropriate VR. Packets received from other VRs are queued for limited-VRs or delivered to the destination ViNe host.

The behavior of the Routing Module is controlled by ViNe-M, which can change configuration parameters and routing tables dynamically. The Routing Module processes packets incoming from the Packet Interception Module following the steps below:

- Read the source and destination addresses from the IP header.
- Look up the source IP in the LNDT. If the source address is not found in the LNDT, the packet is discarded. The LNDT entry indicates the ViNe ID in which the source node participates.
- Access the GNDT for the corresponding ViNe ID and look up the destination-VR. If the destination address is not described in the GNDT, the packet is discarded.
- Add the ViNe header to the packet.
- Digitally sign the encapsulated packet; currently, a keyed-hash message authentication code (HMAC) is used (a minimal sketch of this step follows the list).
- Send the encapsulated packet over the Internet to the destination VR.
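The fragment below is a minimal sketch of the signing step, using the standard Java crypto API. HMAC-MD5 is assumed here only because the ViNe header reserves a 128-bit word for the MAC; the class and method names are illustrative, not those of the actual VR-software.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class PacketSigner {
        private final SecretKeySpec key;

        public PacketSigner(byte[] sharedKey) {
            // Key shared with the destination VR (as held in the NSKT).
            this.key = new SecretKeySpec(sharedKey, "HmacMD5");
        }

        // Returns the 16-byte MAC placed in the ViNe header's MAC field.
        public byte[] sign(byte[] encapsulatedPacket) throws Exception {
            Mac mac = Mac.getInstance("HmacMD5");
            mac.init(key);
            return mac.doFinal(encapsulatedPacket);
        }
    }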

Encapsulated packets received from other VRs as a result of the ViNe routing process, or retrieved from queue-VRs by limited-VRs, are processed as follows:

- Authenticate the packet using HMAC. Packets that fail authentication, including Internet packets without a ViNe header, are dropped.
- Read the ViNe ID of the packet from the ViNe header.
- Strip off the ViNe header.
- Read the destination IP address from the IP header.
- Check in the LNDT whether the destination IP is a member of the appropriate ViNe. If not, the packet is discarded.
- Hand the packet to the Packet Injection Module for final delivery.

Encapsulated packets are exchanged between VRs using UDP or TCP over the Internet. A VR can run a UDP server process, a TCP server process or both. A limited-VR does not run server processes, as it cannot accept connections from other VRs; instead, it retrieves encapsulated packets by initiating connections to server processes running on designated queue-VRs (note that a limited-VR can have multiple queue-VRs). An encapsulated packet is essentially the original IP packet from a ViNe node with the ViNe header added (Table 3-7). The ViNe header is illustrated in Figure 3-15.

Figure 3-15. ViNe header.

Table 3-7. ViNe header fields.

Field              Description
Size               Total length of the packet, including the ViNe header.
Type               Type of ViNe packet: 0 = Forward (ViNe packet forwarded, to be unpacked and delivered to the destination ViNe node); 1 = Queue (ViNe packet to be queued for the limited-VR that is able to deliver it to the destination node); 2 = Get from queue (request from a limited-VR to retrieve queued packets); 3 = Null (dummy packet, in general used for heart-beats and to keep TCP connections alive).
State              State of the ViNe packet: bit 0 = raw, or signed with MAC (bit set); bit 1 = encrypted with secret key (bit set) or not; bit 2 = encrypted with public key (bit set) or not.
ViNe ID            ViNe ID to which the packet belongs.
VR Source ID       32-bit integer associated with the VR sending the packet.
VR Destination ID  32-bit integer associated with the VR receiving the packet.
MAC                128-bit word HMAC.

ViNe Prototype

This section reports preliminary performance results for the ViNe Infrastructure, obtained using the reference VR-software implementation. The Netperf network benchmark program [82] was selected for the measurements. Netperf supports running benchmarks multiple times and outputs results with statistical parameters; in particular, the ability to set a minimum number of runs (iterations), the desired confidence level and the confidence interval was useful.

VR performance

To evaluate VR performance, the TCP throughput and round-trip latency were measured between two hosts in different private networks, routed by two directly connected VRs (each responsible for one host). The VRs were connected through a Gigabit Ethernet link, with round-trip time (RTT) around 100 µs and TCP throughput around 880 Mb/s (Figure 3-16).
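For reference, invocations along the following lines exercise the Netperf options described above (the target address is hypothetical): -I sets the confidence level and interval, and -i bounds the number of iterations.

    # TCP throughput, iterated until a 5% interval at 99% confidence is reached
    netperf -H 172.16.0.2 -t TCP_STREAM -I 99,5 -i 30,3

    # Request/response transactions, from which round-trip latency is derived
    netperf -H 172.16.0.2 -t TCP_RR -I 99,5 -i 30,3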

Figure 3-16. VR performance experimental setup. The network performance of ViNe nodes that use VRs connected through local Gigabit links is measured to benchmark the packet-processing capability of the VR-software.

Different CPU configurations were used to evaluate the dependence of the VR-software on CPU performance. For each experiment, a pair of machines with the same hardware configuration was used as VRs. Figure 3-17 shows the end-to-end Netperf results collected between two Xeon 2.4 GHz based machines acting as ViNe nodes.

Figure 3-17. VR performance results. TCP throughput and round-trip latency between ViNe nodes connected through VRs with Gigabit connectivity. The performance is limited by the VR-software processing performance. Different CPU configurations are used in the VRs.

The VR-software is a multi-threaded program that can take full advantage of multi-processor environments. The results show that Hyper-Threading (HT) technology also improves VR performance. Generating an HMAC for each packet is expected to consume considerable processing time, so two measurements were taken for each type of CPU: with and without HMAC enabled. TCP throughput increases by approximately 50% when HMAC is disabled. Although the reference implementation is not optimized, VR performance using modern CPUs is sufficient to support wide-area traffic.

ViNe performance

To evaluate ViNe performance in a more realistic environment, the experimental setup illustrated in Figure 3-18 was deployed, with hosts at the University of Florida (UF), Purdue University (PU) and Northwestern University (NWU). Figure 3-18 also shows the measured TCP round-trip latency and unidirectional throughput of the physical links involved.

Figure 3-18. ViNe performance experimental setup. Several cases are considered, depending on where firewalls are inserted in relation to VRs.

The following scenarios are considered:

- Case 1: all VRs have access to the public network, without firewall limitations.
- Case 2: as Case 1, but PU has a limited-VR, which opens a TCP channel to the UF queue-VR.
- Case 3: as Case 1, but PU has a limited-VR, which opens a TCP channel to the NWU queue-VR.
- Case 4: PU and NWU have limited-VRs, which open TCP channels to the UF queue-VR.
- Case 5: PU and UF have limited-VRs, which open TCP channels to the NWU queue-VR.

Table 3-8 summarizes the results. Case 0 represents the available physical performance; Cases 1 to 5 represent measurements with HMAC, and Case 1a represents a measurement without HMAC. Case 1 is the most favorable scenario, with all VRs connected to the public network. Inserting the VRs into the host-to-host route has a small impact on latency: it increases by about 1 to 1.5 ms. Also, it was possible to push TCP packets at nearly the available throughput for UF-PU and UF-NWU communication; this exemplifies the case in which network speed, not VR throughput, limits communication rates. The performance degradation in NWU-PU communication exemplifies the case in which VR throughput, not network speed, determines bandwidth. Referring to Figure 3-18, the measured bandwidth corresponds to the processing limit of the machine acting as the VR at PU; with more processing power (or lighter VR processing), it would be possible to get closer to the available throughput. This is also confirmed by Case 1a: throughput improved by approximately 50% when HMAC computation was disabled. For Case 1a, a UDP-datagram bandwidth of 90 Mbit/s, matching the available physical performance, was measured (UDP measurements are not shown in Table 3-8).

Table 3-8. ViNe performance experimental results. Average round-trip latency (in ms) and unidirectional TCP throughput (in Mb/s) for each host pair (UF-PU, UF-NWU and NWU-PU) under Cases 0, 1, 1a and 2 through 5, for the WAN setup involving the University of Florida (UF), Northwestern University (NWU) and Purdue University (PU).

VRs behind a firewall must contact a queue-VR to retrieve packets. Cases 2 and 3 illustrate how the allocation of the queue affects performance: Case 2 assigns UF as the queue-VR for PU, while in Case 3 the NWU VR is assigned. Case 2 shows poor performance because, in addition to the extra hops and queue-VR overhead, communication is re-routed through the slowest of the VR-to-VR routes (i.e., UF-to-NWU). The Case 2 throughput of 12 Mbit/s between the PU and NWU VRs is not the result of VR overheads degrading the physically available bandwidth of 63.1 Mbit/s between PU and NWU, but the effect of the lowest of the bandwidths between PU and UF and between UF and NWU (i.e., 17.5 Mbit/s). In contrast, since PU has better connectivity to NWU than to UF, using the NWU VR as the queue-VR enables Case 3 to achieve performance close to that of Case 1 for all communication. Cases 4 and 5 consider the scenario in which only one VR is publicly accessible. In this setup, allocating the NWU VR as the queue-VR proved to be the best choice: only UF-PU communication degraded, and it still exhibited reasonable performance.

CHAPTER 4
ON THE PERFORMANCE OF VIRTUAL NETWORKS

Characterizing Network Virtualization

The vast majority of network virtualization research focuses on architectural features that address problems in the base network, providing enhanced overlay or virtual network environments. Thus, there are neither generic approaches nor data available to elucidate the performance limits of overlay networks (ONs) and to enable a fair performance comparison of different approaches. Given the ever-increasing demand from distributed applications, ON performance characterization is of vital importance, especially for production deployments.

A network virtualization layer is at the core of all ON designs. It is responsible for receiving or intercepting network messages generated by applications before they enter the base (or physical) network infrastructure; preparing each message for overlay routing (i.e., encapsulating it into an overlay message format, and encrypting, compressing and digitally signing it as needed); transporting the message using overlay routing mechanisms (the actual data transfer occurring in the physical network, in general through the Internet); and finally recovering and delivering the original message to the intended destination. Applications can experience degraded network performance, compared to using the base network infrastructure directly, due to the time spent in these processing steps.

The use of ONs is not even considered in several systems, owing to the belief that ONs perform poorly. This belief is justified by the fact that ONs are, in general, implemented as user-level software executed on regular computers, in many cases using high-level languages (e.g., Java and C#), while high-performance networks are implemented using specialized hardware and software. The use of general-purpose platforms is motivated by the easy development and

deployment of prototypes. Rich sets of libraries and data structures, together with multiplatform support, encourage the use of high-level languages to reduce development cycles. Network performance research has shown that packet forwarding rate, a key aspect of network virtualization software, can be optimized by taking advantage of specialized hardware and firmware, or by improving the OS network stack [83][84][85][86][87][88][89]. Applying similar approaches to user-level network processing is challenging, especially when using general-purpose hardware and OSs. In the following subsections, the performance of network virtualization components is characterized. This study can be used to understand the performance of existing ON solutions, to indicate potential bottlenecks and to guide the implementation of new systems. An important result is the set of performance improvements introduced to ViNe, making it one of the best performing user-level virtual network approaches reported to date.

Experimental Setup

To characterize the performance of ONs, four machines are used in the experiments (Figure 4-1). Two machines act as ON routers connecting the two other machines, each of which is connected to an independent gigabit Ethernet segment, representing two LANs. A third gigabit Ethernet segment, connecting the two routers, represents a high-speed WAN in the Internet. The use of a gigabit LAN link to represent the Internet is intended to push the ON routers to their maximum capacity. 10-Gbit Ethernet is still an emerging technology with limited deployments, and other research efforts have shown difficulties in handling such network demand [90][91][92]. This work conducts experiments on the widely adopted gigabit Ethernet; the experiments confirmed that modern commodity servers are able to fully utilize the available bandwidth, indicating that memory and I/O connectivity do not limit network performance.

Figure 4-1. Experimental setup. Four machines, each with four Intel Xeon 5130 based CPU cores, are used for the experiments. Independent Gb Ethernet segments are used as LANs.

Each machine has two dual-core Intel Xeon 5130 processors, two Broadcom BCM5708-based integrated gigabit Ethernet interfaces and 4 GB of fully buffered DDR2 memory. The operating system is Linux, with Sun Microsystems' Java runtime. Network performance is measured using the Netperf [82] benchmark program, configured to run as many experiments as needed for a confidence level of 99% and a confidence interval of 5%. For network measurements in Java, a port of the benchmark was developed.

Virtual Network Processing

Figure 4-2 illustrates a generic ON processing system and identifies the flow of data as packets are processed and/or transformed by the building blocks of the system. An ON processing system can be present in all nodes participating in an ON (e.g., in P2P systems), or only in the nodes responsible for ON routing, which are used by other nodes as gateways to ONs (e.g., in a site-to-site VPN setup). Network messages generated by applications can be captured in different layers of the OS TCP/IP stack: interfaces exist to capture Ethernet frames, IP datagrams, or TCP/UDP messages, and application-layer approaches interface directly with applications. Defining new network APIs would require applications to adapt to the new interfaces, but this approach offers low packet

interception overhead. On the other hand, if packets are captured at lower layers (Ethernet or IP), existing applications can be supported without modification, but the highest overhead is incurred, as the entire OS stack is traversed before interception.

Figure 4-2. Overlay network processing system. Packets are intercepted before entering the physical network infrastructure. Routing decisions are then made, and the packets are encapsulated, encrypted and/or compressed as needed. Encapsulated packets are sent through the physical infrastructure to the destination node, where they are decapsulated, decrypted, expanded and delivered. This system can be present in all nodes participating in an ON, or only in the nodes responsible for ON routing, which are used by other nodes as gateways to ONs.

The TCP/IP stack is also used to transport ON messages. Intercepted packets, with the necessary transformations, are encapsulated into ON messages and routed to the destination using regular Internet mechanisms (in general, TCP and/or UDP). ON routing decides where in the Internet intercepted messages need to be forwarded. Before forwarding, compression and encryption can be applied to intercepted messages, and ON-related information can be added as a header. The original message is recovered just before it is delivered: decapsulation, decryption and expansion are executed as needed. The recovered message then needs to be injected into the TCP/IP stack of the destination node; as with packet interception, this can be accomplished in different layers of the stack.

Encapsulation Overhead

ON information is, in general, added to application-generated messages in the form of headers. If IP packets are intercepted and tunneled using TCP/IP transport, the messages traveling on the physical layer will have two TCP/IP headers in addition to the ON header; even when the ON header size is zero, the ON overhead for this example is 52 bytes. Interception of Ethernet frames increases the ON overhead to 66 bytes. The maximum throughput experienced by applications can be calculated using Equation 4-1:

\[ \mathrm{Throughput} = \frac{\mathrm{MTU} - \mathrm{IP_{header}} - \mathrm{Transport_{header}} - \mathrm{ON_{overhead}}}{\mathrm{MTU} + \mathrm{Eth_{overhead}}} \times \mathrm{LineSpeed} \tag{4-1} \]

where the transport header (Transport_header) is 8 bytes for UDP and 32 bytes for TCP (in modern Linux kernels, due to the optional header fields supporting window scaling, selective acknowledgements and timestamps), and the IP header (IP_header) is 20 bytes. The Ethernet overhead (Eth_overhead) is 38 bytes: 14 bytes of header, 4 bytes of frame checksum, 12 bytes of inter-frame gap and 8 preamble bytes. The numerator represents the length of application data carried per frame: the IP Maximum Transmission Unit (MTU) minus headers and ON overhead. The denominator is the data length effectively transmitted on the physical layer, which includes the Ethernet overhead. The maximum throughput decreases linearly at a rate of 0.65 Mbps per byte of VN overhead. The encapsulation (ON) overhead depends on where in the network stack messages are intercepted, and on the tunneling mechanism used. The maximum TCP and UDP throughputs that can be experienced by applications as a function of VN overhead are plotted in Figure 4-3.

Figure 4-3. Maximum TCP and UDP throughput versus VN header overhead. The maximum throughput decreases linearly at a rate of 0.65 Mbps per byte of VN overhead in gigabit Ethernet with a 1500-byte MTU. Note that 1 Gbps is never available, due to TCP/UDP, IP and Ethernet headers.

IP fragmentation, often overlooked, can occur due to the increase of in-transit message size caused by the encapsulation process. When MTU-sized messages are intercepted, they become larger than the MTU after encapsulation, resulting in the transmission of two IP messages: one of MTU size and an additional small message, causing the performance degradation expressed in Equation 4-2.

\[ \mathrm{Throughput} = \frac{\mathrm{MTU} - \mathrm{IP_{header}} - \mathrm{Transport_{header}}}{(\mathrm{MTU} + \mathrm{Eth_{overhead}}) + \max(84,\ \mathrm{Eth_{overhead}} + \mathrm{IP_{header}} + \mathrm{ON_{overhead}})} \times \mathrm{LineSpeed} \tag{4-2} \]

where the expression max(84, Eth_overhead + IP_header + ON_overhead) represents the overhead of the additional frame due to fragmentation. It is at least the minimum Ethernet frame size of 84 bytes: 64 bytes for the minimum frame (headers, payload and checksum) and 20 bytes for the inter-frame gap and preamble. The data length generated by applications is at most the MTU without IP and transport headers. The data transmitted on the physical layer is one full frame (MTU plus Ethernet overhead) and an additional frame to transport the exceeding data (essentially the VN overhead); this small frame may not fill the minimum frame length (64 bytes plus inter-frame gap and preamble bytes). If TCP is used as the transport, Nagle's algorithm [93] can alleviate the problem by combining subsequent messages in bulk data transfers. Theoretically, IP fragmentation can cause over 30

Mbps of performance degradation. In practice, the effect of IP fragmentation is more damaging, due to the difficulty of handling small packets. Table 4-1 compares the maximum TCP and UDP throughput experienced by applications with and without IP fragmentation for representative ON overhead sizes: 14 bytes represents an additional Ethernet header, 20 bytes an additional IP header, 28 bytes additional IP and UDP headers, and 52 bytes additional IP and TCP headers.

Table 4-1. Maximum UDP/TCP throughput (Mbps) with (frag) and without IP fragmentation.

ON overhead  UDP    UDP frag  TCP    TCP frag
14           948.0  907.5     932.4  892.7
20           944.1  907.5     928.5  892.7
28           938.9  906.4     923.3  891.6
52           923.3  893.2     907.7  878.6

Packet Interception

One simple way to intercept application-generated messages is to receive packets directly. This requires defining ON APIs and writing applications that use the defined interfaces. Alternatively, system-call interception can be used to capture messages from network-related system calls invoked by applications. The advantages of these approaches include smaller VN headers, since messages are captured before receiving OS-generated TCP/IP headers, and reduced system processing time due to the early capture of messages. The drawback is that existing applications cannot be supported without modification, and adapting applications to new APIs is not always possible. Capturing packets at lower layers of the stack opens the possibility of supporting the execution of existing applications, unmodified, that communicate through ONs. This

characteristic leads to ONs designed to intercept messages at the lower layers of the stack, in spite of the higher ON overhead. Mechanisms for interception include TUN/TAP devices [64], raw sockets and the queue mechanisms of the Linux Netfilter infrastructure [80]. Advanced methods for packet capture have also been developed in recent years, including PF_RING [94], nCap [95] and specialized hardware [84]. While recognizing the potential of these mechanisms to improve packet interception capacity, this work concentrates on evaluating traditional methods, without optimizations that are not expected to be present in common configurations. Packet interception throughputs for each mechanism, measured using the localhost and Ethernet interfaces, are shown in Figure 4-4.

Figure 4-4. Packet interception performance of TUN/TAP devices (tun), raw sockets (raw) and the queue interface of Netfilter (nf). The gigabit curve represents the packet rate necessary to keep a gigabit line at maximum utilization. Measurements using the localhost (lo) and Ethernet (eth) interfaces are plotted. Small-packet measurements are influenced by the inability to generate packets at line speed.

Experiments using the localhost interface try to capture the interception performance when it is not limited by the speed of the wire. However, the packet generator and injector programs compete for the interface, with negative effects for small packets. For small packets, measurements

using Ethernet interfaces expose the interception capacity limits of each method. Measurements using the localhost interface show the ability to intercept MTU-sized packets at rates higher than 1 Gbps. TUN/TAP devices and raw sockets intercepted packets larger than 400 bytes at the rate of a fully utilized gigabit line. Interception of packets using the Netfilter mechanism performed the worst.

Packet Injection

Packet injection is used in the final stage of virtual network processing: after packets are routed through the ON infrastructure, they need to be delivered to the destination node. Packets can be injected at any layer of the network stack. Inter-process communication can be used by ON libraries to deliver messages to applications, and system-call interception can be used to inject packets into the network stack.

Figure 4-5. IP packet injection performance of TUN/TAP devices and raw sockets. Packet injection into Ethernet networks using TUN/TAP devices requires the kernel to forward packets written to the TUN/TAP device to an Ethernet device. Since the TUN/TAP injection rate is higher than the kernel packet forwarding rate, packet loss occurs, making accurate measurement difficult. For this reason, only raw-socket injection to Ethernet (eth) is shown.

Available mechanisms to inject IP packets or Ethernet frames into the destination node (or LAN) include raw sockets and TUN/TAP devices. The measured IPv4 packet injection throughput for each approach is shown in Figure 4-5. The use of TUN/TAP devices on the localhost interface shows a clear advantage over raw sockets, suggesting that TUN/TAP devices are better suited for local delivery. For implementations of ON routers, which need to deliver packets to other nodes in a LAN, raw sockets are more appropriate.

Routing

Routing is the act of finding a path from a source node to a destination node. In ONs, these paths are usually implemented by transport-layer (e.g., TCP or UDP) connections on the Internet. A key operation during routing is the IP address (or overlay node ID) lookup. Many data structures and algorithms for fast IP address lookup are available in the literature; however, many are intended for, or depend on, specialized hardware (e.g., network processors), making them difficult to use in general-purpose computers. Fortunately, ONs can choose addressing schemes that facilitate the address lookup process. For example, when IP addresses are used, fixed-size prefixes can be defined, so that simple arrays or hash tables can be used to store routing information. When hash tables are used, the hashing operation requires special attention: if complex hash functions are used, computing the hash can become the limiting factor of the address lookup process. For example, Distributed Hash Table (DHT)-based P2P systems use costly cryptographic hash functions. It is possible to compute the hash code of a Java String object containing an IPv4 address in about 50 ns, while MD5 and SHA1 take 770 ns and 1210 ns, respectively.
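The sketch below illustrates the point for a hypothetical ON that allocates addresses out of a fixed /16 prefix: the low 16 bits of the address index a plain array, so a lookup costs one masked array access and no hashing. The names are illustrative, not ViNe's actual data structures.

    public class ArrayRoutingTable {
        // One slot per host in a hypothetical /16 overlay prefix.
        private final RouteEntry[] table = new RouteEntry[1 << 16];

        public static final class RouteEntry {
            final int vrAddress;  // physical IPv4 of the destination router
            final int vrPort;     // TCP or UDP port of that router
            RouteEntry(int vrAddress, int vrPort) {
                this.vrAddress = vrAddress;
                this.vrPort = vrPort;
            }
        }

        public void add(int overlayAddress, RouteEntry entry) {
            table[overlayAddress & 0xFFFF] = entry;
        }

        // Constant-time lookup: a single array access, no hash computation.
        public RouteEntry lookup(int overlayAddress) {
            return table[overlayAddress & 0xFFFF];
        }
    }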

Figure 4-6 illustrates the average access time of a routing table implemented as an array of Java objects and as a hash table of Java objects. The experimental results show that the routing process in ONs can be extremely fast when appropriate data structures are used.

Figure 4-6. Routing table access time using Java hash tables (String object hash code) and arrays. Accessing tables with 1 million elements takes a few nanoseconds when using arrays.

Virtual Links

As TCP or UDP communication can be used to implement virtual links in overlay routing, it is important to understand their performance. The lossless, low-latency environment provided by a LAN is ideal for pushing network software to its peak performance: if the base network offered performance lower than what the network virtualization software can process, the limiting factor would be the base network instead of the VN software. Figure 4-7 shows the throughput of TCP and UDP on a gigabit connection, using the C and Java sockets interfaces. The effect of Nagle's algorithm [93] when bursting small TCP messages is clearly visible. UDP performance for small messages is substantially lower than what can be transported on a gigabit link; however, for MTU-sized messages, UDP offers better throughput due to its smaller header size.

Java sockets performed slightly worse than native sockets when dealing with small packets; no difference can be noticed for packets larger than 450 bytes, for both UDP and TCP.

Figure 4-7. UDP and TCP performance. For small messages, Nagle's algorithm makes TCP better than UDP. C and Java sockets perform similarly, with a slight disadvantage for Java in the case of small messages.

Cryptographic Operations

Figure 4-8. Processing time of symmetric-key encryption for different algorithms (AES, RC4, Blowfish, DES and 3DES, in ECB and CBC modes) in Java. For each algorithm, the processing times to encrypt messages from 1 to 1500 bytes are represented in different colors.

Cryptographic operations are used in ONs when confidentiality and/or authentication of messages is necessary. Two representative operations are secure hashing and symmetric-key encryption/decryption. The processing times of symmetric-key encryption with different algorithms, using Java implementations, are shown in Figure 4-8. The processing times of the MD5 and SHA1 cryptographic hash functions, measured using Java implementations, are shown in Figure 4-9.

Figure 4-9. Processing time of MD5 and SHA1 in Java versus data length. MD5 is faster for all message sizes.

Cipher-block chaining (CBC) mode added a considerable amount of processing time compared to electronic codebook (ECB) mode. The performance of cryptographic operations is very sensitive to the implementation: although the Blowfish algorithm performed the worst in the Java measurements, measurements using code written in C showed a disadvantage for triple DES (3DES).

Compression

Depending on the traffic pattern, compression has the potential to improve overall data transfer performance. The factors that determine whether compression is beneficial are the compression ratio and the time required for compression and expansion.

Experiments were run using three different types of data: a text file, a binary executable file and a JPEG image file. Each file is read into blocks of fixed size, which are then compressed. Figure 4-10 shows the time spent in compression and expansion, as well as the obtained compression ratio (computed as the uncompressed size divided by the compressed size).

Figure 4-10. Processing time of Java-based compression and expansion of text, binary and image files, divided into fixed-size blocks. Good compression ratios are observed for the text and binary files, while the already-compressed image file cannot be compressed further. The expansion operation is faster than the compression operation.

As expected, the best compression is obtained with the text file, while no compression is observed for the image file. The processing time needed for compression proved to be higher than that needed for encryption. The expansion operation is considerably faster than the compression operation.
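A sketch of the kind of measurement behind Figure 4-10, using the java.util.zip classes, appears below; the block size and input data are placeholders (the actual experiments used text, binary and JPEG files).

    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class CompressionTimer {
        public static void main(String[] args) throws Exception {
            int blockSize = 1500;               // placeholder block size
            byte[] block = new byte[blockSize]; // placeholder (compressible) data
            for (int i = 0; i < blockSize; i++) block[i] = (byte) ('a' + i % 26);

            byte[] compressed = new byte[2 * blockSize];
            Deflater deflater = new Deflater();
            long t0 = System.nanoTime();
            deflater.setInput(block);
            deflater.finish();
            int clen = deflater.deflate(compressed);   // compress one block
            long t1 = System.nanoTime();

            byte[] restored = new byte[blockSize];
            Inflater inflater = new Inflater();
            inflater.setInput(compressed, 0, clen);
            inflater.inflate(restored);                // expand it back
            long t2 = System.nanoTime();

            System.out.printf("compress: %d ns  expand: %d ns  ratio: %.2f%n",
                    t1 - t0, t2 - t1, (double) blockSize / clen);
        }
    }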

Discussion

To fully utilize a gigabit link, the packet throughput required from a virtual network processing system is given by Equation 4-3:

\[ \mathrm{FrameRate} = \frac{\mathrm{LineSpeed}}{\mathrm{FrameSize} + \mathrm{Ethernet_{preamble}} + \mathrm{Ethernet_{InterframeGap}}} \tag{4-3} \]

In gigabit Ethernet networks, 81,274 MTU-sized frames can be transmitted every second, implying that every packet needs to be processed in less than 12.3 µs. The worst-case scenario occurs when dealing with minimum-sized frames, in which case 1,488,095 frames need to be processed every second; i.e., only 672 ns are available for network processing without incurring performance degradation.

As observed in Figure 4-4 and Figure 4-5, packet interception and injection do not occur at line speed for small messages, so gigabit performance cannot be expected in these cases. Maximum performance is expected during bulk data transfer, when MTU-sized messages are transmitted in bursts. Since TCP data transfers require acknowledgement messages on the reverse channel, the forwarding rate of small packets needs to be at least the same as the forwarding rate of MTU-sized messages. All tested packet interception methods were able to deliver line-speed interception for packets larger than 500 bytes. Measurements using the localhost interface show a disadvantage for the Netfilter infrastructure compared to raw sockets and TUN/TAP devices, indicating that capturing packets through Netfilter requires more processing, as confirmed by tests using slower CPUs. For small packets, although interception at line rate is not possible, enough performance to support full-speed bulk TCP transfers was obtained.

In terms of packet injection, TUN/TAP devices show a clear advantage over raw sockets when delivering packets locally. Both approaches showed enough performance to keep a gigabit

line busy when bursting packets larger than 200 bytes. However, using TUN/TAP devices to deliver packets to a LAN requires special care: packets written to a TUN/TAP device need to be forwarded to an Ethernet device, and since the packet rate of TUN/TAP devices is much higher than that of Ethernet devices, packet loss can happen in the OS kernel during forwarding.

Fast address lookup can be critical during route processing in overlay networks. Complex data structures should be avoided when building ON routing tables. If hash tables are necessary, fast hash functions should be used; if complex operations become absolutely necessary for route computation, cache data structures can help speed up address lookup. Experiments show that, with appropriate ON addressing schemes, routing information can be accessed in a few nanoseconds.

For virtual links, permanent TCP connections can perform better than UDP under small-packet traffic. However, keeping a large number of open TCP connections (or sockets) is often not recommended, and UDP communication is better suited; UDP also offers higher throughput under bulk data transfer.

The fastest encryption algorithm in CBC mode spends about 10 µs to encrypt blocks of 1 to 1500 bytes, which is almost all the time allowed to process one packet. Since end-to-end confidentiality can only be achieved if encryption is performed at the application layer, encryption in ONs should be avoided when seeking performance. Interestingly, computing digests using secure hash functions takes about the same time as symmetric-key encryption in CBC mode; this indicates that using shared-key encryption instead of a keyed-hash message authentication code (HMAC) for simple message authentication should be considered.

The processing time of compression, which is larger than the maximum time allowed for packet processing, makes its use not worthwhile in high-speed networks. However, compression can be useful in environments with limited bandwidth and data with high compression ratios.

IP Forwarding Performance

The key to network virtualization software is the ability to intercept packets, encapsulate them and forward them to the destination node, where the original packets are recovered and delivered (injected). While in-kernel processing intuitively offers the best performance, ONs are implemented as user-level programs, often in high-level programming languages such as Java and C#. The results in the previous section show that VN components can, at least for bulk data transfers, deliver the necessary packet rate in gigabit Ethernet networks. In this section, the potential IP-packet-forwarding performance of user-level network virtualization software is experimentally evaluated. To this end, IP packet forwarders were developed in C and Java.

    Interception Thread {
        Loop {
            intercept an IP packet;
            encapsulate it into a UDP message;
            forward it to the other router using UDP;
        }
    }

    Injection Thread {
        create a UDP server socket;
        Loop {
            receive a UDP message;
            recover the IP packet;
            deliver/inject the IP packet to the destination;
        }
    }

Figure 4-11. Pseudo-code of the developed IP packet forwarders. Different configurations, with varying numbers of interception/injection threads, were evaluated. The effects of asynchronous I/O through the use of Java NIO were also examined.
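As a sketch of the injection-thread half in Java, the loop below receives encapsulated packets over a NIO DatagramChannel into a direct buffer; the port number is arbitrary, and the native injection routine is a placeholder for the JNI-wrapped raw-socket code that the actual forwarders used.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    public class InjectionLoop {
        // Placeholder for the JNI-wrapped raw-socket delivery routine.
        private static native void injectRaw(byte[] packet, int length);

        public static void main(String[] args) throws Exception {
            DatagramChannel channel = DatagramChannel.open();
            channel.socket().bind(new InetSocketAddress(5005)); // arbitrary port
            ByteBuffer buffer = ByteBuffer.allocateDirect(2048); // direct buffer
            byte[] packet = new byte[2048];
            while (true) {
                buffer.clear();
                channel.receive(buffer);        // one encapsulated IP packet
                buffer.flip();
                int length = buffer.remaining();
                buffer.get(packet, 0, length);  // recover the original packet
                injectRaw(packet, length);      // final delivery to the node
            }
        }
    }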

The developed software runs on machines acting as routers (see Figure 4-1), while the remaining nodes exercise the software forwarders by running network benchmarks. Each IP forwarder consists of two threads: one thread intercepts packets and forwards them to the destination forwarder using UDP communication; the other waits for UDP messages containing encapsulated IP datagrams, which are recovered and injected into the destination node (Figure 4-11). Raw sockets are used for packet interception and injection. Since Java does not offer interfaces for raw sockets, they are opened using C routines integrated through JNI. Transmitted Ethernet frames have double IP headers, one UDP header, one TCP header (or another UDP header, if the applications running on the compute nodes use UDP) and one Ethernet header. The expected maximum TCP and UDP throughputs experienced by applications, using Equation 4-1, are:

\[ \mathrm{TCP_{throughput}} = \frac{1500 - 20 - 32 - 28}{1538} \times 1\ \mathrm{Gbps} = \frac{1420}{1538} \times 1\ \mathrm{Gbps} = 923.3\ \mathrm{Mbps} \]

\[ \mathrm{UDP_{throughput}} = \frac{1500 - 20 - 8 - 28}{1538} \times 1\ \mathrm{Gbps} = \frac{1444}{1538} \times 1\ \mathrm{Gbps} = 938.9\ \mathrm{Mbps} \]

IP Fragmentation

The IP forwarders developed for experimental evaluation use UDP as the transport and encapsulate intercepted IP packets into UDP messages. Due to the additional 28 bytes of IP and UDP headers, when MTU-sized messages are intercepted, the forwarding process results in the IP layer fragmenting each message into two parts. In this case, the expected maximum TCP and UDP throughputs, using Equation 4-2, are:

\[ \mathrm{TCP_{throughput}} = \frac{1448}{1538 + \max(84,\ 86)} \times 1\ \mathrm{Gbps} = \frac{1448}{1624} \times 1\ \mathrm{Gbps} = 891.6\ \mathrm{Mbps} \]

\[ \mathrm{UDP_{throughput}} = \frac{1472}{1538 + \max(84,\ 86)} \times 1\ \mathrm{Gbps} = \frac{1472}{1624} \times 1\ \mathrm{Gbps} = 906.4\ \mathrm{Mbps} \]

One solution to avoid IP fragmentation is to configure nodes with an appropriate MTU size. Alternatively, the Internet Control Message Protocol (ICMP) can be used as per [96]: the developed IP forwarders minimize the need for IP fragmentation by generating ICMP "datagram too big" messages when packets that cannot be tunneled without fragmentation are captured. Table 4-2 compares the performance of Linux built-in IP forwarding with the C and Java forwarders. C (frag) and Java (frag) represent forwarders that do not avoid IP fragmentation. No encapsulation is involved in the pure IP forwarding performed by the Linux kernel.

Table 4-2. Maximum UDP/TCP throughput (in Mbps) and round-trip latency of minimum and MTU-sized messages (in ms) experienced by applications when using Linux built-in IP forwarding and the IP forwarders developed in C and Java; rows: Linux, C, Java, C (frag) and Java (frag); columns: UDP, TCP, rtt(min) and rtt(MTU).

Packet Interception vs. Copy

Raw sockets and the Netfilter infrastructure allow copying packets from the kernel to user-level programs. If no action is taken to drop the copied packets in the OS kernel, they continue through the network stack. This causes additional network traffic, which can degrade overall network performance. Packets can be dropped using network firewall rules. During the experiments, over 50% performance degradation was observed when copied packets were not dropped in the kernel. This is not a problem when using TUN/TAP devices, as the packets read are delivered to these devices by the kernel and do not traverse the network stack.

Java Forwarder Performance Tuning

Although the forwarder implemented in Java was able to deliver close to the maximum data throughput for UDP applications, significant performance loss was observed with TCP applications. The difference is that during TCP bulk data transfer, TCP acknowledgement packets are generated by the receiving node, which does not occur in UDP bulk transfer. Since TCP acknowledgement packets are small, and the UDP measurements show good performance for MTU-sized packets, the problem appeared to be in small-packet processing. Experiments with small packets confirmed that the C code can process 64-byte frames at a substantially higher rate than the Java code. Assuming the sender transmits MTU-sized messages at full capacity, 81,274 frames are sent every second; in this situation, the Java forwarder was not able to process the load of acknowledgement packets sent by the receiver. After modifying the Java code to use the new I/O (NIO) interfaces and direct buffers, the throughput for minimum-sized Ethernet frames was improved beyond the required rate of 81,274 frames per second. This allowed applications to experience near-maximum TCP throughput (916 Mbps).

Effects of Virtual Network Processing Time

To simulate ON operations on captured packets, a delay loop was introduced between packet interception and forwarding through UDP communication. The UDP and TCP data throughputs experienced by the compute nodes using the forwarders are shown in Figure 4-12.

Figure 4-12. Effects of network processing during packet forwarding. Packet processing times as low as 5 µs can cause network applications to experience performance degradation.

Holding a packet before forwarding can cause network applications to experience performance degradation. Due to the time spent on packet interception and injection, the time allowed for processing MTU-sized packets is smaller than 12.3 µs. The ON performance is determined by the packet rate of the network virtualization software. TCP applications are the most affected: while UDP applications could tolerate up to 5 µs of virtual network processing on the forwarders, TCP application performance degrades for lack of capacity to process the TCP acknowledgement traffic. For longer processing delays on the forwarders, the expected maximum data throughput can be easily predicted. For example, if a forwarder needs 50 µs to process one packet, applications can only transmit MTU-sized packets every 50 µs, giving a maximum UDP throughput of 1472 bytes/50 µs = 29.44 MBps (or 235 Mbps, which is consistent with the measured throughput).

Using Worker Threads

To evaluate the impact of multi-threading on ON performance, particularly the use of worker threads for packet processing, the Java forwarder was modified to use multiple worker threads. The time spent on packet processing is simulated using delay loops. Each worker thread has a packet buffer into which intercepted packets are written directly by the interception thread.

Figure 4-13. Effect of using worker threads for packet processing. For small processing times, a small number of worker threads performs better. As the processing time for each packet increases, a larger number of threads improves network performance. Spawning more worker threads than the number of available CPU cores causes performance degradation.
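A minimal sketch of the worker-thread structure just described, using the java.util.concurrent thread pool, is shown below; the delay loop stands in for VN processing, and the names are illustrative.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class WorkerPoolForwarder {
        private final ExecutorService pool;
        private final long delayNanos;  // simulated per-packet processing time

        public WorkerPoolForwarder(int workers, long delayNanos) {
            this.pool = Executors.newFixedThreadPool(workers);
            this.delayNanos = delayNanos;
        }

        // Invoked by the interception thread for every captured packet.
        public void submit(final byte[] packet, final int length) {
            pool.execute(new Runnable() {
                public void run() {
                    long deadline = System.nanoTime() + delayNanos;
                    while (System.nanoTime() < deadline) {
                        // busy loop standing in for encapsulation, MAC, etc.
                    }
                    // forward(packet, length);  // UDP send to peer forwarder
                }
            });
        }
    }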

Worker threads can be in an idle or busy state; packet interception only happens when at least one thread is idle. Threads are managed by the Java built-in thread pool. Figure 4-13 shows the experimental results for different numbers of threads: the top graph shows the TCP throughput perceived by the nodes as a function of the processing time spent by the forwarders, from 0 to 50 µs, and the bottom graph shows the TCP throughput for higher packet processing times. Since the machines used in the experiments have 4 CPU cores, using more than 4 worker threads degrades performance; in addition, the interception and injection threads compete for CPU cycles. Using the maximum number of workers only helps when the processing time needed for each packet is high.

Case Study: OpenVPN

OpenVPN [97] is an open-source SSL-based VPN solution implemented mostly in C. It implements all the components identified in Figure 4-2. Performance measurements of OpenVPN version 2.1, in a site-to-site setup and for different operating configurations, are summarized in Table 4-3. The basic OpenVPN configuration used UDP transport, no compression and all cryptographic operations in CBC mode. When not using cryptography, nodes were able to transfer data using UDP at the theoretical limit. TCP transfer suffers from the poor performance when dealing with small packets; this indicates that OpenVPN was not able to provide the necessary packet rate for TCP acknowledgements. As expected, enabling compression had only negative effects. Table 4-3 reveals the following facts:

- From the round-trip time measured using MTU-sized messages, it can be estimated that OpenVPN spends around 10 µs per MD5 operation and 15 µs per SHA1 operation. This is close to the results in Figure 4-9, obtained using Java.

- Also from the rtt of MTU-sized messages, it can be estimated that OpenVPN spends on average 45 µs for Blowfish-CBC encryption (better than the Java results in Figure 4-8), 32 µs for AES-CBC encryption (worse than Java) and 221 µs for 3DES-CBC (substantially worse than Java).
- Given the processing time spent on cryptographic operations, OpenVPN would benefit from the use of multiple threads.
- TCP performance suffers from the lack of throughput for small messages.

Table 4-3. Maximum UDP/TCP throughput (Mbps) and round-trip latency of minimum and MTU-sized messages (ms) experienced by applications when using OpenVPN to connect two LANs. Different configurations of encryption and message authentication are reported; rows (cipher, HMAC): (none, none), (none, MD5), (none, SHA1), (3DES, none), (AES, none), (BF, none) and (AES, MD5); columns: UDP, TCP, rtt(min) and rtt(MTU).

Improving ViNe

Several aspects of the ViNe router (presented in Chapter 3) have been improved based on the studies presented in this chapter:

- All the evaluated packet interception mechanisms have been incorporated into the ViNe interception module, with the possibility of changing the active mechanism dynamically. The first prototype supported only the Linux Netfilter infrastructure.
- The packet injection module has been rewritten to take full advantage of raw sockets. The first prototype used the libnet library.
- Mechanisms to avoid IP fragmentation have been incorporated.
- The routing module has been rewritten so that faster data structures are used.
- Encapsulated (ViNe) messages are transmitted using the Java NIO package.
- Thread management has been changed to use thread pools, minimizing thread-creation overheads. The number of worker threads is also reconfigurable.

These improvements enable VRs to operate with low per-packet processing time, boosting VR routing capacity to over 800 Mbps.

Summary

This chapter characterizes the performance of VN processing on general-purpose computers under the control of a general-purpose OS. Existing VN solutions implement network virtualization as user-level software running on regular computers. For this reason, physical network performance research results do not necessarily apply, because they assume the use of specialized hardware (e.g., network processors), real-time operating systems or improvements to the OS network stack.

Network virtualization is, in general, based on tunneling. Consequently, encapsulation overhead sets the bounds on maximum performance. For simple UDP-based tunnels, 28 bytes are added to the packets generated by applications; on a gigabit link, this represents at least 18.2 Mbps taken by the virtualization information. For approaches that require more than 64 bytes of VN headers, over 50 Mbps are spent on encapsulation.

TUN/TAP devices and raw sockets showed good packet interception and injection throughput, enough to keep gigabit lines fully utilized for packets larger than 500 bytes. For small packets, interception and injection do not occur at gigabit rate; this is a minor issue, since small packets are mostly used by interactive applications, for which latency matters more than throughput.

Experiments using the IP packet forwarders developed in C and Java highlighted that only a small amount of time, on the order of a few microseconds, is available for executing network virtualization software without degrading overall network performance. Due to synchronization overheads, the use of multi-threading is beneficial only when the packet processing time is high (over 50 µs), when forwarding throughput becomes lower than

250 Mbps. Furthermore, the use of cryptographic operations can significantly decrease the performance of ONs.

User-level network virtualization performance depends strongly on the system's processor speed. The experimental results reported in this chapter are specific to the processor used in the experiments, a 2 GHz Intel Xeon (Woodcrest core); however, the fundamental limits, such as the small amount of time available for network processing and the encapsulation overheads, are processor-independent. It is also worth noting that the processor used is not the fastest available on the market. Even so, the experimental results indicate that it is possible to implement fully featured network virtualization in Java without significant performance degradation, provided that fast cryptographic algorithms are used.

Another important result is that network virtualization software does not benefit from using multiple threads, especially when the time spent processing each packet is small, implying that a machine specialized in virtual network routing requires a low CPU-core count. P2P systems can also benefit from this result: CPU cores can be left free for application processing (instead of virtual network processing), causing less interference with application performance.

As exemplified in the case study, simple measurements using the ping tool can give an idea of how much time is spent in overlay network processing. The study described in this chapter has shown that this simple measurement can provide valuable information about user-level ON software, i.e., it can be used to estimate the TCP or UDP throughput, the number of threads to use, and the trade-off between performance and security (encryption).

Finally, the results of this study were of vital importance in improving the performance of ViNe. ViNe routing performance was improved from about 200 Mbps to over 800 Mbps, making it the best performing user-level managed virtual network approach reported to date.

CHAPTER 5
ON THE MANAGEMENT OF VIRTUAL NETWORKS

Managed Network

Broadly speaking, a managed network is an infrastructure that combines network devices and tools with performance monitoring, diagnosis and reconfiguration capabilities, enabling proactive operation, maintenance, administration and provisioning of networks. LAN network devices evolved from unmanaged designs (e.g., broadcast coaxial cables and hubs) to switches and routers with rich management features. As more features became available, configuration and management became an error-prone and complex process requiring highly specialized human resources. While management aid tools exist, they tend to be complex enough that misconfigurations lead to network downtime. In addition, management tools are device-vendor dependent and not designed for projects spanning multiple administrative domains.

As presented in previous chapters, user-level overlay networks offer features to cope with the heterogeneity of resources across different administrative domains. However, as ON projects grow in size, the complexity of network management starts to interfere with smooth operation; a good management infrastructure is therefore essential for ONs. There are two ON management design styles:

- Focus on overlay communication self-organization, with optimizations that favor data search, access and storage on the overlay (e.g., structured P2P networks [49][50][51][52]).
- Focus on dynamic configuration of overlay network elements, enabling flexible management (e.g., X-Bone [43], ViNe [21]).

Structured P2P designs encourage the participation of a large number of nodes, which can grow without control, since no authentication or access control is enforced on participants. Performance tuning concentrates on optimizing the data distribution throughout the structured

overlay. This strategy, in general, results in the data transfer capacity of the physical infrastructure being underutilized. To address these shortcomings, management add-ons are developed to control the otherwise unmanaged system. In the second design style, the overlay routing system is designed with management-supporting features: management is based on interfaces offered by the ON system, and all aspects of the overlays (e.g., creation of ONs, membership of hosts and performance tuning) are controlled by driving the ON system components. In the remainder of this chapter, the architectural design of a user-level virtual network management infrastructure is described in the context of the ViNe project. The solutions can be easily recast in other projects, provided that the necessary management interfaces are available.

Challenges in Network Management

In a multiple-administrative-domain scenario, the following problems need to be addressed by network management tools:

- A management tool needs to communicate with every participating network device, which is not always possible: for security reasons, the management ports of network devices are connected to isolated private networks in each domain.
- Administrative privileges on all network devices would need to be released by all participating domains so that tools are able to take the necessary actions. This is not accepted by almost any network usage policy.
- Existing management mechanisms are designed for LANs and enterprise networks (i.e., single administrative domains) and do not necessarily work across WANs. For example, VLANs cannot be deployed across domains, especially when different Internet providers are involved.

These problems are addressed by using a user-level virtual network infrastructure with management interfaces (as exemplified by ViNe-I), driven by a user-level management infrastructure, as follows:

- User-level network management is about manipulating the behavior of the network virtualization software deployed in the nodes responsible for overlay routing. The necessary connectivity is naturally established by the overlay routing. No access to the management networks of participating domains is required, and access to core network devices is not required at all.
- The mechanisms necessary to securely access remote resources (machines running overlay routing software) are well established (e.g., grid middleware) and enable the management of VN components. Moreover, overlay network activity can be monitored and controlled in the physical network infrastructure (e.g., overlay traffic can be blocked if suspicious activity is detected).
- Overlay network projects are designed for WAN deployments, so management systems should also work in WAN environments.
- User-level network management does not require changes to the core network infrastructure, and hence does not depend on vendor-specific management mechanisms.

User-level Virtual Network Management

The management of virtual networks depends on functionality that must be offered by the routing infrastructure. Just as physical networks require switches and routers to offer management features (e.g., VLAN capability, Simple Network Management Protocol support and management interfaces), virtual networks require software components designed to work in a managed environment, as described in Chapter 3 in the context of the ViNe infrastructure. Important areas of interest include security, configuration, operation, monitoring and tuning.

Desirably, VN administration should be a shared task entailing decoupled actions by ON administrators, site administrators, VN self-management (performed by the grid middleware) and end users. Each participant requires a different set of interfaces. For middleware, programmability is important, i.e., interfaces that can be easily invoked. For administrators, a complete set of features offers maximum flexibility to manipulate VNs. For end users, from whom specialized knowledge is not expected, complex processes need to be automated. The following subsections describe the required interfaces in each functional area listed above.

Security

All entities participating in the VN management process, which include VRs, administrators, end users, and middleware acting on their behalf, must be properly identified and authenticated. Identification and authentication prevent anonymous (virtual) network access and are essential to enforce access control. In typical grid environments, identification is based on digitally signed certificates as defined in the public key infrastructure (PKI) [45]. ViNe takes advantage of the GSI [15], which defines host certificates (used to identify VRs), user certificates (used to identify individuals as administrators or end users), and proxy certificates (used to identify middleware acting on behalf of an individual). GSI also enables the establishment of VR-to-VR and user-to-VR secure encrypted channels, essential for the exchange of sensitive information, such as VN routing tables and per-user performance metrics.

For authorization and access control purposes, ViNe defines the roles listed in Table 5-1. Middleware, which uses proxy certificates, assumes the role assigned to the entity that generated the proxy.

Table 5-1. ViNe management roles.
- VR: Information about internal control data structures of VRs (e.g., routing tables and buffer sizes) is only available to VRs and ViNe management components.
- ViNe Administrator (ADM): Privileged user responsible for ViNe user management.
- Site Administrator (SA): Privileged user who can manipulate the local resources on a ViNe-enabled site.
- VN Administrator (VNA): User who creates and controls VNs.
- End Users (EU): Regular user with limited privileges.
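The GSI plumbing itself is out of scope here, but the channel-establishment pattern can be illustrated with standard Java TLS. The following minimal sketch assumes PKCS12 keystores standing in for a VR's host credential and the deployment's trusted CAs; the file names, port, and class name are illustrative, not part of ViNe.

    import javax.net.ssl.*;
    import java.io.FileInputStream;
    import java.security.KeyStore;

    // Sketch: mutually authenticated VR-to-VR channel using plain JSSE
    // in place of GSI credentials.
    public class SecureVrChannel {
        static SSLContext buildContext(String keyStore, String trustStore, char[] pw)
                throws Exception {
            KeyStore ks = KeyStore.getInstance("PKCS12");
            ks.load(new FileInputStream(keyStore), pw);   // this VR's certificate + key
            KeyStore ts = KeyStore.getInstance("PKCS12");
            ts.load(new FileInputStream(trustStore), pw); // CAs trusted by the deployment

            KeyManagerFactory kmf =
                KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
            kmf.init(ks, pw);
            TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(ts);

            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
            return ctx;
        }

        public static void main(String[] args) throws Exception {
            SSLContext ctx = buildContext("vr-host-cert.p12", "vine-ca.p12",
                                          "changeit".toCharArray());
            SSLServerSocket server =
                (SSLServerSocket) ctx.getServerSocketFactory().createServerSocket(9000);
            server.setNeedClientAuth(true); // reject peers without a valid certificate
            try (SSLSocket peer = (SSLSocket) server.accept()) {
                String dn = peer.getSession().getPeerPrincipal().getName();
                System.out.println("authenticated peer VR: " + dn);
                // routing tables and other sensitive data flow over this channel
            }
        }
    }

Requiring client authentication (setNeedClientAuth) is what makes the channel mutually authenticated: a peer that cannot present a certificate chaining to a trusted CA is rejected during the handshake.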

Configuration and Operation

ViNe routing software has been implemented to be dynamically reconfigurable, and the management of VNs consists of changing the operating parameters of VRs. To facilitate configuration and operation, a basic set of high-level interfaces is offered, as listed in Table 5-2. Management activities are carried out by sequencing and combining these basic operations (a sketch of the corresponding programmatic interface follows the table).

Table 5-2. ViNe management operations and the ViNe roles authorized to execute each operation.
- Register (VR/ADM/SA/VNA/EU): All ViNe entities are required to be registered with the ViNe Authority (VA). The VA oversees global VN management, as described in the ViNe-M subsection.
- List sites (SA/VNA): Get information about ViNe-enabled sites.
- List VRs (VR): Get information about active VRs.
- List hosts (SA/VNA/EU): Get information about active ViNe hosts.
- Create VN (SA/VNA/EU): Given a list of hosts, deploy a new VN.
- Shutdown VN (VNA): Deactivate a VN.
- Merge VNs (VNA): Merge two VNs.
- Split VN (VNA): Split a VN into two independent VNs. Two mutually exclusive groups of hosts must be specified.
- List VN hosts (VNA/EU): Get information about the hosts that are members of a VN.
- List VNs (SA/VNA/EU): Get information about active VNs.
- Join VN (SA/VNA/EU): Enable a ViNe host to become a member of a particular VN.
- Leave VN (SA/VNA/EU): Disable a ViNe host's membership in a VN.
- Get routing tables (VR): Retrieve ViNe routing tables from the ViNe Authority.
- Update routing tables (VR): Retrieve ViNe routing table updates from the ViNe Authority.
- Get performance metrics (VR): Retrieve performance metrics collected by VRs. The information is used by the ViNe Authority when computing routing tables for VN creation.
- Manage users (ADM): Assign roles to entities.
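To make the notion of driving the ON system through interfaces concrete, the operations of Table 5-2 can be rendered as a programmatic interface. The sketch below is hypothetical Java (the language of the ViNe prototype); the method names mirror the table, while parameter and return types are assumptions.

    import java.util.List;

    // Hypothetical rendering of the Table 5-2 operations; identifiers are
    // illustrative, not the prototype's actual API.
    public interface ViNeManagement {
        enum Role { VR, ADM, SA, VNA, EU }

        void register(Role role, String entityId);    // all entities register with the VA
        List<String> listSites();                     // SA/VNA
        List<String> listVRs();                       // VR
        List<String> listHosts();                     // SA/VNA/EU
        String createVn(List<String> initialHosts);   // returns a new VN identifier
        void shutdownVn(String vnId);                 // VNA
        String mergeVns(String vnA, String vnB);      // VNA
        String[] splitVn(String vnId,                 // VNA: two mutually exclusive groups
                         List<String> groupA, List<String> groupB);
        List<String> listVnHosts(String vnId);        // VNA/EU
        List<String> listVns();                       // SA/VNA/EU
        void joinVn(String vnId, String hostId);      // SA/VNA/EU
        void leaveVn(String vnId, String hostId);     // SA/VNA/EU
        byte[] getRoutingTables(String vrId);         // VR: fetch tables from the VA
        byte[] updateRoutingTables(String vrId, long since); // VR: incremental updates
        String getPerformanceMetrics(String vrId);    // input to VA route computation
        void manageUsers(String entityId, Role role); // ADM: assign roles
    }

Middleware composes these primitives: deploying a virtual cluster, for example, amounts to a createVn call followed by joinVn calls as nodes come online, with the VA enforcing the role checks of Table 5-2 on every call.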

Monitoring and Tuning

ViNe allows site administrators to use existing network monitoring technologies to collect information about physical and virtual network traffic. ViNe monitoring therefore focuses on ViNe-specific parameters, such as the load of the machines running ViNe software, delay and bandwidth estimates for VR-to-VR communication, heartbeats of VRs, and traffic patterns. The monitored parameters open performance tuning opportunities as follows (a heartbeat sketch appears after this list):

- VR load balancing: in sites with a large number of ViNe hosts, a single VR can become overloaded when handling traffic to and from all ViNe hosts. Additional VRs can be started automatically when the load of the active VRs becomes too high, lowering the amount of work of each VR. When VRs become idle, the total number of VRs can be reduced.
- VR-to-VR communication: VN performance is directly affected by VR-to-VR communication. VRs can monitor performance parameters such as bandwidth in use and delay using the packets flowing in the established channels. Since artificial probes are not used, no bandwidth is lost to monitoring activities. Delay and bandwidth estimates for each pair of VRs can be used to compute alternate routes when a particular path becomes congested.
- Heartbeat: the ViNe Authority tracks the health of active VRs via heartbeats, which are sent by VRs at regular intervals. Alerts are sent to site administrators when VRs become inactive.
- Traffic pattern: inspection of ViNe packets gives hints about the traffic. For example, TCP port 22 is usually used for SSH and port 443 for HTTPS transfers. Both have end-to-end encryption performed in the application layer, so there is no need for ViNe to use secure channels when routing these types of traffic. ViNe can also monitor the compression rate of packets and dynamically adjust the compression of communication channels to maximize performance.
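As an illustration of the heartbeat mechanism, the sketch below shows a VR periodically reporting its liveness and a cheap load estimate to the VA over UDP. The VA address, port, message format, and reporting interval are illustrative assumptions, not the ViNe wire protocol.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: a VR heartbeat sender; all names and numbers are illustrative.
    public class VrHeartbeat {
        public static void main(String[] args) throws Exception {
            InetAddress va = InetAddress.getByName("va.vine.example.org"); // hypothetical VA
            DatagramSocket socket = new DatagramSocket();
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(() -> {
                // Piggyback a value the VR already has; may be -1 on some platforms.
                double load = java.lang.management.ManagementFactory
                        .getOperatingSystemMXBean().getSystemLoadAverage();
                String msg = "vr-id=vr42 load=" + load + " ts=" + System.currentTimeMillis();
                byte[] buf = msg.getBytes(StandardCharsets.UTF_8);
                try {
                    socket.send(new DatagramPacket(buf, buf.length, va, 9001));
                } catch (Exception e) {
                    // Transient failures are acceptable: the VA alerts site
                    // administrators only after several missed heartbeats.
                }
            }, 0, 30, TimeUnit.SECONDS);
        }
    }

Because the report piggybacks values the VR already maintains, monitoring adds no probe traffic, consistent with the measurement strategy described above.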

ViNe Management Architecture

ViNe management is based on dynamically reconfiguring the operating parameters of VRs, with actions triggered by administrators, users and middleware (acting on behalf of administrators and users). VR reconfiguration is always controlled by a ViNe Authority, an entity overseeing global VN management. By not letting VRs themselves control VN management, ViNe limits the damage caused by potentially misconfigured VRs or attacks. Figure 5-1 illustrates the overall architecture of ViNe-M.

Figure 5-1. ViNe Management Architecture. The necessary VR configuration operations are controlled by a ViNe Authority in response to requests from users and administrators (or middleware acting on their behalf). Actions affecting resources local to a site (e.g., enabling a new host) can be directly processed by the VRs in charge of the domain; if necessary, VRs can request additional services from a ViNe Authority.

ViNe Authority

The ViNe Authority (VA) maintains all the information necessary to manage ViNe deployments, including data about ViNe-enabled sites, active VRs, active ViNe hosts and users. Static information is collected during the ViNe registration process. To ViNe-enable a site, an administrator needs to submit the site's data to the VA, including its physical network subnet, geographical location and contact information. Users, hosts and VRs are also required to be registered. Dynamic information, such as the set of active VRs and their performance metrics, is collected from VR monitoring sensors. The VA stores all information relevant to ViNe deployments and can be used by VRs as the point of contact to recover their configuration state when ViNe software needs to be restarted.

Address Allocation

ViNe supports routing both unmodified physical IP addresses and ViNe-assigned virtual IP addresses. Virtual addresses are taken from a block not active in the Internet (i.e., not routed).

Sub-blocks of ViNe addresses are assigned by the VA on a per-VR basis during the registration process. The VA makes sure that the blocks assigned to VRs are mutually exclusive, unless VRs cooperate on the ViNe processing of a particular domain and need to share the same block. Addresses on ViNe hosts are assigned by running a ViNe host configuration program, which essentially makes a ViNe host contact its local VRs to obtain an address. The VA is responsible for global ViNe address assignments (in blocks, per VR), while VRs are responsible for local ViNe address assignments (a sketch of the block allocation follows). When physical IP addresses are used, creation of a VN is limited to domains using mutually exclusive address blocks, i.e., two domains using the same private address block cannot be connected. Mixing physical addresses and ViNe addresses on a VN is allowed.
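A minimal sketch of the block-based assignment follows, assuming for illustration that the reserved ViNe space is a /16 carved into /24 sub-blocks, one per VR (shared on request by cooperating VRs). The prefix, block sizes, and class shape are assumptions; ViNe's actual allocation records live in the VA.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: the VA hands out mutually exclusive /24 sub-blocks of a
    // reserved, non-routed /16 (prefix such as "172.31" is illustrative).
    public class ViNeAddressAllocator {
        private final String prefix;
        private int nextSubnet = 0;
        private final Map<String, String> blockByVr = new HashMap<>();

        public ViNeAddressAllocator(String reservedSlash16) {
            this.prefix = reservedSlash16;
        }

        // Returns the VR's existing block or carves a fresh one; VRs that
        // cooperate on the same domain may share a block.
        public synchronized String blockFor(String vrId, String shareWithVr) {
            if (shareWithVr != null && blockByVr.containsKey(shareWithVr)) {
                String shared = blockByVr.get(shareWithVr);
                blockByVr.put(vrId, shared);
                return shared;
            }
            return blockByVr.computeIfAbsent(vrId, id -> {
                if (nextSubnet > 255) throw new IllegalStateException("block exhausted");
                return prefix + "." + (nextSubnet++) + ".0/24";
            });
        }
    }

Individual host addresses are then handed out by the local VR from its block, matching the host configuration program described above.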

VN Creation and Tear-down

The process of creating a new VN consists of the following steps:

- A user sends a VN creation request to the VA. The request should contain a list of the initial VN members.
- After authenticating the request, the VA computes the routing tables necessary to realize the requested VN. At this stage, authorization and access control actions can be enforced.
- The VA contacts the involved VRs through secure channels to transmit updates to the GNDT and LNDTs. For security reasons, VRs only accept routing changes digitally signed by the VA.
- Once the VRs are reconfigured, the new VN is deployed and ready.

VN tear-down consists of similar steps. In this case, the VA looks up in its records the routing information for the VN to be destroyed, and contacts the involved VRs to remove all relevant routing entries from the GNDT and LNDTs.

VN Merging and Splitting

As in VN creation, VN merge and split operations are accomplished through routing table manipulations.

Consider a ViNe deployment as illustrated in Figure 3-10, where two independent VNs (VN 1 and VN 2) are active across three domains. In order to merge the two VNs (to form a new VN 3), changes to the routing tables are necessary as illustrated in Figure 5-2: the GNDT entries of VN 1 and VN 2 are united, and the necessary membership updates are executed in the LNDTs.

Figure 5-2. ViNe merge example. Compare the routing table entries with the ones illustrated in Figure 3-10.

Merge and split operations follow the same steps described for VN creation and tear-down. The VA is able to merge VNs, given the VN identifiers, by computing the new routing table entries and updating the relevant VRs. The split operation requires as input the list of hosts that are members of the VN of interest, divided into two groups. The VA generates the necessary updates to the GNDT and LNDTs and reconfigures the involved VRs (see the sketch below).
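The routing-table view of merge and split can be sketched as set operations on GNDT entries. The model below is a deliberate simplification in Java: entries are opaque strings, both input VNs are assumed to exist, and the signed-update push to VRs is reduced to a comment.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: GNDT modeled as VN id -> set of routing entries; the real
    // tables and their encoding are internal to ViNe.
    public class VnMergeSplit {
        private final Map<String, Set<String>> gndt = new HashMap<>();

        // Merge: the new VN routes over the union of both entry sets.
        public String merge(String vn1, String vn2, String vn3) {
            Set<String> union = new HashSet<>(gndt.remove(vn1));
            union.addAll(gndt.remove(vn2));
            gndt.put(vn3, union);
            return vn3; // the VA now pushes signed updates to the affected VRs
        }

        // Split: entries are partitioned by the two mutually exclusive host
        // groups supplied with the request (entries assumed to name hosts).
        public void split(String vn, Set<String> groupA, Set<String> groupB,
                          String vnA, String vnB) {
            Set<String> a = new HashSet<>(), b = new HashSet<>();
            for (String entry : gndt.remove(vn)) {
                (groupA.contains(entry) ? a : b).add(entry);
            }
            gndt.put(vnA, a);
            gndt.put(vnB, b);
        }
    }

After the in-memory tables change, the VA distributes digitally signed GNDT/LNDT updates to the affected VRs, exactly as in VN creation.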

VN Membership

The LNDT in each VR controls the VN membership of ViNe hosts. VN membership changes can be requested by ViNe hosts (or an end user) by directly contacting the VR in charge of the connected domain. The VR contacts the VA to request clearance for the change, since the VA performs all the necessary access control checks. VN membership changes can also be initiated by the VA, in response to requests from VN administrators.

CHAPTER 6
VIRTUAL NETWORK SUPPORT FOR CLOUD COMPUTING

Networking in Cloud Environments

Cloud computing broadly refers to the use of managed distributed resources to deliver services to multiple users, often using virtualization to provision execution environments as needed by applications and users. The vision, shared with grid and utility computing, is that users should be able to use and pay for computational services just as electrical appliances use the power grid. Satisfying the requirements of multiple users and applications is one of the challenges faced in providing computing as a service. Cloud computing and VM technologies enable an attractive solution in which a good part of the responsibility for resource configuration and application environment management can be shifted to end users or application providers. To make this responsibility shift possible, end users must have administrative privileges on allocated resources. VM technology enables servers to be partitioned into slices running one or more VMs, where each slice is isolated from the others, making it possible for different users to act as administrators of their own VMs. Many cloud computing facilities take advantage of these VM properties to offer compute cycles as Infrastructure-as-a-Service (IaaS) [98][99][100][101][102].

Allowing end users to control their resources raises security concerns, especially in the networks connected to those resources. Unlike in grid environments, users can execute privileged network operations such as using raw sockets and enabling promiscuous mode on network interface cards (NICs), opening opportunities for undesired, possibly malicious, network activity. To avoid potential damage, cloud computing providers carefully control the network environment of deployed VMs, imposing limitations on the VMs' network traffic.

The connectivity limitations imposed on VMs can cause problems when deploying distributed applications on the cloud. This problem is more evident when computation involves resources across independent clouds (informally called sky computing), as in the case of high-performance computing applications requiring resources in excess of those available from a single provider. Network overlay technologies, as described in previous chapters, could be considered as potential solutions, since they were developed to overcome connectivity limitations on the Internet and enable the execution of distributed applications. However, these technologies are either inefficient or do not work under the new network restrictions imposed on cloud computing resources. In the following subsections the inter-cloud communication problem is characterized, and solutions to overcome the identified challenges are proposed and implemented by extending ViNe.

Network Protection in IaaS

IaaS providers limit the network traffic to and from VMs that are under the full control of end users. Since users can execute privileged network operations, many opportunities for undesirable activity exist, including the following:

- Change of IP address: a misconfigured IP address can interfere with the provider's infrastructure and also with VMs owned by different clients.
- Change of MAC address: MAC address conflicts can interfere with the physical infrastructure of providers and/or VMs owned by different clients.
- Promiscuous mode: users can configure NICs in promiscuous mode and gather information about the network (by examining broadcast and address resolution protocol (ARP) traffic), which can be used for malicious activity.
- Raw sockets: with raw sockets it is possible to launch many known network attacks based on IP-address spoofing, proxy ARP, and flooding, among others.

Protection mechanisms are based on traffic monitoring, inspection of packets, network address and port translations (NAPT) and packet filtering. Network protection can be enforced

on the host servers of VMs and/or on network infrastructure devices. Network devices implement rules that apply to both physical and virtual machines, while host servers can enforce rules specific to the hosted VMs. The following are the most commonly used mechanisms:

- Network address and port translations: in many cases, users want VMs to have a presence on the public Internet. To avoid assigning public IP addresses to VMs, providers offer public Internet presence through NAT devices. One advantage of NAT is that server farms do not need to be present on the public network, facilitating the protection of servers in a datacenter. Not assigning public IP addresses directly to VMs protects against misbehaving VMs that might create conflicts with other machines on the public network.
- Sandboxing: containing the VM network within its host server is a technique used in many IaaS deployments. Instead of bridging VM NICs to the physical network, they are connected to a local network only available within the host server, called a host-only network. Sandboxing prevents VMs from directly seeing the network traffic in the physical infrastructure. Access to the world outside the sandbox is enabled by the host server, or by a privileged VM on the same host, using network address and port translations, proxy ARP and routing/firewalling mechanisms.
- Packet filtering: in addition to the traditional packet filtering present in the physical infrastructure, VM-specific filtering is enforced on the hosting servers. The hosting server can easily detect anomalies in VM network traffic, such as spoofed IP and/or MAC addresses. It is common practice to inspect packets generated by VMs and check whether the source IP address and MAC address fields match the ones assigned during VM deployment. Packets without matching source addresses are dropped, effectively preventing attacks based on spoofing (a sketch of this check follows).
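The source-address check can be made concrete with a few lines of header inspection. The sketch below (in Java, for consistency with this document's other examples) assumes raw Ethernet frames are available to the hosting server's filter; the offsets follow the standard Ethernet and IPv4 layouts.

    import java.util.Arrays;

    // Sketch: per-VM source-address check of the kind enforced by a
    // hosting server; the class and its wiring are illustrative.
    public class SpoofFilter {
        private final byte[] assignedMac; // MAC given to the VM at deployment
        private final byte[] assignedIp;  // IPv4 address given at deployment

        public SpoofFilter(byte[] mac, byte[] ip) {
            this.assignedMac = mac;
            this.assignedIp = ip;
        }

        // Returns true if an outbound Ethernet frame may leave the host.
        public boolean permit(byte[] frame) {
            // Bytes 6..11: Ethernet source MAC.
            byte[] srcMac = Arrays.copyOfRange(frame, 6, 12);
            if (!Arrays.equals(srcMac, assignedMac)) return false; // MAC spoofing
            // Bytes 12..13: EtherType; 0x0800 = IPv4 (non-IP handled elsewhere).
            if (frame[12] != 0x08 || frame[13] != 0x00) return true;
            // IPv4 source address: 14-byte Ethernet header + offset 12 = byte 26.
            byte[] srcIp = Arrays.copyOfRange(frame, 26, 30);
            return Arrays.equals(srcIp, assignedIp); // drop spoofed IP sources
        }
    }

Note that this same check is what prevents a cloud VM from forwarding packets on behalf of others, an issue revisited below.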

User-level Network Virtualization in IaaS

ON processing systems can be present in all nodes participating in an ON, or only in nodes responsible for ON routing, which, in the latter case, serve as gateways to and from subsets of the nodes in the overlay. The protection mechanisms previously described affect the functionality and performance of network virtualization systems, and in some cases make them unusable.

Network address and port translations have a negative impact on network performance. For example, VMs deployed on Amazon EC2 have public Internet presence enabled by NAT. Experiments using two EC2 VMs on the same public subnet showed that sending a packet from one to the other involves communication through six intermediate nodes, increasing latency and lowering communication bandwidth. Sandboxing also makes packets go through more intermediate nodes than mechanisms that directly connect VMs to a LAN segment (e.g., bridged networking). Experiments showed three hops between two EC2 VMs on the same private subnet (one would expect no hops between a pair of VMs that appear to be connected to the same LAN segment). Sandboxes using host-only networks effectively destroy the notion of a LAN segment, bringing serious problems to systems that depend on data link layer (L2) mechanisms. For example, a VM would not be able to work as a VPN gateway in a site-to-site VPN setup, since its clients (other VMs in the cloud) are unable to reach the VPN gateway using L2 communication.

Packet filtering also limits the ability of VMs to act as routers. Routing is achieved by receiving and forwarding packets according to routing rules, implying that routers must send and receive packets whose header addresses do not match their own. Due to source address checks, cloud-deployed VMs are unable to perform routing functions, and routing is of key importance for ONs. Many ON implementations make use of additional IP addresses on participating nodes; packet filtering on VM hosts can also disable activity on these additional virtual NICs.

Enabling Sky Computing

To run distributed applications on resources across independently administered cloud infrastructures, it is essential for all nodes to be able to communicate with each other. The network-connectivity problem and its solutions have been actively studied in different contexts (e.g., grid computing, P2P networks, and opportunistic computing). As discussed previously and further elaborated in this section, existing solutions do not necessarily apply to the problem of

cloud intercommunication, making it difficult to find a solution that efficiently supports sky computing. The desirable features of a network infrastructure for sky computing include:

- Full connectivity: applications are typically designed and implemented assuming the LAN model of communication, e.g., processes can initiate connections (client processes) and/or wait for connections (server processes). Full connectivity is required among VMs in spite of the presence of connection-limiting factors in the physical infrastructure.
- Ability to run unmodified applications: exposing new network programming models causes existing applications to stop working, and modifying and adapting applications to a new model is often impossible or impractical. For example, while new P2P applications can take full advantage of P2P infrastructures, existing MPI applications require changes in the MPI library (to be P2P-compatible) and potentially in the application code to interface with the new library functions.
- Easy management: cloud computing can expose end users to virtual cluster management. Users who are not network specialists would have difficulty configuring and managing virtual networks if the process is not sufficiently automated.
- Performance: network virtualization overheads can be prohibitively high. An acceptable solution must add minimal overhead and not become the bottleneck when executing distributed applications.

Each existing ON solution was designed under different network environment assumptions, none of which anticipated the new challenges brought about by cloud computing. Table 6-1 qualitatively compares existing ON solutions that could be considered to enable sky computing. The table summarizes the discussion presented in Chapter 2 in the context of clouds, highlighting the difficulties faced by each solution in satisfying the above requirements under restricted cloud network environments. Although no single solution satisfies all the listed requirements, many projects offer features attractive for sky computing. In the following subsections, the ViNe approach is extended to appropriately support sky computing. Conceivably, the final solution to the cloud networking problems could be recast in the context of other approaches.

Table 6-1. Qualitative comparison of existing ON solutions' ability to address sky computing requirements.

API-based [20][73][103][104]:
- Connectivity: in general, no special network operations are required, and these solutions should work under cloud network restrictions.
- Unmodified applications: applications need to be modified and recompiled, which is not always possible.
- Management: a run-time environment for network operation needs to be deployed, and it is the end user's responsibility.
- Performance: minimal overhead is expected, as applications interface directly with the new libraries.

VPN [97]:
- Connectivity: VPN solutions do not offer Internet connectivity recovery and require L2 communication, which is restricted in clouds.
- Unmodified applications: applications run unmodified.
- Management: VPN configuration and operation can become too complex for non-experts.
- Performance: relatively high overhead is observed in communication through VPN tunnels.

P2P-based [54][55][56][59][105][106][107]:
- Connectivity: P2P operation is, in general, not affected by cloud network restrictions. Solutions that offer an IP interface over P2P routing suffer from cloud network restrictions, as additional IP addresses are used for ON operation.
- Unmodified applications: P2P networks require applications to adapt to P2P APIs. Solutions that interface unmodified applications to P2P infrastructures exist [55][59].
- Management: the self-organizing nature of P2P networks facilitates deployment and operation. Recent research focuses on securely controlling the membership of P2P networks [107].
- Performance: due to the deployment of P2P libraries in every node, high overhead is observed in both intra- and inter-site communication.

VNET [34]:
- Connectivity: VNET does not offer Internet connectivity recovery and depends on L2 communication, which can cause VNET not to work in cloud environments.
- Unmodified applications: applications run unmodified.
- Management: VNET requires configuration that can be complex for non-experts, and a program running on the VMs' host servers, which is not possible in cloud environments.
- Performance: high network virtualization overhead has been reported [34].

VIOLIN [40]:
- Connectivity: VIOLIN does not offer Internet connectivity recovery, and deployments may become restricted to a single site.
- Unmodified applications: applications run unmodified.
- Management: VIOLIN requires manual configuration that can be complex for non-experts, and a program running on the VMs' host servers, which is not possible in cloud environments.
- Performance: high network virtualization overhead has been reported [40].

ViNe [21]:
- Connectivity: ViNe uses L2 communication, which can cause ViNe not to work in some cloud network environments.
- Unmodified applications: applications run unmodified.
- Management: ViNe requires configuration that can be complex for non-experts.
- Performance: virtualization-overhead-free intra-site communication and low overhead for inter-site communication.

The ViNe approach is flexible enough to allow deployments where the number of VRs ranges from one VR per broadcast domain to as many VRs as participating nodes. In terms of IP addressing, it is possible either to assign nodes an additional address reserved for ViNe operation or to use the existing addresses without modification. This is an important advantage of ViNe over approaches that require virtual IP addresses when considering support for sky computing, since packets to and from virtual IP addresses can potentially be filtered by cloud providers.

The natural mode of operation of ViNe is to deploy one VR per broadcast domain. Data link layer (L2) mechanisms are used for VR-to-node communication, exactly as host-to-gateway communication normally takes place. The advantage is that virtualization overhead is incurred only on inter-domain communication, implying that nodes can enjoy the full speed of the LAN to which they are connected for intra-domain communication.

TinyViNe Middleware Architecture and Organization

A ViNe overlay consists of ViNe routers (VRs), i.e., nodes running ViNe routing software, and ViNe hosts, i.e., regular nodes without ViNe software which, in order to reach other hosts on the overlay, communicate with VRs using L2 mechanisms. There are two main obstacles to the deployment of a ViNe overlay in a cloud:

- Sandboxing disables L2-based host-to-VR communication.
- Packet filtering disables the forwarding/routing ability of VMs deployed on clouds, thus preventing them from working as VRs.

A naïve solution to these problems is to configure all nodes as VRs. In this case, host-to-VR L2 communication does not exist (all nodes are their own VRs) and VR-to-VR communication takes place using regular TCP/IP mechanisms, so the restrictions described above are circumvented. This solution brings the following undesirable side effects:

- Network virtualization software has to run in all nodes, potentially using valuable compute cycles and negatively impacting application performance.
- Network virtualization overhead is seen in both inter- and intra-site communication.
- Configuration and operation of ViNe may become too complex for non-experts to handle.

TinyViNe has been developed to address these problems. A TinyViNe overlay consists of TinyVRs (nodes running TinyViNe software, as shown in Figure 6-1) and FullVRs (similar to the regular VRs of ViNe overlays). TinyViNe essentially re-enables the host-to-VR communication lost due to cloud environment restrictions, and allows cloud-deployed VMs to participate in the overlays.

Avoiding L2 Communication

L2 communication can be blocked in cloud environments due to sandboxing. A possible (and the chosen) solution is to replace L2 communication with TCP or UDP tunnels. To implement this solution, network virtualization software is needed to intercept and tunnel the desired traffic, which is possible using the building blocks depicted in Figure 4-2. It is important for this piece of software to be lightweight, so as to minimize virtualization overhead. Also, only inter-site traffic should be intercepted, so that intra-site communication can occur at the full speed offered by resource providers. In general, virtualization overhead can be unnoticeable on inter-site communication, where data throughput is low compared to that of LAN communication. The necessary TinyVR-to-FullVR tunnel processing is as follows (a sketch of the send path appears at the end of this subsection):

- Intercept packets that would originally be sent via L2 communication. This can be accomplished by manipulating the OS routing table of the machine (using the route command) to direct traffic destined for overlay nodes to an interception module. TUN/TAP devices, which offer interfaces for user-level applications to manipulate IP packets or Ethernet frames, are commonly used to implement packet interception modules.
- Encapsulate the packets so that the original message is kept intact; a simple IP-in-IP encapsulation suffices, i.e., the intercepted packet (which includes the IP header) becomes the message (the payload of the new packet) to be transmitted.

- Transmit the encapsulated message using regular TCP or UDP mechanisms. No complicated process is necessary to determine the address to which the message should be sent: messages are always sent to a FullVR designated during the startup of the network virtualization software.
- On the receiving end of the tunnel, original messages are recovered (decapsulated). Decapsulation occurs naturally: the OS network stack is responsible for removing the additional headers, and the received message is exactly the original packet generated by the source.
- Inject the packet into the destination's TCP/IP OS stack. This is typically accomplished using raw sockets.

The described functionality can be achieved by a stripped-down version of a FullVR, called a TinyVR, as illustrated in Figure 6-1. A TinyVR implements just what is essential to enable cloud nodes to establish communication with FullVRs.

Figure 6-1. TinyVR: a stripped-down version of a FullVR. It is responsible for intercepting packets that require ViNe routing. The intercepted packets are forwarded to FullVRs using TCP/IP connections. For delivery, FullVRs use TCP/IP connections to forward packets to the TinyVR, where packets are decapsulated and injected for final delivery.

FullVRs are the same as the VRs originally designed and implemented in ViNe, with modifications to accept connections from TinyVRs. TinyVRs are configured to open connection channels (TCP or UDP) to a FullVR (typically, but not necessarily, on the same LAN segment), through which traffic that requires ViNe routing is transmitted. To avoid unnecessary virtual network processing, no encryption, compression or overlay routing is performed on TinyVRs. Note that nodes connected to the same LAN segment can take full advantage of LAN performance.
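A minimal sketch of the TinyVR send path follows. It assumes a native wrapper has already created the TUN interface, pointed the OS routing table at it, and handed the open device to the Java process as its standard input; the FullVR address and port are startup parameters, and the receive/injection path, which needs raw sockets, is omitted.

    import java.io.InputStream;
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    // Sketch: TinyVR send path; TUN setup is assumed done by a wrapper
    // that passes the configured device as stdin.
    public class TinyVrSender {
        public static void main(String[] args) throws Exception {
            InetAddress fullVr = InetAddress.getByName(args[0]); // designated FullVR
            int port = Integer.parseInt(args[1]);
            DatagramSocket tunnel = new DatagramSocket();
            InputStream tun = System.in; // assumed: one read returns one IP packet
            byte[] buf = new byte[65535];
            int n;
            while ((n = tun.read(buf)) > 0) {
                // IP-in-UDP: the intercepted packet, IP header included,
                // becomes the payload of the tunnel datagram.
                tunnel.send(new DatagramPacket(buf, n, fullVr, port));
            }
        }
    }

No routing decision is made here: everything that reaches the TUN device goes to the designated FullVR, which performs the actual overlay routing, encryption, and compression.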

TinyVR-to-FullVR communication occurs with relatively low overhead (TCP/IP stack and encapsulation overhead of less than 5% in terms of bandwidth and around 400 microseconds of latency using Intel Core 2-based servers on gigabit Ethernet), considering that this overhead is incurred only on communication crossing LAN boundaries.

Avoiding Packet Filtering

The IP addresses of VMs deployed in the cloud need to remain the ones assigned by the resource provider. Many solutions that depend on network virtualization software running on all participating nodes require an additional IP address (called a virtual IP) for virtual network operation. The reason is that the physical IP address needs to be used both for regular communication and for tunneling of virtual network traffic: if only one IP address is used, there is the challenge of disambiguating which traffic belongs to the regular Internet and which traffic belongs to virtual networks. TinyVRs do not need to communicate directly with other nodes (TinyVRs), since tunneling ViNe traffic to designated FullVRs is sufficient to have packets delivered to their destinations. FullVRs are fully responsible for finding the appropriate paths on the ViNe overlay. Communication among TinyVRs only occurs through FullVR tunnels, and direct IP packet delivery among TinyVRs never takes place, effectively avoiding the ambiguity problem. TinyVRs and FullVRs thus both conform to source address checking filters, and overlay networking can operate smoothly on the cloud.

TinyViNe Overlay Setup and Management

TinyVRs delegate time-consuming virtual network processing that requires complex configuration to FullVRs. When starting TinyViNe software on a cloud node (in essence, TinyViNe-enabling the node by making it a TinyVR), the main configuration parameter necessary

is the address of one FullVR; this is significantly simpler than the parameters necessary to configure a FullVR. The TinyVR configuration file needed to TinyViNe-enable a node can be generated automatically if the IP address of the node is known. Given an IP address, it is possible to identify the cloud environment the node belongs to and to determine the best FullVR for the node. This enables the TinyViNe auto-configuration capability to be implemented in the form of an intelligent HTTP-based download server, which follows the steps below (a sketch of such a server follows this list):

- Each VM to be TinyViNe-enabled uses regular HTTP mechanisms to request the TinyViNe software from the download server (via scripting or user interaction through a browser, or command line interfaces such as wget or curl).
- The download server invokes a PHP script that extracts the IP address of the client and verifies that the detected IP address is within valid address ranges, i.e., belongs to recognized cloud providers.
- The PHP script then generates the necessary TinyViNe configuration, which includes the selection of an appropriate FullVR, and a shell script with the necessary commands to start the TinyViNe code on the client.
- Finally, all the necessary files are packaged and returned to the client.

TinyViNe can be enabled on VMs by simply downloading the TinyViNe package from the download server and executing the bootstrap shell script included in the downloaded package. No special skills or configuration interventions are required from end users.
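The download-server logic is small enough to sketch in full. The version below substitutes the JDK's built-in HTTP server for the PHP script described above; the address ranges, FullVR choices, and returning only the generated configuration line (rather than the whole software package) are illustrative simplifications.

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    // Sketch: intelligent download server that maps a client's IP address
    // to the best FullVR; all addresses here are hypothetical.
    public class TinyViNeDownloadServer {
        static String pickFullVr(String clientIp) {
            if (clientIp.startsWith("10.1.")) return "10.1.0.2"; // FullVR inside cloud A
            if (clientIp.startsWith("10.2.")) return "10.2.0.2"; // FullVR inside cloud B
            return null; // not a recognized cloud provider
        }

        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/tinyvine", (HttpExchange ex) -> {
                String ip = ex.getRemoteAddress().getAddress().getHostAddress();
                String fullVr = pickFullVr(ip);
                if (fullVr == null) { ex.sendResponseHeaders(403, -1); return; }
                // The real deployment returns a package (code + bootstrap
                // script); only the generated configuration is shown here.
                byte[] body = ("FULLVR=" + fullVr + "\n").getBytes(StandardCharsets.UTF_8);
                ex.sendResponseHeaders(200, body.length);
                try (OutputStream os = ex.getResponseBody()) { os.write(body); }
            });
            server.start();
        }
    }

On the client side, bootstrapping then reduces to fetching the package (e.g., with wget or curl) and executing the returned script.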

Figure 6-2 illustrates an experimental TinyViNe deployment. First, three VMs were manually configured to work as FullVRs. The vision is to have VRs deployed by infrastructure providers while TinyVRs remain under end users' control. The reasoning is that (1) infrastructure providers can fine-tune and manually customize FullVRs (an unnecessary management burden for end users); (2) FullVRs need to be set up only once, since they support multiple independent virtual networks; (3) better resource utilization is achieved compared to having different VRs deployed for or by each and every user; and (4) in ViNe-enabled providers (i.e., those with a FullVR directly connected to their LAN), TinyVRs have access to FullVRs with low latency and high bandwidth. This one-time effort should not be a management bottleneck for the TinyViNe deployment. Once the FullVRs are available and the TinyViNe download server has been deployed, TinyViNe-enabling a node is trivial: as previously discussed, end users (or VM scripts on their behalf) need only download the TinyViNe package from the download server and execute the script received with the package.

Figure 6-2. TinyViNe deployment. The ViNe infrastructure addresses connectivity limitations of the physical infrastructure, while TinyViNe enables the participation of cloud nodes in ViNe overlays. Although ViNe routers can be instantiated by end users (in the form of additional virtual machines), the vision is that ViNe will be deployed and managed by infrastructure providers, since only a one-time effort by Information Technology (IT) staff is needed to support independent virtual networks for different users. TinyViNe deployment is trivial: as simple as including in the startup script of each VM the instructions to download and run the TinyViNe code.

Evaluation

In clouds that offer compute cycles as Infrastructure-as-a-Service, users are exposed to APIs to programmatically control the creation and operation of virtual execution environments. Typically, interfaces are provided to upload VM images, manipulate the configuration of VMs


More information

PassTorrent. Pass your actual test with our latest and valid practice torrent at once

PassTorrent.   Pass your actual test with our latest and valid practice torrent at once PassTorrent http://www.passtorrent.com Pass your actual test with our latest and valid practice torrent at once Exam : 352-011 Title : Cisco Certified Design Expert Practical Exam Vendor : Cisco Version

More information

Chapter 9. Firewalls

Chapter 9. Firewalls Chapter 9 Firewalls The Need For Firewalls Internet connectivity is essential Effective means of protecting LANs Inserted between the premises network and the Internet to establish a controlled link however

More information

Centralization of Network using Openflow Protocol

Centralization of Network using Openflow Protocol Indian Journal of Science and Technology, Vol 8(S2), 165 170, January 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 DOI : 10.17485/ijst/2015/v8iS2/61217 Centralization of Network using Openflow

More information

AN ENGINEER S GUIDE TO TMoIP

AN ENGINEER S GUIDE TO TMoIP AN ENGINEER S GUIDE TO TMoIP Richard W. Hoffman III GDP Space Systems ABSTRACT As telemetry transport systems move inexorably closer to a unified telemetry-over-ip approach, the operators and engineers

More information

Application Note. Providing Secure Remote Access to Industrial Control Systems Using McAfee Firewall Enterprise (Sidewinder )

Application Note. Providing Secure Remote Access to Industrial Control Systems Using McAfee Firewall Enterprise (Sidewinder ) Application Note Providing Secure Remote Access to Industrial Control Systems Using McAfee Firewall Enterprise (Sidewinder ) This document describes how to configure McAfee Firewall Enterprise to provide

More information

Resource Virtualization and the Enterprise

Resource Virtualization and the Enterprise Resource Virtualization and the Enterprise Syllabus Web Page http://www.cs.northwestern.edu/~pdinda/virt-mitp Instructor Peter A. Dinda Technological Institute, Room L463 847-467-7859 pdinda@northwestern.edu

More information

Designing an Exchange 2000/2003 Routing Group Connector Topology

Designing an Exchange 2000/2003 Routing Group Connector Topology Pg. 1 Designing an Exchange 2000/2003 Routing Group Connector Topology By: Craig Borysowich Chief Technology Architect Imagination Edge Inc. www.imedge.net Version 3.7 BACKGROUND Large Exchange 5.5 environments

More information

DRAM and Storage-Class Memory (SCM) Overview

DRAM and Storage-Class Memory (SCM) Overview Page 1 of 7 DRAM and Storage-Class Memory (SCM) Overview Introduction/Motivation Looking forward, volatile and non-volatile memory will play a much greater role in future infrastructure solutions. Figure

More information

Technology Overview. Gallery SIENNA London, England T

Technology Overview. Gallery SIENNA London, England T Technology Overview Gallery SIENNA London, England T +44 208 340 5677 sales@sienna.tv www.sienna.tv http://ndi.newtek.com SIENNA Cloud for NDI An IP Video Protocol which works today NDI Protocol The NDI

More information

Hillstone IPSec VPN Solution

Hillstone IPSec VPN Solution 1. Introduction With the explosion of Internet, more and more companies move their network infrastructure from private lease line to internet. Internet provides a significant cost advantage over private

More information

José Fortes. Advanced Computing and Information Systems laboratory. and NSF Center for Autonomic Computing. HPC 2010 Cetraro

José Fortes. Advanced Computing and Information Systems laboratory. and NSF Center for Autonomic Computing. HPC 2010 Cetraro Crosscloud Computing José Fortes Advanced Computing and Information Systems Lab and HPC 2010 Cetraro Clouds Provider view Economies of scale Statistical ti ti multiplexing l i Avoid customer-specific complexities

More information

Cisco Data Center Network Manager 5.1

Cisco Data Center Network Manager 5.1 Cisco Data Center Network Manager 5.1 Product Overview Modern data centers are becoming increasingly large and complex. New technology architectures such as cloud computing and virtualization are adding

More information

Control-M and Payment Card Industry Data Security Standard (PCI DSS)

Control-M and Payment Card Industry Data Security Standard (PCI DSS) Control-M and Payment Card Industry Data Security Standard (PCI DSS) White paper PAGE 1 OF 16 Copyright BMC Software, Inc. 2016 Contents Introduction...3 The Need...3 PCI DSS Related to Control-M...4 Control-M

More information

"Charting the Course... Interconnecting Cisco Networking Devices Accelerated 3.0 (CCNAX) Course Summary

Charting the Course... Interconnecting Cisco Networking Devices Accelerated 3.0 (CCNAX) Course Summary Description Course Summary The Cisco CCNA curriculum includes a third course, Interconnecting Cisco Networking Devices: Accelerated (CCNAX), consisting of Interconnecting Cisco Networking Devices, Part

More information

CompTIA Network+ Study Guide Table of Contents

CompTIA Network+ Study Guide Table of Contents CompTIA Network+ Study Guide Table of Contents Course Introduction Table of Contents Getting Started About This Course About CompTIA Certifications Module 1 / Local Area Networks Module 1 / Unit 1 Topologies

More information

Introduction to Computer Networks. CS 166: Introduction to Computer Systems Security

Introduction to Computer Networks. CS 166: Introduction to Computer Systems Security Introduction to Computer Networks CS 166: Introduction to Computer Systems Security Network Communication Communication in modern networks is characterized by the following fundamental principles Packet

More information

Introduction to Mobile Ad hoc Networks (MANETs)

Introduction to Mobile Ad hoc Networks (MANETs) Introduction to Mobile Ad hoc Networks (MANETs) 1 Overview of Ad hoc Network Communication between various devices makes it possible to provide unique and innovative services. Although this inter-device

More information

Avaya Port Matrix: Avaya Aura Appliance Virtualization Platform 7.0

Avaya Port Matrix: Avaya Aura Appliance Virtualization Platform 7.0 Avaya Port Matrix: Avaya Aura Appliance Virtualization Platform 7.0 Issue 1.0 August 24, 2015 August 2015 Avaya Port Matrix: Avaya Aura Appliance Virtualization Platform 7.0 1 ALL INFORMATION IS BELIEVED

More information

On Distributed Communications, Rand Report RM-3420-PR, Paul Baran, August 1964

On Distributed Communications, Rand Report RM-3420-PR, Paul Baran, August 1964 The requirements for a future all-digital-data distributed network which provides common user service for a wide range of users having different requirements is considered. The use of a standard format

More information

Network Configuration Example

Network Configuration Example Network Configuration Example Configuring Dual-Stack Lite for IPv6 Access Release NCE0025 Modified: 2016-10-12 Juniper Networks, Inc. 1133 Innovation Way Sunnyvale, California 94089 USA 408-745-2000 www.juniper.net

More information

Network+ Guide to Networks 6 th Edition

Network+ Guide to Networks 6 th Edition Network+ Guide to Networks 6 th Edition Chapter 10 Virtual Networks and Remote Access Objectives 1. Explain virtualization and identify characteristics of virtual network components 2. Create and configure

More information

- Hubs vs. Switches vs. Routers -

- Hubs vs. Switches vs. Routers - 1 Layered Communication - Hubs vs. Switches vs. Routers - Network communication models are generally organized into layers. The OSI model specifically consists of seven layers, with each layer representing

More information

Higher scalability to address more Layer 2 segments: up to 16 million VXLAN segments.

Higher scalability to address more Layer 2 segments: up to 16 million VXLAN segments. This chapter tells how to configure Virtual extensible LAN (VXLAN) interfaces. VXLANs act as Layer 2 virtual networks over Layer 3 physical networks to stretch Layer 2 networks. About VXLAN Encapsulation

More information

Deliver Office 365 Without Compromise Ensure successful deployment and ongoing manageability of Office 365 and other SaaS apps

Deliver Office 365 Without Compromise Ensure successful deployment and ongoing manageability of Office 365 and other SaaS apps Use Case Brief Deliver Office 365 Without Compromise Ensure successful deployment and ongoing manageability of Office 365 and other SaaS apps Overview Cloud-hosted collaboration and productivity suites

More information

Achieving End-to-End Security in the Internet of Things (IoT)

Achieving End-to-End Security in the Internet of Things (IoT) Achieving End-to-End Security in the Internet of Things (IoT) Optimize Your IoT Services with Carrier-Grade Cellular IoT June 2016 Achieving End-to-End Security in the Internet of Things (IoT) Table of

More information

ACS-3921/ Computer Security And Privacy. Chapter 9 Firewalls and Intrusion Prevention Systems

ACS-3921/ Computer Security And Privacy. Chapter 9 Firewalls and Intrusion Prevention Systems ACS-3921/4921-001 Computer Security And Privacy Chapter 9 Firewalls and Intrusion Prevention Systems ACS-3921/4921-001 Slides Used In The Course A note on the use of these slides: These slides has been

More information

3. What could you use if you wanted to reduce unnecessary broadcast, multicast, and flooded unicast packets?

3. What could you use if you wanted to reduce unnecessary broadcast, multicast, and flooded unicast packets? Nguyen The Nhat - Take Exam Exam questions Time remaining: 00: 00: 51 1. Which command will give the user TECH privileged-mode access after authentication with the server? username name privilege level

More information

Integrating WX WAN Optimization with Netscreen Firewall/VPN

Integrating WX WAN Optimization with Netscreen Firewall/VPN Application Note Integrating WX WAN Optimization with Netscreen Firewall/VPN Joint Solution for Firewall/VPN and WX Platforms Alan Sardella Portfolio Marketing Choh Mun Kok and Jaymin Patel Lab Configuration

More information

Cisco Application Centric Infrastructure (ACI) - Endpoint Groups (EPG) Usage and Design

Cisco Application Centric Infrastructure (ACI) - Endpoint Groups (EPG) Usage and Design White Paper Cisco Application Centric Infrastructure (ACI) - Endpoint Groups (EPG) Usage and Design Emerging IT technologies have brought about a shift from IT as a cost center to IT as a business driver.

More information

Network Requirements

Network Requirements GETTING STARTED GUIDE l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

More information