Designing a distributed NFV based LTE-EPC

Project Dissertation - I

submitted in partial fulfillment of the requirements for the degree of Master of Technology

by

Pratik Satapathy (153050036)

Under the guidance of Prof. Mythili Vutukuru

Indian Institute of Technology Bombay
Mumbai 400076 (India)

17 October 2016

Abstract

A surge in smartphone usage and the rise of cloud computing have increased network bandwidth consumption manyfold, leading to a significant rise in load on the LTE packet core infrastructure. This drives the need to move the LTE EPC core to a virtualized platform using Network Function Virtualization (NFV) [10], where scaling and innovation are easier. NFV offers flexibility in deployment and a significant reduction in cost. To design such a scalable system, we begin by analyzing NFV-LTE-EPC v1.0 [7], an open-source monolithic NFV-based prototype of the LTE EPC. Several well-known strategies exist for scaling network applications in NFV; we analyze their feasibility and relevance for the monolithic implementation. With a preliminary design at hand, we examine the implementation options and the challenges they pose, and take implementation decisions based on their suitability to the existing system and on various performance parameters. We present NFV-LTE-EPC v2.0, a distributed redesign of NFV-LTE-EPC v1.0 [7] that provides a reliable and scalable implementation, and evaluate its control-plane and data-plane scaling performance.

Contents

1 INTRODUCTION
2 BACKGROUND
  2.1 LTE
  2.2 LTE EPC Procedures
  2.3 Network Function Virtualization
  2.4 NFV-LTE-EPC v1.0: An NFV based monolithic prototype for LTE EPC
  2.5 Scaling a system
  2.6 Related work
  2.7 Contribution
3 DISTRIBUTED NFV BASED LTE EPC
  3.1 Design
    3.1.1 Design of overall system
    3.1.2 Design of load-balancers
    3.1.3 Design of shared data-store
  3.2 Implementation
    3.2.1 Load balancing
    3.2.2 State separation
4 EVALUATION
  4.1 Experimental setup
  4.2 Control plane evaluation
  4.3 Data plane evaluation
  4.4 Evaluation comparison on different key value stores
5 CONCLUSION AND FUTURE WORK
References

List of Figures

2.1 LTE architecture with EPC and RAN
2.2 UE state diagram for LTE EPC procedures
2.3 High-level NFV concept diagram
2.4 LTE EPC architecture followed in NFV-LTE-EPC v1.0 [7]
3.1 Distributed architecture of a single EPC component (MME / SGW / PGW)
3.2 The complete distributed LTE EPC architecture for a two-worker system
3.3 A typical LVS-DR setup
3.4 UE assignment to fixed UDP clients at SGW
3.5 UE data packet flow illustration
4.1 Control plane throughput scaling
4.2 CPU utilization of SGW component
4.3 CPU utilization of MME component
4.4 Latency
4.5 Data plane throughput scaling
4.6 SGW CPU utilization
4.7 Comparison of control plane throughput across various data stores
4.8 Comparison of control plane latency across various data stores
4.9 Comparison of control plane scalability across various data stores

Chapter 1 INTRODUCTION

The proliferation of smartphones has driven an unprecedented growth in traffic in recent years, and this bandwidth surge requires prompt capacity expansion. Factors such as always-on connectivity, the rise of cloud computing and the growth of IoT devices (projected at 26 billion by 2020) [1] also increase the amount of control signaling in mobile networks. In some US markets, peak network usage as high as 45 connection requests per device per hour has been recorded [1]. In LTE, this increased signaling traffic has a significant impact on the performance of the EPC core.

Current LTE EPC systems are built on specialized hardware. To scale such a system, the older equipment needs to be discarded and replaced with a higher-capacity system, which requires significant investment. Therefore, we focus on providing a Network Function Virtualization (NFV) based solution for the LTE EPC. NFV-based applications are software implementations of network functions that can be deployed on commodity, high-volume, general-purpose hardware, which can easily be re-purposed for other NFV applications. Sadagopan N S [7] provides a monolithic, NFV-based prototype for the LTE EPC that takes advantage of commodity high-volume servers to provide performance comparable to specialized hardware implementations. However, the monolithic architecture has a significant disadvantage: the EPC components are stateful. When one of the components goes offline due to failure or maintenance, the saved subscriber state is lost, forcing the subscribers to reconnect, and this re-connection overhead adds to the total load on the system.

We present NFV-LTE-EPC v2.0, a distributed redesign of NFV-LTE-EPC v1.0 [7], to make it reliable and horizontally scalable.

The stateless distributed design for the LTE EPC overcomes the state-loss problem by keeping state in a reliable data store outside the component. This state can also be shared with sibling servers so that they can take over in case the original server fails. We also place a load balancer in front of each component to distribute the incoming load across its back-end servers. This design provides a single point of interfacing with the other components and allows more back-end servers to be added for horizontal scalability. We evaluate our system on control-plane and data-plane performance and demonstrate near-linear scaling: throughput increases as computing resources and incoming load are increased.

The rest of the thesis is structured as follows. Chapter 2 provides a brief background on the LTE EPC architecture, LTE EPC procedures and NFV technology, briefly discusses the existing NFV-based monolithic LTE EPC system by Sadagopan N S [7], analyzes different scaling strategies, and discusses related work on distributed architectures for the LTE EPC along with the relevance of our work. Chapter 3 presents the detailed design of our distributed system, the rationale for several design choices, and the implementation of each component and of the overall system. Chapter 4 presents an evaluation of the system in terms of control-plane performance, data-plane performance and scaling efficiency. Chapter 5 concludes and suggests future work.

Chapter 2 BACKGROUND

In the following subsections we provide background on the architecture and operation of the LTE EPC and on NFV.

2.1 LTE

LTE (Long Term Evolution) is the standard for modern 4G cellular networking. Figure 2.1 shows the LTE EPC architecture.

Figure 2.1: LTE architecture with EPC and RAN
Figure 2.2: UE state diagram for LTE EPC procedures

We present a brief background on the LTE EPC architecture and the functions of its various components. As shown in Figure 2.1, 4G LTE networks consist of two parts: the RAN (Radio Access Network) and the EPC (Evolved Packet Core). The RAN consists of eNodeBs, the mobile tower infrastructure responsible for serving users wirelessly.

The EPC manages the control signaling necessary to manage devices and route data. The EPC is divided into control-plane and data-plane elements. The MME (Mobility Management Entity), HSS (Home Subscriber Server) and PCRF (Policy and Charging Rules Function) are control-plane elements; the S-GW (Serving Gateway) and P-GW (Packet Data Network Gateway) are data-plane elements. The MME handles all control signaling operations and is therefore the most stressed entity. It maintains connections to eNodeBs through the S1AP interface, to the HSS through the S6 interface, and to the Serving Gateway through the S11 interface, as per 3GPP standards. The HSS serves as the database for UE (User Equipment) authentication information, and the PCRF serves as the database for UE charging and policy related information. The S-GW and P-GW are routers that handle data traffic.

2.2 LTE EPC Procedures

When a device enters the area covered by an eNodeB, it registers with the network and thereafter operates in two modes: Active and Idle. Figure 2.2 shows a state diagram of the UE state changes. We briefly go through the procedures.

(1) Attach/Reattach: When a device powers on, or wants to send a packet while in Idle mode, it sends an attach or service request. It is authenticated by the MME, establishes a session with the S-GW, and goes into Active mode.
(2) TAU (Tracking Area Update): When in Idle mode, the UE sends periodic location updates to the MME.
(3) Paging: While a device is in Idle mode, a downlink packet destined for it triggers a paging request to the device.
(4) Handover: A device tears down its connection with the old eNodeB and sets up a connection with a new eNodeB.
(5) Detach: A device tears down the tunnels set up at the SGW and PGW.

2.3 Network Function Virtualization

Virtualized network functions (VNFs) are software implementations of network functions that can be deployed on a network function virtualization infrastructure (NFVI). The NFVI is the totality of all hardware and software components that build the environment in which VNFs [10] are deployed. Figure 2.3 describes a general NFVI set-up. In the following paragraphs, we discuss the motivation behind choosing NFV and some of its pros and cons.

Figure 2.3: High-level NFV concept diagram

Specialized hardware has failed to follow Moore's law, whereas general-purpose commodity hardware has shown a sustained increase in computing performance. This encourages us to bundle commodity hardware with software network functions.

An NFV set-up is vendor independent, so we can mix hardware vendors to mitigate vendor-specific failures. In NFV, the spare headroom of one network application is readily available for other applications to use. A diverse set of specialized support teams is not required, as the hardware is far more generic than specialized appliances. NFV accelerates innovation: a new network protocol can be tested as soon as it is implemented in software. NFV is easily scalable: adding a blade to a blade server can scale the system with minimal interference to the network applications. NFV on general-purpose hardware may not match the performance of specialized hardware, but with recent research involving Intel DPDK and integrated memory controllers, and the rapid increase in processing power, it can outweigh the benefits of specialized hardware.

2.4 NFV-LTE-EPC v1.0: An NFV based monolithic prototype for LTE EPC

NFV-LTE-EPC v1.0 [7] is an NFV-based monolithic prototype by Sadagopan N S, developed by previous researchers in our research group at IIT Bombay. It follows the NFV principle of implementing each component of the LTE EPC in software as a virtualized network function. It is 3GPP compatible and supports basic LTE EPC procedures such as Attach and Detach in the control plane and uplink and downlink packet forwarding in the data plane. Figure 2.4 presents its architecture; the thesis of Sadagopan N S [7] discusses the design and implementation of the VNFs in detail. Because it is a monolithic implementation, it cannot run multiple nodes per component to achieve horizontal scalability. The system also does not provide reliability: all the state stored in the MME, SGW and PGW is lost when the corresponding node goes offline due to a failure.

Figure 2.4: LTE EPC architecture followed in NFV-LTE-EPC v1.0 [7]

Therefore, as an enhancement to NFV-LTE-EPC v1.0 [7], we propose NFV-LTE-EPC v2.0, which provides both reliability and scalability. The details are presented in chapter 3.

2.5 Scaling a system

Scaling is the capability of a system to withstand a growing load when more resources are added to it. There are two primary approaches to scaling.

(1) Vertical scaling: upgrading or replacing the older system with a higher-capacity one. This type of scaling is limited by the maximum upgrade capacity of a system, after which the whole system needs to be replaced.

(2) Horizontal scaling: adding more, smaller commodity computing systems to increase the capacity of the overall system. Horizontal scaling is more cost effective, but it requires redesigning the running application to suit a distributed environment.

The most common approach to horizontal scaling is to place a load balancer at the front end as a single point of contact, with a number of worker replicas behind it. These workers can be stateless or stateful. Likewise, the load balancer can be stateful, storing a mapping to support sticky sessions, or stateless, performing round-robin load distribution. In some load-balancing scenarios, persistent or non-persistent state needs to be shared between replicas.

In the context of the LTE EPC, horizontal scaling is the appropriate solution, as mobile traffic grows day by day and only horizontal scaling can keep up with the ever-increasing load. A typical LTE EPC deployment has three elements, the MME, SGW and PGW, that need continual scaling. In a typical LTE EPC procedure such as Attach/Detach, each of the MME, SGW and PGW stores some state for a particular UE. This saved state is used again by that component when another request arrives from the same UE. The state contains UE-related context, most importantly tunnel header information for control-plane and data-plane operation, NAS security keys, and various other 3GPP standard parameters. To distribute such a system, we must take care of the consistency of the saved state across replicas and of the associated latency trade-offs. Load balancing can be done in a number of ways, described below.

1. State partition: This approach partitions UEs across the different replicas of an EPC component. For example, a group of UEs is statically allocated to an MME/SGW/PGW worker, and all future requests from the same UE must reach that replica. The mapping of UE to MME/SGW/PGW replica has to be stored at the load balancer so that every request from a UE always reaches the same replica; the mapping can be from IMSI to node identifier, or it can be expressed as a hash function. In this approach there is no need to share state between replicas, but all the state held by a replica is lost if it goes offline.

2. Session split: This approach introduces shared reliable data storage that is synchronized with the MME/SGW/PGW replicas only when an LTE procedure completes. The load balancers are configured to send every packet of a single procedure for a particular UE to a fixed replica. At the beginning of the next procedure, the synchronized state is loaded back into a replica so that it can proceed with that procedure.

3. Stateless load balancing: In this approach the load balancer can send a request to any replica of a component (MME/SGW/PGW), and a replica synchronizes its saved state with the data store after processing every message. The load balancer does not have to keep any state, but frequent synchronization with the data store increases the latency of an LTE procedure.

In section 2.6 we analyze some related works that follow one of the above approaches.

2.6 Related work

NFV and distributed system implementations of the Evolved Packet Core have been an interesting and active area of research. We discuss some of the related work on distributing the EPC. SCALE [2] introduces a distributed EPC that follows the first approach discussed above. Each UE is mapped to a particular MME instance; a back-up instance is also present for each MME, and the UEs mapped to the primary are synchronized to the back-up instance. The mapping is maintained by consistent hashing at the load balancer, and the state present in a replica is not available to any other node except its back-up.

When both the back-up and the primary fail, the state is lost completely. Takano et al. [9] and Premsankar [6] propose solutions that combine stateless and stateful load balancing. When a UE requests an Attach, it may be allocated to any of the MME replicas; once the Attach succeeds, the UE is statically bound to that replica. A mapping of UE to MME replica is maintained at the load balancer to enforce this static binding, and the saved UE context is synchronized to a separate data store to allow recovery in case of failure. These designs concentrate on distributing only the MME, whereas the SGW and PGW can also become bottlenecks as the MME scales up.

2.7 Contribution

We present NFV-LTE-EPC v2.0, which specifically eliminates the problem of losing state information in all the EPC components, i.e. MME, SGW and PGW, by saving their state in a reliable key-value store, unlike [9] and [6]. The saved state can be retrieved by another replica to continue operation. We present a scalable design for the MME, SGW and PGW by placing multiple back-end workers behind a load balancer. Our design facilitates round-robin distribution of load across replicas, and the mapping of UE to worker only has to be kept until the end of the current procedure. We develop a prototype to demonstrate the advantages of our architecture, discuss the design choices made in each component, and evaluate the performance and scalability of the prototype.

Chapter 3 DISTRIBUTED NFV BASED LTE EPC

In this chapter we present the design considerations and implementation details of the distributed LTE EPC solution, NFV-LTE-EPC v2.0. We discussed various scaling approaches and state-partitioning strategies in section 2.5. In our design we choose the session-split approach, as it offers a balanced trade-off between state partition (the first approach discussed in section 2.5) and stateless load balancing (the third approach). With this approach we get the benefit of synchronizing state only after a procedure completes, reducing the overall latency of state synchronization, and we do not lose UE state if a replica fails after a procedure has completed, as the state is stored outside the replicas in a reliable data store. We describe the design and implementation of the complete system in detail in the following sections.

3.1 Design

In this section we describe the overall design of our distributed architecture and then discuss the design of each component.

3.1.1 Design of overall system

To transform the monolithic design into a scalable one, we need to replace the monolithic EPC components with distributed versions. Figure 3.1 shows a clustered version of an EPC element. In this design, the components MME, SGW and PGW are each replaced by a cluster consisting of a load balancer (LB), a shared data store (DS) and a number of back-end worker servers.

Figure 3.1: Distributed architecture of a single EPC component (MME / SGW / PGW)

The overall architecture of the scalable EPC, with distributed clusters in place of the monolithic components, is shown in Figure 3.2. The control-plane path is highlighted in red and the data-plane path in black, showing how each cluster interfaces with the other components of the EPC system.

Figure 3.2: The complete distributed LTE EPC architecture for a two-worker system

In the following sections we discuss the design of each component.

3.1.2 Design of load-balancers

Each cluster (MME/SGW/PGW) contains a load balancer as its front-end element, acting as the interface to the other EPC components. The primary purpose of the load balancer is to distribute incoming traffic to the worker nodes. A single LTE procedure contains many request/response pairs, which we call sub-procedures. The LTE EPC protocol requires that each sub-procedure of a particular procedure, e.g. Attach, has access to the previously stored state. One way to achieve this is to synchronize the state to the data store after every sub-procedure; as this has a performance cost, we instead chose a design in which all sub-procedures of a particular procedure, e.g. Attach, are directed to the same back-end server in the cluster.

Choice of load balancer: Load balancers fall into two major classes: (1) layer-4 load balancing, which balances using transport-layer (layer 4) information, and (2) layer-7 load balancing, which balances on the basis of application-layer protocols and message contents. Our requirements can be met with layer-4 load balancing, and since a layer-4 load balancer is faster than a layer-7 one, we chose the former for our system. The required layer-4 load balancing can be achieved by (1) iptables-rule-based load balancing or (2) Linux Virtual Server (LVS) load balancing. We chose LVS because it comes with pre-configured load-balancing algorithms and is already part of the stable Linux kernel; new load-balancing algorithms can be added easily by inserting new kernel modules.

The LVS load balancer operates in three modes:
(1) LVS-NAT: works by source-NATing and destination-NATing each incoming and outgoing packet.
(2) LVS-TUN: works on the principle of IP-in-IP tunneling.
(3) LVS-DR: a direct-return method of load balancing in which incoming traffic is distributed to the workers via the load balancer but replies return directly to the client, bypassing the load balancer.

LVS-DR provides better performance and a lower chance of overloading the load balancer, so we chose LVS-DR as our configuration. Its limitation is that the back-end servers must be on the same LAN segment as the load-balancer node.

How LVS-DR works:

Figure 3.3: A typical LVS-DR setup

Figure 3.3 shows a typical LVS-DR setup. The setup consists of a load balancer, also called the director, and a set of back-end servers. The director serves as the interface to incoming clients. It uses an IP address other than its real IP, known as the VIP (Virtual IP), which is exposed to clients. Load-balancing rules and the scheduling algorithm are configured on the director using the IPVS administration tool. Once a packet reaches the director, it is routed to one of the back-end servers via MAC-based forwarding. As the packet still carries the VIP as its destination address rather than the back-end server's address, it has to be redirected to the local host at the back-end server using iptables. The source address of the packet is unchanged, so the reply travels directly back to the client, bypassing the director.
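The scheduling just described is performed inside the kernel by IPVS and configured on the director with the ipvsadm tool; the short Python sketch below is only an illustrative model of that decision logic (round robin for new connections, sticky per 5-tuple thereafter). The class names, server addresses and hashing choice are our own illustration and not part of the EPC code.

# Illustrative sketch (not the actual kernel IPVS code): models the
# connection-sticky round-robin scheduling performed by the LVS director.
# Real deployments configure this behaviour with ipvsadm; the names below
# (Director, FiveTuple) are hypothetical.

import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class FiveTuple:
    proto: str      # e.g. "sctp" or "udp"
    src_ip: str
    src_port: int
    dst_ip: str     # the VIP exposed to clients
    dst_port: int

class Director:
    """Round robin across real servers, sticky per connection (5-tuple)."""

    def __init__(self, real_servers):
        self.real_servers = list(real_servers)   # back-end worker addresses
        self.next_index = 0
        self.conn_table = {}                     # 5-tuple hash -> real server

    def _conn_hash(self, t: FiveTuple) -> str:
        key = f"{t.proto}|{t.src_ip}|{t.src_port}|{t.dst_ip}|{t.dst_port}"
        return hashlib.sha1(key.encode()).hexdigest()

    def schedule(self, t: FiveTuple) -> str:
        """Return the real server that should receive this packet."""
        h = self._conn_hash(t)
        if h not in self.conn_table:             # new connection: round robin
            self.conn_table[h] = self.real_servers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.real_servers)
        return self.conn_table[h]                # existing connection: sticky

# Example: two MME workers behind the VIP. Packets of one SCTP connection
# always land on the same worker, while new connections alternate.
if __name__ == "__main__":
    d = Director(["10.0.0.11", "10.0.0.12"])
    c1 = FiveTuple("sctp", "10.0.1.5", 36412, "10.0.0.100", 36412)
    c2 = FiveTuple("sctp", "10.0.1.6", 36412, "10.0.0.100", 36412)
    assert d.schedule(c1) == d.schedule(c1)      # sticky
    print(d.schedule(c1), d.schedule(c2))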

3.1.3 Design of shared data-store

According to our load-balancing strategy, the back-end servers have to be stateless. This can be achieved by keeping all state data in a location shared by all the back-end servers of the cluster; a shared, reliable key-value store fits this requirement. Since state sharing is needed only between the back-end servers of a particular cluster, it is better to have a separate key-value store for each cluster, i.e. the MME cluster, the SGW cluster and the PGW cluster. The HSS, on the other hand, is a single entity with certain persistent state. HSS operation does not require full DBMS features, so the MySQL storage for the HSS was replaced with a key-value store residing on the same system. Scalable key-value stores each have their own API for operations such as saving and retrieving data; to establish a common interface to these stores, a client library providing a uniform API to all of them was designed as part of the thesis of Jash Dave [4].

Choice of key-value store: Different EPC components have different storage needs. The HSS requires persistent storage that is accessed once per user Attach/Detach, whereas the SGW and PGW access storage more frequently and hence need high availability and parallel access. A detailed analysis to find a suitable key-value store was carried out as part of the thesis of Jash Dave [4]; the analysis covered distributed key-value stores such as RAMCloud [5], Redis [3] and LevelDB [8].

3.2 Implementation

Here we discuss the implementation details of each component used to achieve the scalable design.
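The sketches in the remainder of this chapter assume a thin, uniform key-value interface of the kind provided by the client library of [4]. The wrapper below is a hypothetical stand-in, backed here by Redis via redis-py purely for illustration; it is not the actual library API, and only fixes the vocabulary (put, get, multi_put, delete) used in the later examples.

# Hypothetical stand-in for the uniform key-value client API of [4];
# backed here by Redis via redis-py purely for illustration.
import redis

class KVStore:
    def __init__(self, host="127.0.0.1", port=6379):
        self.r = redis.Redis(host=host, port=port)

    def put(self, key: str, value: bytes) -> None:
        self.r.set(key, value)

    def get(self, key: str) -> bytes:
        return self.r.get(key)

    def multi_put(self, mapping: dict) -> None:
        # Push several key-value pairs in one round trip (cf. section 3.2.2).
        self.r.mset(mapping)

    def delete(self, *keys: str) -> None:
        self.r.delete(*keys)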

3.2.1 Load balancing

We now discuss how load balancing is done in each of the clusters, i.e. the MME, SGW and PGW clusters.

MME cluster load balancing: MME load balancing involves only control traffic. For the MME workers to be stateless, we have to ensure that local state is pushed to the data store at the end of a session. In our implementation, we treat the sequence of all sub-procedures of an Attach request as one session, and likewise the Detach procedure as one session. With these session definitions we need to push state to the data store only when a session ends. This works only if all sub-procedures of a procedure, e.g. Attach, for a particular UE always hit the same worker until the procedure ends. As the sub-procedures of an Attach are carried on the same SCTP connection, we configure the load balancer to work in a round-robin fashion over new connections: any packet that is part of an existing connection is directed to the same MME worker. The LVS load balancer uses a hash of the 5-tuple to keep track of established SCTP connections and their destination workers. With this strategy, we synchronize data with the data store only when the Attach or Detach procedure ends. When a UE has attached successfully, it can detach even if the worker that performed the attach is no longer online: the load balancer directs the request to one of the online workers, and the detach can be processed by another MME worker by retrieving the state saved in the shared key-value store. However, there is still a chance that a worker fails in the middle of a procedure; in this case the intermediate state is lost and the procedure has to be re-initiated. If, in a particular scenario, load balancing must be based on some parameter such as the IMSI present in the message, deep packet inspection could be performed at the load balancer to achieve it.

SGW cluster load balancing: SGW load balancing handles two types of traffic: (1) control traffic and (2) data traffic.

Load-balancing control traffic: For the SGW workers to be stateless, we have to push SGW worker state to the data store. The LTE Attach procedure has two sub-procedures at the SGW, so in principle the state must be pushed to the data store at the end of each sub-procedure. We can reduce the number of state synchronizations by ensuring that all sub-procedures for a particular UE land on the same SGW worker node. UDP-based round-robin load balancing in LVS can be configured to maintain UDP sessions (the 4-tuple of the UDP connection), ensuring that requests from a particular UDP client always follow the same path. At the MME, we assign the sub-procedures of a particular UE's Attach request to a particular UDP socket, grouping requests from the same UE by hashing on its IMSI. Hence all messages in the same UDP session hit the same SGW replica, and we only need to synchronize the state at the end of the two sub-procedures (the complete Attach operation). However, if an SGW worker fails after processing only one of the sub-procedures, the intermediate state is lost and the procedure has to be re-initiated.

Load-balancing data traffic: The SGW is responsible for forwarding data packets from UEs based on the tunnel IDs set up in the control plane, and these tunnel IDs are saved in the shared data store of the SGW cluster. If each packet were sent round robin to a different worker, the SGW worker would have to fetch the saved state for every packet. To counter this, we again use the notion of a UDP session: as long as the UDP 4-tuple of a sequence of packets remains the same, we call it a single UDP session. We make the LVS UDP-based round-robin load balancer route data packets from a particular UE to the same SGW worker, so the worker can cache the relevant key-value pairs from the shared data store until the end of that UDP session and does not have to fetch state each time a packet arrives. We ensure that for a data session the UDP 4-tuple between the eNodeB and the SGW stays the same: we hash the inner (tunneled) IP address of the packet to assign it to the same UDP client socket every time, keeping the source address constant, while the destination address is always the SGW cluster load balancer. In this way the UDP 4-tuple remains the same for a particular UE, and the UDP load balancer directs that UE's packets to a particular SGW worker. The same scheme is used when downlink traffic arrives from the PGW workers at the SGW load balancer.
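The Python sketch below illustrates the mapping just described: traffic for a UE is assigned to one of a fixed pool of UDP client sockets by hashing the UE's IMSI or inner IP address, so that the 4-tuple seen by the LVS load balancer stays constant for that UE. This is an illustrative sketch under our own naming (the VIP address, port, pool size and helper names are hypothetical), not the actual NFV-LTE-EPC code.

# Illustrative sketch: pin each UE to one of a fixed pool of UDP client
# sockets so that the UDP 4-tuple (and hence the LVS scheduling decision)
# stays constant for that UE. Addresses, names and pool size are hypothetical.
import socket
import zlib

SGW_LB_VIP = ("10.0.0.200", 2152)      # assumed SGW cluster load-balancer VIP/port
NUM_CLIENTS = 8                        # fixed pool of UDP client sockets

# Each socket is bound once, so its source port (part of the 4-tuple) is fixed.
client_sockets = []
for _ in range(NUM_CLIENTS):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("0.0.0.0", 0))             # OS picks a port; it stays fixed for this socket
    client_sockets.append(s)

def socket_for_ue(ue_key: str) -> socket.socket:
    """Hash the UE identifier (IMSI or inner IP address) to a fixed client socket."""
    idx = zlib.crc32(ue_key.encode()) % NUM_CLIENTS
    return client_sockets[idx]

def send_for_ue(ue_key: str, payload: bytes) -> None:
    # All traffic of this UE leaves through the same socket, so the
    # (src IP, src port, dst IP, dst port) 4-tuple never changes and the
    # UDP load balancer keeps routing it to the same SGW worker.
    socket_for_ue(ue_key).sendto(payload, SGW_LB_VIP)

if __name__ == "__main__":
    send_for_ue("404685505601234", b"uplink payload for UE 1")
    send_for_ue("404685505605678", b"uplink payload for UE 2")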

PGW cluster load balancing: PGW load balancing handles two types of traffic: (1) control traffic and (2) data traffic.

Load-balancing control traffic: The Attach operation at the PGW consists of only one sub-procedure. For the PGW workers to be stateless, we push the saved state to the data store at the end of the Attach procedure. If a PGW worker fails after processing an Attach, another worker can continue to process the subsequent LTE procedures, such as a Detach request, by retrieving the last saved state from the data store.

Load-balancing data traffic:

Figure 3.4: UE assignment to fixed UDP clients at SGW

The PGW is responsible for forwarding UE data packets based on the tunnel IDs set up in the control plane, and these tunnel IDs are saved in the shared data store of the PGW cluster. If each packet were sent round robin to a different worker, the PGW worker would have to fetch the saved state for every packet. To counter this, we use the same strategy explained for the SGW: data traffic from a particular UE is forced to reach a particular PGW worker within a single UDP session by keeping the 4-tuple between the SGW worker and the PGW load balancer constant.

Figure 3.4 illustrates this idea. The UDP load balancer thus directs the packets of a given UE to a particular PGW worker, and by keeping the relevant information in the worker's cache we do not have to retrieve state from the data store each time a packet arrives.

Figure 3.5: UE data packet flow illustration

The same scheme is followed for downlink traffic from the sink to the PGW load balancer. In this way we ensure that every data packet of a particular UE travels through the same back-end servers to reach the sink. Figure 3.5 shows a possible flow of packets for three UE groups.

3.2.2 State separation

As discussed in earlier sections, the MME, SGW and PGW clusters must have stateless workers in order to scale horizontally, which requires a shared data store (key-value store) in each of these clusters. As discussed in section 3.2.1, state synchronization is done at the end of an LTE procedure in the MME, SGW and PGW workers. In control-plane operation, the saved state is pushed to the shared data store at these points, which enables any back-end worker in the cluster to service a particular UE. For example, a UE performs an Attach procedure serviced by MME worker 1; when the UE later performs a Detach, any of the back-end workers can pull the saved state for that UE from the data store and carry out the Detach.
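As a concrete illustration of this flow, the sketch below shows how an MME worker might push the UE context at the end of an Attach and how a different worker can later pull it to process the Detach. It reuses the hypothetical KVStore wrapper sketched at the start of section 3.2; the key names and context fields are simplified placeholders, not the actual NFV-LTE-EPC data structures.

# Illustrative session-split flow using the hypothetical KVStore wrapper
# from section 3.2; keys and context fields are simplified placeholders.
import json

def finish_attach(store, s1ap_ue_id: int, guti: str, ue_context: dict) -> None:
    # The worker that handled the Attach pushes the UE state once, at session end.
    value = json.dumps({"guti": guti, "ue_context": ue_context})
    store.put(f"mme:{s1ap_ue_id}", value.encode())

def process_detach(store, s1ap_ue_id: int) -> dict:
    # Any other MME worker can pick up the Detach: pull the saved state,
    # run the detach signalling (omitted), then remove the stored context.
    raw = store.get(f"mme:{s1ap_ue_id}")
    if raw is None:
        raise KeyError("no saved context; UE must re-attach")
    state = json.loads(raw)
    # ... detach signalling towards SGW/PGW would happen here ...
    store.delete(f"mme:{s1ap_ue_id}")
    return state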

In data-plane operation, the UE context information is safely present in the shared data store: when a UE sends data packets to an SGW worker, the worker can pull the UE context from the shared data store and perform packet forwarding. The NFV-LTE-EPC v2.0 system was integrated with the key-value data stores using the library API by Jash Dave [4], which provides interfacing to various key-value stores.

Strategies to push/pull UE context: Different optimization strategies for state synchronization were used to achieve optimal performance in the MME, SGW and PGW workers.

LTE EPC state information at MME: The per-UE state at the MME consisted of two key-value pairs:
(1) MME S1AP UE ID -> GUTI
(2) GUTI -> UEContext
These two pairs were merged into a single key-value pair by de-normalization. The resulting key-value pair is:
MME S1AP UE ID -> encapsulated[GUTI, UEContext]
This optimization reduces the number of pushes to a single put request to the data store. As the total size of the denormalized key-value pair is below the MTU limit, this is a viable option.

LTE EPC state information at SGW: The per-UE state at the SGW consisted of four key-value pairs:
(1) S11 CTEID SGW -> IMSI
(2) S1 UTEID UL -> IMSI
(3) S5 UTEID DL -> IMSI
(4) IMSI -> UEContext

The IMSI -> UEContext pair is merged into the other three key-value pairs, following the principle of denormalization, to reduce the number of key-value pairs to three. The resulting key-value pairs are:
S11 CTEID SGW -> encapsulated[IMSI, UEContext]
S1 UTEID UL -> encapsulated[IMSI, UEContext]
S5 UTEID DL -> encapsulated[IMSI, UEContext]
These key-value pairs have different keys and cannot be merged further, so they are pushed to the data store using the multi-put option available in the key-value store library API by Jash Dave [4]. Designs with parallel pushes to the data store using parallel pthreads, and a variant using thread pools instead of creating threads on the fly, were also implemented. As the multi-put based design outperformed the other options, it was chosen as the final design.

LTE EPC state information at PGW: The per-UE state at the PGW consisted of three key-value pairs:
(1) S5 CTEID UL -> IMSI
(2) UE IP ADDR -> IMSI
(3) IMSI -> UEContext
The IMSI -> UEContext pair is merged into the other two key-value pairs to reduce the number of key-value pairs to two. The new key-value pairs are:
S5 CTEID UL -> encapsulated[IMSI, UEContext]
UE IP ADDR -> encapsulated[IMSI, UEContext]
As at the SGW, the resulting key-value pairs cannot be merged further, so they are pushed to the data store using the multi-put option of the key-value store library API by Jash Dave [4].
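To make the denormalization and multi-put concrete, the sketch below builds the three SGW key-value pairs for one UE and pushes them in a single multi-put call, again using the hypothetical KVStore wrapper from section 3.2 rather than the actual library API of [4]; the key prefixes, parameter names and context fields are illustrative.

# Illustrative sketch of the SGW state push: denormalize the UE context into
# the three tunnel-keyed pairs and push them with one multi-put call.
# Uses the hypothetical KVStore wrapper; keys and fields are placeholders.
import json

def push_sgw_state(store, imsi: str, s11_cteid_sgw: int,
                   s1_uteid_ul: int, s5_uteid_dl: int, ue_context: dict) -> None:
    # encapsulated[IMSI, UEContext]: the context is embedded under every key,
    # so a worker can recover it from whichever tunnel ID a packet carries.
    value = json.dumps({"imsi": imsi, "ue_context": ue_context}).encode()
    store.multi_put({
        f"sgw:s11_cteid:{s11_cteid_sgw}": value,
        f"sgw:s1_uteid_ul:{s1_uteid_ul}": value,
        f"sgw:s5_uteid_dl:{s5_uteid_dl}": value,
    })

# Example usage at the end of an Attach (values are made up):
# push_sgw_state(KVStore(), "404685505601234", 1001, 2001, 3001,
#                {"bearer_id": 5, "qci": 9})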

Chapter 4 EVALUATION

NFV-LTE-EPC v2.0 is a distributed design for the LTE EPC aimed at horizontal scaling: the design allows more computing resources to be added behind the load balancers. In this chapter we evaluate this claim along different performance parameters and answer the following questions:
(1) Can we achieve horizontal scaling?
(2) Is scaling efficient enough to provide a linear increase in throughput?
(3) Does a scaled system use most of its resources?
(4) Does the choice of data store affect the performance of our design?
We perform experiments on control-plane and data-plane scalability to answer the questions on scaling performance, and we test our prototype with several key-value stores to see how the choice of data store affects system performance.

4.1 Experimental setup

The complete experimental setup requires 15 virtual machines for the new LTE EPC design. The experiments were carried out on virtual machines of a KVM-virtualized system; all virtual machines used in the experiments were provisioned from the host specified in Table 4.1.

Table 4.1: System specification
CPU: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (Architecture: x86_64, Cores: 48, Hyperthreading enabled)
RAM: 64 GB
Hard disk space: 2 TB
Networking: 1 Gbps

The set of experiments requires the following components: RAN, SINK, PGW (3 instances), SGW (3 instances), MME (3 instances), HSS, a load balancer each for MME, SGW and PGW, and the shared key-value store. Each component is hosted on a separate virtual machine with the specification given in Table 4.2.

Table 4.2: Individual component specification
COMPONENT                        CPU CORES   RAM     DISK    OS
RAN, SINK                        8           8 GB    10 GB   Ubuntu 14.04
MME, SGW, PGW, HSS               1           2 GB    10 GB   Ubuntu 14.04
MME / SGW / PGW LOAD BALANCERS   2           4 GB    10 GB   Ubuntu 14.04
LEVELDB CLUSTER                  8           8 GB    20 GB   Ubuntu 14.04
RAMCLOUD CLUSTER                 8           8 GB    20 GB   Ubuntu 14.04
REDIS CLUSTER                    8           8 GB    20 GB   Ubuntu 14.04

4.2 Control plane evaluation

In this section we evaluate the control-plane performance of our distributed architecture. The RAN acts as a simulator for eNodeB + UE and generates load for the LTE EPC setup: it repeatedly performs Attach and Detach operations concurrently for a number of UEs. We keep increasing the number of concurrent UEs to generate more load and record the number of registrations per second (control-plane throughput). The experiment is performed for 14 values of UE count, varying from 1 to 200.

Each experiment is run for a duration of 30 seconds (approximately 50,000 registration cycles). Each such set of experiments is repeated for scaled setups of 2 workers and 3 workers.

Throughput evaluation: We start with one instance each of MME, SGW and PGW, using the LevelDB cluster as the shared key-value store for the control-plane evaluation, and increase the number of MME/SGW/PGW instances up to 3 workers. Figure 4.1 plots the number of registrations per second (Y axis) against the number of concurrent UEs (X axis) for the following three setups:
(1) One worker of MME/SGW/PGW
(2) Two workers of MME/SGW/PGW
(3) Three workers of MME/SGW/PGW

Figure 4.1: Control plane throughput scaling

Figure 4.1 demonstrates that the number of registrations per second increases proportionately as we increase the number of worker instances, and saturates at about 90-100 concurrent UEs. The saturation throughput achieved in the three setups is:
(1) One worker of MME/SGW/PGW: 1750 registrations/sec
(2) Two workers of MME/SGW/PGW: 3059 registrations/sec
(3) Three workers of MME/SGW/PGW: 4267 registrations/sec

Experiment observation: From the monotonic increase in throughput across the three setups, we observe that the system delivers more throughput as it scales. From the above data, the scaling factor is 1.74 for the two-worker system and 2.43 for the three-worker system, with the one-worker system as the baseline. This shows that although the system scales, it does not scale linearly with the number of worker instances.

Figure 4.2: CPU utilization of SGW component

CPU utilization evaluation: We evaluate the CPU utilization of our system by monitoring the two most stressed components, the SGW and the MME. We measured the average CPU utilization of the SGW and MME during the experiment, obtained by averaging per-second CPU utilization values over the 30-second duration, repeated for the 14 UE counts. Figure 4.2 shows the CPU utilization percentage of the SGW for all three setups, with the number of concurrent UEs on the X axis and the CPU utilization of the SGW instance on the Y axis. We can see that the CPU saturates as we approach 90-100 concurrent UEs.

Figure 4.3: CPU utilization of MME component

Figure 4.3 shows the CPU utilization percentage of the MME for all three setups, with the number of concurrent UEs on the X axis and the CPU utilization of the MME instance on the Y axis. We can see that the MME CPU utilization also approaches saturation.

Experiment observation: In each setup the SGW reaches CPU saturation and is the bottleneck resource. As discussed in section 3.2.2, the SGW has three key-value pairs to push to the data store via multi-put, whereas the MME pushes only a single key-value pair. This extra state synchronization between the SGW and the data store causes the SGW to saturate before the MME.

Latency evaluation: We monitor the latency (the duration of a complete registration cycle). This experiment was performed for 14 UE counts in the range 1 to 200; each experiment ran for 30 seconds and was repeated for all three setups.

Figure 4.4: Latency

Figure 4.4 plots the latency (the duration of one complete registration) against the number of concurrent UEs. Latency keeps increasing as the number of concurrent UEs grows. In section 4.4 we explore designs in which our system is integrated with other key-value stores and compare the latency of the different setups.

4.3 Data plane evaluation

In this section we evaluate the data-plane performance of our distributed design. We use the iperf3 load generation tool to generate data-plane load for the system, and maintain a fixed number of concurrent UEs chosen so that traffic is spread across all SGW/PGW worker instances. Each experiment runs for 40 seconds; we keep increasing the input load to the system and record the data throughput at the sink.

Throughput evaluation: For throughput evaluation we take observations for 12 input load values ranging from 80 Mbps to 480 Mbps. We record the bandwidth at the sink at 3-second intervals and average it to obtain the average bandwidth. We repeat this experiment for three scenarios: one, two and three workers of MME/SGW/PGW. Figure 4.5 shows the data-plane throughput in these three scenarios, with the input load in Mbps on the X axis and the load observed at the sink in Mbps on the Y axis.

Figure 4.5: Data plane throughput scaling

The average saturation throughput achieved in the three setups is:
(1) One worker of MME/SGW/PGW: 183 Mbps
(2) Two workers of MME/SGW/PGW: 315 Mbps
(3) Three workers of MME/SGW/PGW: 352 Mbps

Experiment observation: The scaling factor between setups 2 and 1 is 1.72; the system scales, but not linearly. Similar scaling could not be demonstrated for setup 3 because the load generator bottlenecks at 364 Mbps; improving the load generation is left as future work.

CPU utilization evaluation: We monitor the most stressed component in data-plane operation, the SGW, for CPU utilization. We recorded CPU utilization values at 1-second intervals for 30 seconds and averaged them to obtain the utilization for a given input load, repeating this for the 12 input load values ranging from 80 Mbps to 480 Mbps.

Figure 4.6: SGW CPU utilization

Figure 4.6 shows the CPU utilization of the SGW at various data loads, with the input load in Mbps on the X axis and the CPU utilization percentage on the Y axis.

Experiment observation: For setups 1 and 2, the CPU saturates at 77 percent when the peak throughput is reached. For setup 3, however, the CPU remains about 50 percent idle because the data-load generator hits its bottleneck.

4.4 Evaluation comparison on different key value stores

To explore other data-store choices, our distributed LTE EPC design is also evaluated with a Redis [3] cluster and a RAMCloud [5] cluster. We repeat the control-plane experiments that were performed with the LevelDB cluster and compare the results. As the data-plane performance does not vary with the key-value store, by design, we do not include a data-plane evaluation here. We start with experiments on setup 1 (1 worker of MME/SGW/PGW), integrating our design with:
(1) an in-memory hash map
(2) LevelDB [8]
(3) Redis [3]
(4) RAMCloud [5]

Throughput and latency comparison: Figure 4.7 compares the control-plane throughput (registrations/sec) across the above-mentioned data stores, with the number of concurrent UEs on the X axis and the number of registrations per second on the Y axis. The in-memory hash-map based design performs best, as it stores the EPC state locally; although this design is not practical, it establishes a reference line for performance. Among the reliable key-value stores, the LevelDB [8] based design performs better than Redis [3] and RAMCloud [5].

Figure 4.7: Comparison of control plane throughput across various data stores

Figure 4.8 shows the latency in seconds on the Y axis and the number of concurrent UEs on the X axis.

Figure 4.8: Comparison of control plane latency across various data stores

Experiment observation: From the above observations we can say that the choice of key-value store has an impact on our design; LevelDB [8] performs best in the single-worker setup. LevelDB [8] has the lowest latency among the key-value stores and Redis [3] the highest.

To establish a clear comparison of the throughput and scalability of the designs, we perform the control-plane experiment for all nine scenarios:
LevelDB [8] integrated with 1 worker / 2 workers / 3 workers
Redis [3] integrated with 1 worker / 2 workers / 3 workers
RAMCloud [5] integrated with 1 worker / 2 workers / 3 workers

Figure 4.9: Comparison of control plane scalability across various data stores (saturation throughput in registrations/sec: 1 worker - Redis 714, RAMCloud 880, LevelDB 1750; 2 workers - Redis 1373, RAMCloud 1634, LevelDB 3108; 3 workers - Redis 1651, RAMCloud 2422, LevelDB 4356)

Figure 4.9 shows the control-plane throughput (registrations/sec) on the Y axis and the number of workers in the design on the X axis; we compare the saturation throughput of all nine combinations, taking the throughput of the 1-worker design as the baseline for scalability. The LevelDB-based design delivers the highest throughput: with 2 worker nodes its scaling factor is 1.77, and with 3 worker nodes it reaches 2.49. The Redis [3] based design has the lowest throughput among its peers, providing 1.92 scaling for 2 workers and 2.31 for 3 workers.

The RAMCloud [5] based design delivers higher throughput than the Redis [3] cluster; in terms of scaling performance, RAMCloud [5] provides 1.85 scaling for 2 workers and 2.75 for 3 workers. This analysis concludes that the LevelDB integration provides higher absolute throughput, but the RAMCloud-based design comes comparatively closer to a linearly scalable solution.

Chapter 5 CONCLUSION AND FUTURE WORK

We conclude with a brief summary of our contribution and the future work planned for the Stage II thesis. We presented a distributed architecture, NFV-LTE-EPC v2.0, that preserves the state of the EPC components in a reliable shared data store. In case of failure, other replicas of an EPC component can retrieve the state from the shared store and continue operation. We also introduced load balancers in front of the MME/SGW/PGW replicas to provide a single point of interfacing with the other components; multiple workers can be added behind a load balancer, transparently to the other EPC components, to withstand more load. We explored integration with key-value stores such as Redis [3], RAMCloud [5] and LevelDB [8] to find the most suitable data-store option for a scalable design.

Future work: The plan for the Stage II thesis involves integration with netmap or Intel DPDK to provide higher data-forwarding capacity. We also plan to add auto-scaling capabilities to our distributed design, provisioning new EPC components when the load increases.

References

[1] An, X., Pianese, F., Widjaja, I., and Günay Acer, U., 2012 Sep., DMME: A distributed LTE mobility management entity, Bell Labs Technical Journal 17, 97-120.
[2] Banerjee, A., Mahindra, R., Sundaresan, K., Kasera, S., Van der Merwe, J., and Rangarajan, S., 2015 Dec., Scaling the LTE control-plane for future mobile access, in Proceedings of the Eleventh ACM International Conference on Emerging Networking EXperiments and Technologies (CoNEXT).
[3] Carlson, J. L., 2013, Redis in Action (Manning Publications Co., Greenwich, CT, USA). ISBN 1617290858, 9781617290855.
[4] Dave, J., 2016, In memory key value data store, Master's thesis (IIT Bombay, India).
[5] Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Ongaro, D., Parulkar, G., Rosenblum, M., Rumble, S. M., Stratmann, E., and Stutsman, R., 2011 Jul., The case for RAMCloud, Communications of the ACM 54, 121-130.
[6] Premsankar, G., 2015 Jul., Design and Implementation of a Distributed Mobility Management Entity (MME) on OpenStack, Master's thesis (Aalto University, School of Science, Degree Programme in Computer Science and Engineering, Espoo).
[7] Sadagopan, N., 2016, An NFV based prototype LTE EPC, Master's thesis (IIT Bombay, India).
[8] syndtr, 2016, leveldb, https://github.com/syndtr/goleveldb
[9] Takano, Y., Khan, A., Tamura, M., Iwashina, S., and Shimizu, T., 2014, Virtualization-based scaling methods for stateful cellular network nodes using elastic core architecture, in IEEE 6th International Conference on Cloud Computing.