
UNIVERSITY OF CALGARY

Flexible and Scalable Routing Approach for Mobile Ad Hoc Networks by Function Approximation of Q-Learning

by

Mohamad Elzohbi

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN COMPUTER SCIENCE

CALGARY, ALBERTA

MAY, 2016

© Mohamad Elzohbi 2016

Abstract

Wireless mobile devices are spreading so rapidly that it is hard to find a person not exposed to such technology. These devices can be connected directly or indirectly by wireless channels to form a mobile ad hoc network (MANET). Finding a route for a flow from a source to a destination in a network is known as routing. Dynamic topology and unstable link states are the main problems facing routing in MANETs. This thesis employs reinforcement learning, namely Q-learning, to develop a routing mechanism. Features inspired by the network are used in approximating the Q-function to form a new intelligent routing metric. This way, the routing process concentrates on specific routes instead of network-wide broadcasting. Accordingly, it is possible to achieve flexibility and scalability in routing. Advantages of the proposed routing technique are highlighted by experiments conducted in two MANET environments, namely hand-held-device-based MANETs and VANETs.

Keywords: Mobile ad hoc networks, reinforcement learning, Q-learning, routing protocol, VANETs, function approximation, reusability.

Acknowledgements

Instead of asking Google how to acknowledge, and stepping out of the patterns that might turn a genuine acknowledgement into a doubtful convention, I would like to express my feelings to those who helped me in a model-free way! First, I want to express my tremendous appreciation for my supervisor, Professor Reda Alhajj, who guided me moving from one state to another following his reward model until I successfully reached my goal. He enlightened me on how to set my priorities in order to maximize my total academic rewards by reasonably choosing the right features and avoiding the misleading ones instead of acting greedily. Second, to my co-supervisor Professor Jon Rokne for his continuous support, to my examination committee, Professor Jalal Kawash and Professor Mohamad Helaoui, for their great comments, to the Department of Computer Science staff, and to all my friends and colleagues who supported me, I would like to say: Thank you! Unfortunately, my vocabulary is not scalable enough to satisfactorily acknowledge my parents! LOVE describes the whole story...!

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols

1 Introduction
  1.1 Motivation
    1.1.1 Ad hoc On-Demand Distance Vector (AODV) Routing Method
    1.1.2 Q-Learning Based Routing Mechanisms
  1.2 Overview of the Proposed Approach
  1.3 Objective
  1.4 Main Contributions
  1.5 Thesis Organization
2 Background and Related Work
  2.1 Mobile Ad hoc Networks
  2.2 Overview of Some MANET Routing Approaches
    2.2.1 Uniform routing protocols
    2.2.2 Non-Uniform routing protocols
  2.3 Machine Learning Based Routing Approaches
    2.3.1 Reinforcement Learning
  2.4 Reinforcement Learning and Markov Decision Process
    2.4.1 MDP
    2.4.2 Value iteration
    2.4.3 Policy iteration
    2.4.4 Q-learning
    2.4.5 Function Approximation
3 The Methodology: Tuning Q-Learning by Function Approximation for Better Routing in MANETs
  3.1 From Practical Scenario to Protocol Description
  3.2 Building on AODV Protocol
  3.3 Implementing features
    3.3.1 Availability Based Feature
    3.3.2 Location based feature
    3.3.3 Stability based feature
    3.3.4 Discount Feature
  3.4 Q-learning Approximation
    3.4.1 The Markov Decision Process Model
    3.4.2 Updating Q-values
    3.4.3 Illustrative Demo
4 Experimental Results and Analysis
  4.1 Simulation Environment

    4.1.1 JiST
    4.1.2 SWANS
    4.1.3 SWANS Considered Features
  4.2 Simulation Strategy
    4.2.1 Learning Rate and Convergence
  4.3 Mobility Models
    4.3.1 Random Way Point
    4.3.2 STRAW
  4.4 Experiments with Typical MANETs
    4.4.1 Packet Delivery Ratio
    4.4.2 Control Overhead
    4.4.3 End-to-end Delay
    4.4.4 Speed Effect Test
  4.5 Experiments with VANETs
    4.5.1 Packet Delivery Ratio
    4.5.2 Control Overhead
    4.5.3 End-to-end Delay
  4.6 Summary
5 Summary, Conclusions and Future Research Directions
  5.1 Summary
  5.2 Conclusions
  5.3 Future Research Directions
    5.3.1 Enhancements
    5.3.2 Usage
Bibliography

List of Tables

4.1 Comparison between JiST, ns2 and Glomosim in terms of Event throughput [5]
4.2 Comparison between JiST, ns2 and Glomosim in terms of Memory Footprint [5]
4.3 Parameters of the used simulation environment
4.4 Parameters of the used simulation environment
4.5 Parameters for the experiment which checks speed effect on routing
4.6 Parameters used in the experiment conducted to check effect of pause time on the routing protocol
4.7 Parameters used in the experiments related to VANETs

List of Figures and Illustrations

1.1 A sample infrastructure wireless network
1.2 A sample ad hoc network
1.3 AODV routing protocol
1.4 Reinforcement Learning mechanism
2.1 Sample example networks of various MANET Types
2.2 Q-routing
2.3 Illustrating policies and values
2.4 DP vs TD vs MC [34]
2.5 Values vs Q-Values
2.6 Value Functions Approximation Scenarios in RL [34]
3.1 Block diagram of feature based route selection
3.2 Illustrating how State/Action == Action/State
3.3 Example Q-values tables
3.4 An example MANET Network
3.5 Illustrating the exploration approach
3.6 Illustrating the exploitation approach
3.7 Flow chart for the exploitation approach
4.1 The cost function for α =
4.2 The cost function for α =
4.3 The cost function for variable value of α
4.4 Packet Delivery ratio for normal and extreme conditions
4.5 Control Overhead for normal and extreme conditions
4.6 End-to-end Delay for normal and extreme conditions
4.7 Packet Delivery Ratio for Variable Speeds
4.8 End-to-end Delay for normal and extreme conditions
4.9 Packet Delivery Ratio for Variable Pause Time
4.10 End-to-end Delay for normal and extreme conditions
4.11 North End Boston Map
4.12 Packet Delivery Ratio for normal and extreme conditions
4.13 Control Overhead for normal and extreme conditions
4.14 End-to-end Delay for normal and extreme conditions

List of Symbols, Abbreviations and Nomenclature

Symbol    Definition
U of C    University of Calgary
QoS       Quality of Service
TD        Temporal Difference
RL        Reinforcement Learning
DP        Dynamic Programming
MC        Monte Carlo
MDP       Markov Decision Process
MANET     Mobile Ad hoc Network
VANET     Vehicular Ad hoc Network
AODV      Ad hoc On-demand Distance Vector
FAODV     Feature-Based Ad hoc On-demand Distance Vector (our proposed extension)
RREQ      Route Request Message
RREP      Route Reply Message
RERR      Route Error Message
JiST      Java in Simulation Time
SWANS     Scalable Wireless Ad hoc Network Simulator
STRAW     STreet RAndom Waypoint
DSR       Dynamic Source Routing
OLSR      Optimized Link State Routing
PDR       Packet Delivery Ratio

Chapter 1

Introduction

The discovery of radio communication and wireless networks has enabled people to communicate in an easier and more flexible manner. People have mostly aimed to achieve this without relying on wires, with their constraints, costs and physical breakages. In fact, people seek comfort in moving towards a wireless communication platform, which provides a more widely available, mobile and smoothly accessible network. However, they should be aware of the drawbacks and problems that may arise from switching to a mobile network. These drawbacks include security issues, routing and speed problems, among others. Therefore, designing protocols and techniques effective enough to overcome these problems has attracted considerable attention in academia and industry.

As far as the scope of this thesis is concerned, I have realized that proposing a routing protocol which can be adapted to various types of wireless networks, leading to better QoS, has been a serious challenge facing the research community. This thesis contributes a novel routing approach for mobile ad hoc wireless networks. The proposed routing approach builds on top of and extends an existing popular approach, making it more effective, flexible and scalable. This introductory chapter includes a statement of the problem and motivation, an overview of the proposed routing approach, a list of contributions, and a brief coverage of the thesis organization.

1.1 Motivation

Wireless networks are generally divided into two types: infrastructure-based and self-configurable ad hoc networks, as illustrated in Figure 1.1 and Figure 1.2, respectively. In infrastructure-based wireless networks, nodes or clients send packets across a central access point, which acts as a server that deals with routing problems. Clients do not forward packets of each other; thus, they do not need to consider complex routing problems.

Even if the destination is closer to the source than the server, packets will be forwarded to the server, which will then forward them to the desired destination(s).

Figure 1.1: A sample infrastructure wireless network

Figure 1.2: A sample ad hoc network

In general, most wireless networks nowadays are centralized, from the cellular networks of mobile phone providers to the small local area networks at home or in the office, where a router acts as the main access point that forwards packets.

Ad hoc networks are self-configurable, peer-to-peer networks that do not rely on a central station or server to deal with routing issues. Instead, each node acts as an access point, which can send, receive, and forward packets between peers and across the network. Initially, ad hoc networks were mostly used in emergency situations where no infrastructure was available, such as natural disasters and military field deployments. This was due to their flexible structure, which is easy to configure and deploy. However, ad hoc networks have some limitations compared to infrastructure-based networks. In the former, it is necessary to worry about the lifespan of the network due to its unpredictable random topology, in addition to other problems such as battery life and bandwidth limitations. The lack of a central node is also an issue in optimally coordinating the routing process in this kind of topology. Thus, one main challenge related to routing may be identified as providing a high quality of service (QoS) to clients of mobile ad hoc networks (MANETs).

A static route from a source node to a destination node is not practical in a dynamic topology such as MANETs. As described in the literature, several protocols have been proposed to solve the routing problem in such networks. They are mainly divided into three categories: proactive, reactive, and hybrid protocols. Proactive protocols periodically maintain a table that contains a list of the nodes available in the network environment. This table is sent to neighbors for periodic updates. This captures an overview of the network topology, and accordingly a route is established based on the known topology. However, this method has some drawbacks. For instance, the high mobility of nodes in such a topology may be enough to make the current table in a node untrustworthy. The time and space complexity of the system depend on the growing number of nodes in the network. In reactive protocols, route request messages are produced and sent over the network to find a destination on demand. One popular example of a reactive protocol is the Ad hoc On-Demand Distance Vector (AODV) [30], which also forms the basis for the work described in this thesis.

This type of routing protocol eliminates the need for discovering the current topology. However, there is an associated cost, which is mainly the extra time needed to find a route to the destination. In MANETs, delivery time is a serious issue due to the high mobility of nodes. Finally, hybrid protocols combine advantages of both proactive and reactive protocols. However, one must be careful when choosing a hybrid protocol because in some contexts it may inherit the disadvantages of both proactive and reactive protocols. In large topologies, this might lead to a significant increase in space complexity with an insignificant improvement in discovery time.

Figure 1.3: AODV routing protocol

1.1.1 Ad hoc On-Demand Distance Vector (AODV) Routing Method

The Ad hoc On-Demand Distance Vector (AODV) [30] protocol is one of the main reactive protocols designed and proposed to tackle the routing problem of mobile wireless ad hoc networks. As illustrated in Figure 1.3, in this mechanism, a node needing to transmit data packets to a destination broadcasts route request messages (denoted RREQ) to its neighbors.

Every recipient node will forward these messages to its neighbors and update the route information accordingly. Each node which receives a RREQ message keeps an entry in its routing table in order to prevent loops and for recalling the destination. When a message reaches the destination, a route reply (denoted RREP) is unicast (unicast stands for the process of sending a packet to one destination) through the located route in the network. Then, actual data transmission is established based on the latter route. When an active node is disconnected, a route error message (denoted RERR) is sent across the routing nodes to inform them that there is a breakage and the link should be recalculated.

One main problem related to this protocol is associated with time, because finding a route can take a long time; and as mentioned earlier, nodes in MANETs are highly dynamic, i.e., time is very critical and important in this context. Another problem is attributed to broadcasting RREQ, which increases the overhead of the network in the discovery process; as a result, the network is flooded with control messages. To overcome these problems, researchers have preferred to extend AODV, which has been well studied and tested, instead of designing a new routing protocol from scratch. Existing enhancements and modifications described in the literature include route maintenance [1] and bandwidth usage improvements [40], among others. Machine learning based extensions and enhancements for the AODV protocol have been considered as well [17].
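To make the mechanism above concrete, the following is a minimal Java sketch of how a node might process an incoming RREQ. It is only an illustration of the prose above, not the actual AODV or JiST/SWANS implementation; all class, field and method names (RreqPacket, RouteTable, etc.) are hypothetical.

// Hypothetical sketch of RREQ handling; names are illustrative, not from any real AODV code.
class RreqPacket {
    int sourceId, destId, requestId, hopCount;
}

class Node {
    int id;
    RouteTable routeTable = new RouteTable();
    java.util.Set<Long> seenRequests = new java.util.HashSet<>();

    void onRouteRequest(RreqPacket rreq, int previousHop) {
        long key = (((long) rreq.sourceId) << 32) | (rreq.requestId & 0xFFFFFFFFL);
        if (!seenRequests.add(key)) return;       // duplicate RREQ: ignore to prevent loops
        // Keep a reverse-route entry so a later RREP can be unicast back to the source.
        routeTable.put(rreq.sourceId, previousHop, rreq.hopCount);
        if (rreq.destId == id || routeTable.hasRouteTo(rreq.destId)) {
            unicastRouteReply(rreq, previousHop); // RREP travels back along the reverse route
        } else {
            rreq.hopCount++;
            broadcastToNeighbors(rreq);           // the flooding step criticized above
        }
    }

    void unicastRouteReply(RreqPacket rreq, int nextHop) { /* send RREP towards the source */ }
    void broadcastToNeighbors(RreqPacket rreq) { /* forward the RREQ to all neighbors */ }
}

class RouteTable {
    private final java.util.Map<Integer, int[]> entries = new java.util.HashMap<>();
    void put(int dest, int nextHop, int hops) { entries.put(dest, new int[] { nextHop, hops }); }
    boolean hasRouteTo(int dest) { return entries.containsKey(dest); }
}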

1.1.2 Q-Learning Based Routing Mechanisms

Reinforcement learning (RL) is a subfield of machine learning which has been used extensively in the literature to develop effective solutions for various research and practical problems. One realization of the RL mechanism is illustrated in Figure 1.4. For handling the routing problem, Q-learning in particular has been used for establishing a more stable route to reach a destination, e.g., [12][38]. However, the topology is dynamic, and model-based reinforcement learning cannot be applied in such a context. Instead, model-free approaches have been used.

Figure 1.4: Reinforcement Learning mechanism

Q-learning is a type of Temporal Difference (TD) learning and is considered a model-free reinforcement learning approach. One attraction of Q-learning is its capability to discover an optimal policy, which is learned by experience gained from exploring a given domain. The basic Q-learning model utilizes a lookup table to keep the Q-values associated with the explored domain. However, the size of the lookup table becomes a problem when scalability is the target. An alternative to lookup tables is approximation.

Q-learning was first used in routing by the work described in [8], where the authors used nodes as states. Some limited the states to 1-hop or 2-hop neighbors [12]. In this context, a lookup table that contains states of the whole network is not realistic due to the space limitations of mobile nodes. This forms one of the main motivations for my research described in this thesis. More details on AODV, reinforcement learning and Q-learning may be found in Chapter 2.

1.2 Overview of the Proposed Approach

To overcome limitations associated with existing Q-learning based routing protocols, the work described in this thesis proposes a more practical, flexible and scalable approach to calculate and combine multiple routing measures in a function approximation of Q-learning. This is achieved by considering a linear combination of reliable features, where Bellman's equation is used for updating the weights of the features while giving preference to active ones. As a result, it is possible to state that the proposed routing approach makes better and more effective usage of Q-learning, because it integrates Q-learning to achieve scalability, which is a major and serious concern in MANETs. The states of a network do not need to be known to an agent (holding a message) in order to traverse them. Instead, the states of a network are averaged into one value that represents the expected future state of the network. This way, a source node does not need to view the whole topology of the network in order to make a decision. Q-values are updated based on the active features of nodes. Consequently, it is possible to smoothly identify the node which is more likely to be chosen as the next hop to connect the source to the desired destination. Finally, all components and various aspects of the proposed approach are covered in Chapter 3.

1.3 Objective

The main objective of this work is to propose a practical, flexible and scalable mechanism that improves routing in MANETs by choosing the best route from a given source to a desired destination based on active features of network nodes. Adapting to network changes by using an approximation of the Q-learning approach greatly helps in reducing memory usage, processing, latency and overhead, especially in small mobile devices. This leads to visible and attractive contributions, as described in the sequel.

1.4 Main Contributions

The main contributions of the research described in this thesis are:

1. Enhancing the AODV routing protocol to be more efficient for decision making, in terms of sensing active network measures and integrating them in the approximation process, which helps in deciding on an optimal route.

2. Providing a brief overview of the environment to the nodes in the network by deriving a mathematical model that can adapt and converge to an optimal decision policy.

3. A new, efficient and scalable algorithm that can be modified to fit given standards and adapt to a dynamic topology. This is achieved by integrating Q-learning into a feature based model.

1.5 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 elaborates on the background on MANETs and the approaches developed to solve the problem of unstable routes. It also discusses machine learning techniques, and more specifically the ones directly related to the work described in this thesis. Chapter 3 describes in detail the various aspects of the proposed routing mechanism. Chapter 4 presents the framework that we have used to test the proposed routing protocol. It reports interesting and encouraging results along with the discussion necessary to highlight various attractive characteristics of the proposed routing approach. Chapter 5 provides a summary, conclusions and future research directions.

Chapter 2

Background and Related Work

Routing in networks is a vital process that has attracted considerable attention in the research community, in academia and industry. Researchers have concentrated on finding effective routing mechanisms capable of finding an optimal path from a source to a destination. Many routing algorithms have been developed based on a variety of techniques, including machine learning concepts. Each routing algorithm has its attractive characteristics, which have made it acceptable to the research community. AODV may be considered the most popular routing algorithm, and several researchers have based their work on this algorithm by adding extensions or by concentrating on certain parameters. This tune-up process has enriched AODV, leading to more attractive versions characterized by greater effectiveness and efficiency. The routing mechanism described in this thesis falls in this category, where Q-learning has been used, leading to a more flexible approach.

This chapter provides an overview of the main routing techniques and the necessary background required to understand the proposed routing approach. First, we present a brief background on MANETs and the routing approaches described in the literature which tackle the routing problem, including reinforcement learning based approaches. Then we discuss the advantages and disadvantages of existing routing techniques. This leads to a conclusion highlighting how the routing solution proposed in this thesis provides a good fit for enhancing routing in MANETs.

2.1 Mobile Ad hoc Networks

A Mobile Ad hoc Network (MANET) is a collection of wireless mobile nodes with no central administration. Nodes can move independently, connect to neighbors, receive traffic, and forward messages across multi-hop nodes to reach a desired destination.

MANETs are used where the use of mobile nodes is essential in the absence of central coordination. Nodes in MANETs are not guaranteed to stay in a network cluster and cannot instinctively substitute for the central coordination. Thus, nodes behave independently but benefit from the existence of other nodes in their vicinity as mediators to help in sending messages to some target destination(s) available in the network. Nodes are autonomous and have to forward messages across the network by following the best possible routing policy with minimum knowledge of the current network architecture. Some interesting, demanding and rapidly emerging application areas and customized versions of MANETs are illustrated in Figure 2.1 and can be described as follows.

Figure 2.1: Sample example networks of various MANET Types

Vehicular Ad hoc Networks (VANETs): Nodes in these networks are vehicles. Researchers have recently realized the importance of incorporating some intelligence in the traffic network in order for vehicles to communicate, connect, avoid accidents, etc. Routing in VANETs is more challenging than in other kinds of MANETs because of the high mobility and speed associated with their wireless nodes.

Smart Phone Ad hoc Networks: a project run by MITRE [36], a non-profit organization sponsored by the US government for research and development purposes. It is a framework and proof-of-concept implementation of a functional mobile ad hoc mesh network for the Android smart phone platform, where no cellular carrier or any other central infrastructure is needed. In this environment, smart phones can pair with each other to form a MANET based on Wi-Fi or Bluetooth.

Military MANETs: These have emerged after MANETs recently gained popularity in military and tactical applications. This domain suffers from the lack of infrastructure in battlefields and emphasizes the importance of secure communication [9].

Internet based mobile ad hoc networks (iMANETs): any type of MANET can be connected to the Internet by integrating Internet base stations. Nodes which are neighbors of Internet stations act as gateways for the other nodes in the ad hoc network, so that the latter nodes can find their way of transmitting their data through the Internet [14].

Because of the very dynamic topology in MANETs, routing becomes a challenging problem that needs careful attention to develop effective working routing solutions. This has received considerable attention in the research community, and many routing approaches have been developed, though each has its characteristics, advantages and disadvantages. In the next section, we provide a brief overview of some main MANET routing approaches described in the literature.

2.2 Overview of Some MANET Routing Approaches

The main constraint in MANETs is the very dynamic associated topology, to the extent that it is almost unknown and must be discovered. Many approaches have been proposed for designing routing protocols capable of effectively handling the routing process in MANETs. As mentioned in Chapter 1, the routing approaches described in the literature can be categorized into three groups in terms of design, namely proactive, reactive, and hybrid. These can be further classified into two main categories based on network structure, namely uniform and non-uniform protocols. The characteristics of each of these categories are described in the remainder of this section.

2.2.1 Uniform routing protocols

Uniform routing protocols are characterized by not adopting any hierarchical distinction of nodes in the network. Instead, all nodes are treated uniformly, i.e., all nodes in the network have the same role. Further, uniform protocols are either proactive or reactive. OLSR [21] is an example of a uniform proactive protocol, whereas DSR [23] and AODV [30] may be classified as reactive uniform protocols.

2.2.2 Non-Uniform routing protocols

Non-uniform routing protocols may be mainly seen as modifications and extensions of uniform protocols. For instance, the AODV reactive protocol has formed a good basis for the development of non-uniform protocols. In such protocols, nodes in the network are treated differently according to factors such as bandwidth, mobility, location, node degree, battery, and energy, among others. The desire to design such protocols is mainly to decrease the control overhead and broadcasting issues associated with uniform protocols. This forms the basis for more distributed route discovery, and hence helps in overcoming limitations associated with MANETs. Actually, not all nodes have the same ability to respond to messages in the same manner.

In the work described in [32], the authors divided non-uniform protocols into two main subcategories, namely hierarchical and location-based protocols. A hierarchical network structure can be observed in wired multi-level networks. It is mainly important for decreasing the space complexity of routing algorithms and working in a distributed manner. Many efforts have been dedicated to implementing hierarchical networks in MANETs. For instance, HSR, proposed by Pei et al. [29], is a well known hierarchical cluster-based protocol. In this protocol, a network is divided into multi-level clusters based on relationships between its components. Each group has a center which represents the group in different specifications, such as motion, speed, etc. A center acts as the coordinator of transmissions within its cluster. Another cluster-based protocol is the Cluster Based Routing protocol proposed by Jiang et al. [22]. This protocol can be considered a hybrid approach, which combines proactive and reactive protocols. A wireless network is divided into clusters, and a cluster head is elected to deal with routing. One cluster head can communicate with other cluster heads through gateway nodes located intermediately between two or more cluster heads. A multi-agent software system based routing approach proposed by Manvi et al. [27] integrated multi-agent techniques into the process in order to build a backbone structure network. Static and mobile agents were used for the following tasks: (1) collecting information from neighbors, (2) classifying nodes according to specific features, (3) discovering the network, and (4) sending data packets to the desired destination. Their algorithm may be considered a modification of AODV with a multi-cast routing extension.

Location based protocols depend on the geographic locations of nodes. In this kind of protocol, it is assumed that nodes are able to know the locations of other nodes. The low cost and extensive adoption of GPS devices in mobile phones, laptops, and vehicles make such routing protocols more practical. Location Aided Routing (LAR) [24] is one of the most popular position-based protocols. It was mainly proposed for reducing the overhead of broadcasting control messages.

After a node finds the destination location, control messages can be directed towards the located destination node. This decreases bandwidth usage and reduces overhead problems. One LAR extension may be found in the work described in [4], which is based on and outperforms the original AODV protocol. The zone routing protocol is an example of a hybrid hierarchical location based protocol. For instance, the zone-based hierarchical protocol proposed by Haas et al. [19] mainly divides a network into several zones, and then chooses a representative in each zone to act as a router for the zone.

Finally, there are other routing approaches described in the literature. They mainly combine the above mentioned approaches and integrate new ideas that may lead to better handling of the routing problem in MANETs. For instance, the approaches described in the next section integrate machine learning techniques into their proposed solutions in order to effectively handle the routing process in MANETs.

2.3 Machine Learning Based Routing Approaches

MANETs are generally characterized by a highly dynamic behavior that rapidly changes the network configuration over time. In order to adapt to such conditions, machine learning techniques have been adopted in several contexts for handling a variety of MANET aspects, such as data routing [8], intrusion detection [20], and bandwidth and delay management [16], among others.

Some machine learning techniques may be seen as algorithms simulating the behavior of certain creatures, like humans, insects, etc., implemented in a machine environment. In other words, a machine learning algorithm may take inspiration from the behavior of humans, ants, genes, etc., or even some natural concepts, in a way that allows a machine to explore, detect patterns, and make predictions to achieve a certain target or to develop a particular solution in a logical way. Of course, all this depends on the involvement of an expert who is expected to build a model and make it understandable for a machine by developing and implementing the algorithms.

In other words, a machine acquires intelligence under the control and guidance of a human who is mostly an expert, and hence the level of machine intelligence depends on the level of expertise of the expert involved in the process. The authors of the work described in [17] listed some machine learning techniques which could be used for tackling the routing problem in MANETs; these include reinforcement learning, swarm intelligence and mobile agents. Swarm intelligence is a type of machine learning technique inspired by the behavior of ants. The main interest in such protocols is to add a distributed nature to the system. AntNet [10] and AntHocNet [11] are good examples of swarm intelligence based approaches for MANETs. An overview of reinforcement learning is included in the next section to help the reader understand the basic concepts and techniques underlying the routing mechanism proposed in this thesis.

2.3.1 Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning process inspired by the psychological behavior of human beings and animals, who try to avoid punishment and maximize the reward received in response to their actions towards achieving a target in a given environment. In general, an agency is deployed in reinforcement learning where one or more agents explore an environment by transitioning from one state to another by performing actions. Value functions are calculated for each state and updated in an exploration phase to maximize the reward to be received after performing actions which are anticipated to be steps in the right direction towards achieving a goal. The goal is to establish an optimal policy for the agent(s) to exploit in order to maximize their total reward. In reinforcement learning, an environment may be represented as a Markov Decision Process (MDP), which will be introduced in the next section.

Figure 2.2: Q-routing

Boyan et al. [8] proposed a reinforcement learning approach in order to improve network connectivity by controlling packet routing decisions and the mobility of nodes. They tried to optimize the performance of ad hoc networks by using Q-routing (see Figure 2.2), based on Q-learning, a very popular and commonly used reinforcement learning technique. More details on Q-learning are given later in this chapter. Another distributed reinforcement learning approach was developed by Wu et al. [12], who targeted vehicular networks. They proposed an enhancement of the reactive protocol AODV by integrating Q-learning in the process. They considered packets as agents and nodes as states. Further, an agent gets a reward when it reaches the destination and accordingly updates the Q-values. Later, they enhanced their proposed protocol by incorporating fuzzy logic to handle imprecise and uncertain link status information. Their revised approach is described in [38], where they evaluated whether a wireless link is good or not without deriving or describing the underlying mathematical model.

Finally, researchers have considered some other forms of reinforcement learning techniques to handle routing in MANETs; these include approaches like Dual RL (DRQ-routing) [25], TPOT-RL (TPOT-RL-routing) [35] and Collaborative Reinforcement Learning (CRL) [15]. Al-Rawi et al. [31] listed and discussed more reinforcement learning applications to routing in distributed wireless networks.

Researchers have also tried to benefit from other machine learning approaches in a non-distributed manner, where it is assumed that information about the whole network is available. These machine learning approaches include genetic algorithms, neural networks, decision trees, and rule learners [17], among others.

2.4 Reinforcement Learning and Markov Decision Process

In reinforcement learning, agent(s) in an environment must learn by exploring the environment in order to maximize the expected reward. A real life example of deploying reinforcement learning can be realized by watching how a child or a pet learns. A child or a pet tends to do whatever induces reward and avoids actions which may lead to punishment. This way, after enough exploration, they will be able to predict the outcome prior to performing an action. The environment in reinforcement learning is generally modeled as a Markov decision process (MDP).

2.4.1 MDP

A Markov decision process is a 5-tuple mathematical framework (S, A, T, R, γ) which depicts the behavior of an agent in a given environment, where:

S is a set of states.

A is a set of actions that can be performed by agents.

T(s, a, s') is a transition function. It is defined as the probability distribution of landing on state s' when a given agent in state s performs action a.

R(s, a, s') is a reward function which determines the immediate reward received after the transition from state s to state s' based on action a. The reward value may be a positive or a negative real number.

γ is a discount factor, where 0 < γ ≤ 1 represents a discount applied to the reward when landing on the next state.

Figure 2.3: Illustrating policies and values

In an MDP, the outcome depends only on the current state and the associated action. In other words, history is not considered in state transitions, because an MDP follows a Markov process where the future is independent of the history. The target is an optimal policy π* which maximizes the expected reward when followed for each state s and associated action a. The policy concept is illustrated in Figure 2.3. Many algorithms have been proposed to solve an MDP problem and find an optimal policy in terms of dynamic and linear programming when the transition and reward functions are known.

To find an optimal policy, it is required to define the value function for a specific policy first. Assume that V^π(s) is the expected total reward received after executing a policy π starting from state s. It is computed as shown in Equation 2.1:

V^π(s) = E[R(s) + γ(R(s') + γ(R(s'') + ...)) | π, s]    (2.1)

where R(s) is the immediate reward and E[(R(s') + γ(R(s'') + ...)) | π, s] is the expected future reward for being in state s and following policy π. The latter can be written as Σ_{s'∈S} T(s, a, s') V^π(s') to form Bellman's equation as follows:

V^π(s) = R(s) + γ Σ_{s'∈S} T(s, a, s') V^π(s')    (2.2)
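As a concrete illustration of Equation 2.2, the following Java sketch iteratively evaluates V^π for a fixed policy over a small, explicitly given model. The arrays T and R are assumptions standing in for the transition and reward functions of an MDP that is known in advance.

class PolicyEvaluation {
    // Iteratively apply Bellman's equation (2.2) for a fixed policy until V stops changing.
    // T[s][a][s2] is the transition probability, R[s] the immediate reward (both assumed given).
    static double[] evaluate(double[][][] T, double[] R, int[] policy, double gamma) {
        int numStates = R.length;
        double[] V = new double[numStates];          // arbitrary initialization (zeros)
        double maxChange;
        do {
            maxChange = 0.0;
            for (int s = 0; s < numStates; s++) {
                int a = policy[s];                   // action prescribed by the fixed policy
                double expectedFuture = 0.0;
                for (int s2 = 0; s2 < numStates; s2++)
                    expectedFuture += T[s][a][s2] * V[s2];
                double updated = R[s] + gamma * expectedFuture;
                maxChange = Math.max(maxChange, Math.abs(updated - V[s]));
                V[s] = updated;
            }
        } while (maxChange > 1e-9);                  // converged: values stopped changing
        return V;
    }
}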

The optimal value function V*(s) represents the optimal expected total reward. It is computed as:

V*(s) = max_π V^π(s)

Solving for the optimal value function leads to:

V*(s) = R(s) + γ max_a Σ_{s'∈S} T(s, a, s') V*(s')

where R(s) is the immediate reward and γ max_a Σ_{s'∈S} T(s, a, s') V*(s') reflects the best action an agent can choose to maximize the future expected reward. The optimal policy π* is determined as follows:

π* = arg max_a Σ_{s'∈S} T(s, a, s') V*(s')

In the next sections, some dynamic programming (DP) solutions are described for calculating the optimal value function V*(s) and the optimal policy π*, where the problem is broken down into subproblems and the process iterates to find the optimal solution.

2.4.2 Value iteration

Value iteration makes use of Bellman's equation to simultaneously update V(s) in order to converge to V*(s). At every iteration in Algorithm 1, the value function of the current state is evaluated by looking one step forward, starting with an arbitrary value. The value function then updates itself while heading toward the optimal value function until it converges. The process converges when the value functions of all states stop changing. At this stage, the optimal value function V* for each state will have been found. Then, it is possible to substitute in the equation above and find an optimal policy π*. This leads to the best action an agent must take to act optimally.
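Before the formal pseudocode in Algorithm 1 below, here is a minimal Java sketch of the same value-iteration loop, again assuming a small, fully known model; the R(s, a, s') form of the reward function is used, matching Algorithm 1.

class ValueIteration {
    // In-place Bellman optimality updates until V stops changing, then the greedy policy.
    static int[] solve(double[][][] T, double[][][] R, double gamma, double[] V) {
        int S = V.length, A = T[0].length;
        int[] pi = new int[S];
        double maxChange;
        do {
            maxChange = 0.0;
            for (int s = 0; s < S; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < A; a++) {
                    double q = 0.0;                  // one-step lookahead value of (s, a)
                    for (int s2 = 0; s2 < S; s2++)
                        q += T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]);
                    if (q > best) { best = q; pi[s] = a; }
                }
                maxChange = Math.max(maxChange, Math.abs(best - V[s]));
                V[s] = best;
            }
        } while (maxChange > 1e-9);
        return pi;                                   // optimal policy (argmax action per state)
    }
}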

Algorithm 1 Value Iteration

Inputs: MDP model:
  S: set of states
  A: set of all actions
  T(s, a, s'): transition function
  R(s, a, s'): reward function
Outputs:
  V*: optimal value function
  π*: optimal policy

Initialize V(s) arbitrarily
k ← 0
repeat
  k ← k + 1
  for s ∈ S do
    V_k(s) = max_a Σ_{s'∈S} T(s, a, s') (R(s, a, s') + γ V_{k−1}(s'))
  end for
until V_k(s) stops changing
for s ∈ S do
  V*(s) = V(s)
  π*(s) = arg max_a Σ_{s'∈S} T(s, a, s') V*(s')
end for
return V*, π*

2.4.3 Policy iteration

The other standard algorithm to calculate an optimal policy is called policy iteration. By invoking Algorithm 2, the policy π is first initialized randomly. Second, Bellman's equation is solved to obtain value functions for the whole set of states under the current fixed policy π. Then, the policy π is updated according to the calculated value functions. The process repeats over and over until the policy converges. Algorithm 2 simultaneously updates π in order to converge to π*(s). Due to the recursive nature of the value iteration algorithm and the need to solve a linear set of equations in policy iteration, policy iteration works faster than value iteration for a small number of states. In other words, policy iteration may not be a good option to favor in cases where there exists a large set of states.

The main obstacle to using reinforcement learning in MANETs is that these dynamic programming techniques can be used only if an estimate of the reward exists and the transition probability distribution is known.

Algorithm 2 Policy Iteration

Inputs: MDP model:
  S: set of states
  A: set of all actions
  T(s, a, s'): transition function
  R(s, a, s'): reward function
Outputs:
  V*: optimal value function
  π*: optimal policy

Initialize π'(s) arbitrarily
repeat
  π ← π'
  for s ∈ S do
    V^π(s) = Σ_{s'∈S} T(s, π(s), s') (R(s, π(s), s') + γ V^π(s'))
  end for
  for s ∈ S do
    π'(s) = arg max_a Σ_{s'∈S} T(s, a, s') V^π(s')
  end for
until π = π'
return V^π as V*, π as π*

However, having these requirements available is only possible when the investigated system has a known corresponding model. Unfortunately, in MANETs it is not possible to have an idea about the environment to be explored, since it is very dynamic. Therefore, it is preferable to use model-free reinforcement learning techniques, such as Temporal Difference (TD) learning, which still assumes a fixed policy and learns from every experience. In TD, the estimate is initially bad or inaccurate due to insufficient exploration of the environment. But as time passes, the estimate improves with a given learning rate until it converges to the optimal policy. TD learning uses the sampling behavior of the Monte Carlo algorithm. However, unlike Monte Carlo (MC), TD does not need to finish a whole episode in order to estimate the future reward. Instead, it approximates its current estimate based on previously learned estimates (bootstrapping), as is the case in dynamic programming, but without the need to know all states.
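The bootstrapping idea can be illustrated with the simplest TD method, a TD(0) state-value update (a sketch under the same illustrative conventions as before); unlike MC, it is applied after every single observed transition rather than at the end of an episode:

class TdZero {
    // TD(0): move V(s) toward the bootstrapped target r + γ·V(s') with learning rate alpha.
    static void update(double[] V, int s, double r, int sNext, double alpha, double gamma) {
        double target = r + gamma * V[sNext];        // estimate built from a previous estimate
        V[s] += alpha * (target - V[s]);
    }
}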

Figure 2.4: DP vs TD vs MC [34]

The three techniques, namely DP, TD and MC, are illustrated in Figure 2.4.

2.4.4 Q-learning

Q-learning was first introduced by Watkins in 1989 [37]. It is an off-policy, model-free TD reinforcement learning algorithm that does not require any information about the model, which is the network topology in our case. It can be used directly for online learning, as it does not need a model. Q-learning uses value iteration in bootstrapping. But unlike value iteration, Q-learning uses Q-values, which are state-action pairs (see Figure 2.5). Using Q-values in Bellman's equation and sampling over actions helps in updating them without the need for a transition model.

Figure 2.5: Values vs Q-Values

As a reinforcement learning algorithm, Q-learning still uses a given MDP framework (refer to Algorithm 3). Agents can travel from one state to another by performing an action. A reward is assigned to an agent based on the state it reaches. The goal of an agent is to maximize the total reward while traveling through different states. The action-value function values (known as Q-values) are initially set arbitrarily. Every transition to a state after performing an action and observing a reward leads to a Q-value update. The algorithm uses the Q-value iteration update. Q-value updates still make use of Bellman's updates as follows:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]

where r + γ max_{a'} Q(s', a') is the target. The Q-function will be periodically corrected to converge to an optimal function in order to obtain an optimal policy while agents are exploring the environment. The Q-values of each (state, action) pair are stored in a lookup table. After enough exploration, an agent exploits the knowledge learned by choosing the (state, action) pair that maximizes the Q-value. This leads to performing the optimal policy. Unlike the other algorithms mentioned earlier, Q-learning updates the Q-function in the direction of an estimated maximized target, since there is no information about the model and the true value. However, it has still been proven to converge to an optimal policy.
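A minimal Java sketch of this tabular update (the core of Algorithm 3 below) follows; it also makes the scalability problem visible, since the table Q[s][a] must hold one entry per (state, action) pair:

class TabularQLearning {
    final double[][] Q;                              // lookup table of Q-values, Q[s][a]
    final double alpha, gamma;

    TabularQLearning(int numStates, int numActions, double alpha, double gamma) {
        Q = new double[numStates][numActions];       // arbitrary initialization (zeros)
        this.alpha = alpha;
        this.gamma = gamma;
    }

    // Called after performing action a in state s and observing reward r and next state sNext.
    void update(int s, int a, double r, int sNext) {
        double maxNext = Double.NEGATIVE_INFINITY;
        for (double q : Q[sNext]) maxNext = Math.max(maxNext, q);
        Q[s][a] += alpha * (r + gamma * maxNext - Q[s][a]);   // Bellman-style update
    }

    // Exploitation: choose the action with the maximum learned Q-value in state s.
    int greedyAction(int s) {
        int best = 0;
        for (int a = 1; a < Q[s].length; a++)
            if (Q[s][a] > Q[s][best]) best = a;
        return best;
    }
}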

Algorithm 3 Q-learning

Inputs: MDP model:
  S: set of states
  A: set of all actions
  γ: discount factor
  α: learning rate
  r: immediate reward

Initialize Q(s, a) arbitrarily
Observe current state s
repeat
  select action a
  observe reward r and state s'
  Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
  s ← s'
until termination

As the number of states increases, the number of learned Q-values in the Q-tables increases. This exponentially increases the usage of memory resources. Thus, it is no longer practical to use in continuous state spaces. For this reason, function approximation has been adopted to make it possible to apply Q-learning in large and continuous state spaces. The next section elaborates on some function approximation approaches used in Q-learning in order to approximate the Q-function for use in continuous state spaces.

By mapping nodes in MANETs to MDP states, the number of states may become very high and may even increase exponentially as the network grows. To avoid this exponential growth, Wu et al. [38] restrict the state space of the Q-tables to only 1-hop and 2-hop neighbors. In this case, the Q-function is limited to a small component of the network, but the space complexity still grows in dense networks as O(n²), where n is the maximum number of neighbors a node can accept.

2.4.5 Function Approximation

Function approximation is one of the primary concepts studied in machine learning [18], artificial neural networks [33], pattern recognition [7], and many other mathematical and scientific fields.

It may be considered an instance of supervised learning. The main idea in function approximation is to derive a function for approximating an unknown, or hard to cope with, function by hovering around the true value. Approximation theory is widely used in areas where it is required to solve problems whose solution functions are unknown, such as signal processing and other engineering areas [26], as well as in computer science. In reinforcement learning, function approximators are used to approximate value functions by feeding in states or state-action pairs. This relies on states and actions that have already been seen to generalize a function that can deal with states and actions that have never been seen. The function updates itself over and over until it approaches the correct parameter vector that fits and describes the environment well, in order to behave optimally. The approximation process scenarios of value functions (including Q-value functions) are illustrated in Figure 2.6.

Figure 2.6: Value Functions Approximation Scenarios in RL [34]

Many function approximators can be used, such as a linear combination of features, neural networks, decision trees, nearest neighbor, etc. Function approximation can be applied to the action-value function iteration algorithm when there exists a very large set of states and a model of the environment. A small portion of the states can be studied using value iteration. The values calculated for this portion can be used in an attempt to find a least squares fit. This helps in minimizing residual errors and in deriving a linear regression function that estimates the original function.

Gradient descent is used to find the minimum error. Assume the target is to approximate the Q-value using a linear combination of features, where w_i is the weight multiplied by feature value f_i:

Q_est(s, a) = Σ_{i=1..n} w_i f_i

In order to measure how well a given function approximates the Q-function, it is required to measure the total least squares error by assuming the true Q-value Q(s, a) has already been declared. The total squared error is computed as follows:

LS_t = (Q(s, a) − Q_est(s, a))² = (Q(s, a) − Σ_{i=1..n} w_i f_i)²    (2.3)

In order to find the minimal error, it is required to determine the partial derivatives of the least squares equation with respect to the weights (the constant factor is absorbed into the step size):

∂LS_t/∂w_j = −(Q(s, a) − Σ_{i=1..n} w_i f_i) f_j,   j = 1, ..., n

At this stage, the gradient of the LS_t function can be defined by the vector:

∇LS_t = (∂LS_t/∂w_1, ∂LS_t/∂w_2, ..., ∂LS_t/∂w_n)ᵀ

Then, the weights must be adjusted in the direction of the minimum in order to approach the local minimum:

w_j ← w_j + α [Q(s, a) − Σ_{i=1..n} w_i f_i] f_j,   j = 1, ..., n

where α is the gradient descent step size and n is the number of features. Since the least squares equation (2.3) is quadratic, it is guaranteed that the updates move toward the global minimum.

Function approximation can also be applied in an online manner to approximate the Q-function in Q-learning, where no model of the states is known. It deals with the case when the number of states in Q-learning increases with time, and more memory would be needed to handle the additional Q-values calculated for these states. In the update equation, the Q-learning target substitutes for the true value, which cannot be known except when a model exists. Finally, online gradient descent may be utilized to find the minimum error:

w_i ← w_i + α [r + γ max_{a'} Q_est(s', a') − Q_est(s, a)] f_i,   i = 1, ..., n

where r + γ max_{a'} Q_est(s', a') is the target. Applying the unmodified Q-learning approach would be impractical due to the space and time limitations associated with MANETs, which are characterized by dynamic and low-energy nodes. We propose a Q-function approximation based on a linear combination of network features to be used in MANETs. We describe each state using a vector of features that represent its important properties. This way, instead of learning Q-values in a table representation, we learn only the weights used to combine the features. Features must be chosen in a way that decreases the overlap between states that must have different values. The space complexity of the proposed routing algorithm is O(n), where only the weights must be learned and each node only collects values from its direct neighbors. The proposed methodology is described in the next chapter.
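The online update above can be sketched in a few lines of Java. The feature extraction is left abstract here; in our setting, the feature vector would be filled with the network features introduced in Chapter 3. The class below is a sketch of the general technique, not our exact protocol logic.

class LinearQApproximator {
    // Q_est(s, a) = Σ_i w_i · f_i: only the n weights are stored, which is where the O(n)
    // space complexity discussed above comes from.
    final double[] w;
    final double alpha, gamma;

    LinearQApproximator(int numFeatures, double alpha, double gamma) {
        w = new double[numFeatures];
        this.alpha = alpha;
        this.gamma = gamma;
    }

    double qValue(double[] f) {                      // linear combination of features
        double q = 0.0;
        for (int i = 0; i < w.length; i++) q += w[i] * f[i];
        return q;
    }

    // f: features of the (state, action) just taken; nextFeatures: one feature vector per
    // candidate action in the next state, used to form the target r + γ·max_a' Q_est(s', a').
    void update(double[] f, double r, double[][] nextFeatures) {
        double maxNext = (nextFeatures.length == 0) ? 0.0 : Double.NEGATIVE_INFINITY;
        for (double[] fNext : nextFeatures) maxNext = Math.max(maxNext, qValue(fNext));
        double error = r + gamma * maxNext - qValue(f);   // target minus current estimate
        for (int i = 0; i < w.length; i++) w[i] += alpha * error * f[i];  // gradient step
    }
}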

Chapter 3

The Methodology: Tuning Q-Learning by Function Approximation for Better Routing in MANETs

Mobile devices, e.g., phones, watches, tablets, laptops, etc., have become part of our daily life to the extent that it is hard to find someone who is not connecting to a mobile network one way or the other. Indeed, it is very common these days to have mobile devices installed in trucks, cars, devices for remote health monitoring, etc. In fact, following the widespread adoption of social media platforms, people have shifted more towards preferring virtual communication even when they coexist in the same vicinity, and here comes the importance of wireless communication for mobile devices. Some people are even addicted to staying connected. This has stimulated researchers to develop more practical routing protocols for mobile, dynamic and volatile environments. Interest in the area is booming. It is driven by the clear shift from stationary to mobile devices, which provide more flexibility.

The method described in this chapter proposes an effective routing mechanism for mobile ad hoc networks. The proposed routing approach is based on the machine learning technique Q-learning, which has been extensively studied and used in the literature. We have augmented Q-learning by using function approximation, as described in this chapter. First, a motivating scenario is described to highlight attractive characteristics of the proposed protocol. Then, we present the procedures and strategies followed to implement the proposed feature based approximator. We cover all aspects of our method related to the environment, methods, design and implementation.

3.1 From Practical Scenario to Protocol Description

Consider the following scenario, which motivates the routing metric model and protocol described in this chapter. Imagine yourself lost in a maze and searching for the exit. The only two things you know about this situation are: (1) the exit is open and there is wind blowing inside, and (2) you can hear the voice of somebody standing at the exit and calling you. Here, wind and sound are features characterizing this problem. Assume there are alternative routes to choose from in order to reach the exit. To decide on the better route, it is important to consider the two features, namely wind and sound. First, you need to listen carefully to the sound in order to figure out, with the best possible accuracy, the direction from which it is coming. Second, you need to consider the direction from which the wind is blowing. Assume you decided to follow the route from which you sensed a louder voice but a weakly blowing wind, i.e., you assigned a higher weight to sound. However, after walking along this route you hit a dead end. This leads to a punishment for the sound feature. Accordingly, next time you should assign a higher weight to the blowing wind compared to the sound. You should repeat this process, adjusting the punishment and reward values associated with the two mentioned features (wind and sound) each time you go along a route and fail to find the exit. In the end, you will find the best combination of the two features which will lead you to the exit. In this scenario, you are learning on the run without saving any information about the past, because you keep a moving average of the past to predict the future.

The proposed algorithm has been mainly inspired by characteristics of the above described scenario. To make this possible, it is essential to correctly realize the mapping between the two domains. Here, a maze corresponds to a mobile ad hoc network, and routes in a maze correspond to paths in a network. Further, in this analogy, searching for an exit in a maze corresponds to a packet moving towards a target node in a network.

A network has features which correspond to the two features in a maze, namely sound and wind. Actually, it is not necessary to have the same number of features in the two domains, e.g., only two in each domain. Instead, we assume that we are dealing with a set of features in each domain, without any need for the two sets to be of equal size. This way, the characteristics of various routes are inferred in each domain based on alternative combinations of its features. In the two domains, it is important to keep trying new alternative combinations of features by adjusting the weights assigned to the various features until a target is satisfied. In other words, it is possible to fail sometimes and receive punishment while searching for a successful route. Alternatively, a reward is granted and the weights of the features are adjusted accordingly, i.e., the process alternates between exploration and exploitation.

3.2 Building on AODV Protocol

After a careful analysis of the literature, and from the above description, we have realized the need to concentrate on features as the main actors in shaping and distinguishing the new routing protocol. Further, we have decided to build on an existing successful approach instead of developing something from scratch, where we might risk reinventing the wheel. Indeed, reusability has been well emphasized in academia and industry. It encourages concentrating more effort on enhancing new aspects after carefully deciding on existing, working and successful components to be integrated in the new mechanism.

We decided to base our study on the AODV protocol because our careful analysis of the protocols described in the literature revealed AODV as the most popular and widely cited protocol for routing in MANETs. In other words, the protocol proposed in this thesis may be classified as an extension of AODV by integrating Q-function approximation using a linear set of active features of the network. For this purpose, we have formed a feature vector by selecting features which are expected to be reasonably correlated with improvement in the quality of the communication service. Our proposed solution can be built on top of protocols other than AODV.

Actually, other reactive, proactive and even hybrid protocols can benefit from the proposed metric model. This is explicitly discussed in the future work section in Chapter 5. Accordingly, we may simply state that in this thesis we have used AODV as a proof of concept. As we will explain later, the proposed routing algorithm is scalable in the sense that it is possible to add additional features to the already constructed feature vector. In other words, we do not claim that the features included in the current feature vector form the only set of features relevant for the current version of the proposed routing mechanism. It is rather possible to adjust the ranges of features or smoothly add new features which may be believed to be more relevant than existing ones. The latter features may even contribute better to the Q-values which will be learned and will guide the routing decision making process.

It is important to recall that the topology of MANETs is very dynamic. Further, the target is to find a route from a source node to a destination node, keeping in mind that the source has no idea about the destination unless it is in its own neighbors list. To achieve this target, AODV broadcasts route request packets (RREQ) through its neighbors to discover a route to the destination node. When the destination (or a route to the destination) is located, the node realizing this will unicast a route reply packet (RREP) to inform the source node about the discovery. Then a route will be established. However, a discovered route may not remain valid for a long time due to the dynamic nature of MANETs. When a route becomes invalid, route error packets (RERR) will be generated to convey this information to the other nodes along the route. This will initiate a new discovery process.

The main drawback or problem with AODV is the uncontrolled broadcasting of control messages to all neighbors. This leads to high control overhead and an end-to-end delay payoff. In other words, nodes in AODV do not apply any planning for route discovery and do not know how the surrounding environment is changing. Our algorithm overcomes this by adding soft-sensors which empower the protocol with the ability to sense the surrounding environment by considering some relevant features inspired directly from the current network.

This helps in narrowing down the routing scope, and hence reduces the cost associated with expensive global broadcasting. Indeed, this approach is very useful for any application characterized by a dynamic topology that requires general knowledge of the surrounding environment. In the next sections, we will introduce the main features utilized in this study. We will also elaborate on how to extend the feature vector by considering more features which may later be identified as relevant. Finally, we will describe how the feature vector has been integrated in the Q-learning equation.

Figure 3.1: Block diagram of feature based route selection

3.3 Implementing features

As depicted in Figure 3.1, it is possible to discover a more stable and feasible route to the destination by considering some important features that ensure the stability of the route between the source and the destination. The features considered in this study have an important role in describing the network. They will be integrated into the Q-function in order to act in a near-to-optimum manner. Before describing how the features will be used, they are introduced next, divided into four categories.

3.3.1 Availability Based Feature

This feature represents the availability of a node as a good choice for an intermediate node. Availability of a node as sender or receiver depends on the available bandwidth ratio, as described next.

Available Bandwidth Ratio: Available bandwidth is an important feature for detecting the availability of neighbor nodes for transmitting and receiving messages. The source node will try to avoid sending messages through a route where there is network congestion. In this way, we avoid bandwidth bottlenecks and instead spread the load across the network. Available bandwidth is calculated as follows:

$BW_{curr} = \frac{n \cdot S_p}{T}$

$BW_{av} = BW_{max} - BW_{curr}$

where $BW_{curr}$ is the current bandwidth, $T$ is the period for calculating the bandwidth (set to 0.5 s), $n$ is the packet count for every period, and $S_p$ is the size of packets in bits. The available bandwidth factor is then defined as:

$BF = \frac{BW_{av}}{BW_{max}}$
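As a minimal sketch of this computation (the function and parameter names are ours; only the 0.5 s window comes from the text):

```python
def bandwidth_factor(n_packets, packet_size_bits, bw_max, period=0.5):
    """Available bandwidth factor BF = BW_av / BW_max.

    n_packets: packets observed during the measurement period
    packet_size_bits: S_p, packet size in bits
    bw_max: link capacity in bits/s
    period: T, measurement window in seconds (0.5 s in the thesis)
    """
    bw_curr = n_packets * packet_size_bits / period   # BW_curr = n * S_p / T
    bw_av = bw_max - bw_curr                          # BW_av = BW_max - BW_curr
    return bw_av / bw_max                             # BF, in [0, 1] when uncongested
```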

3.3.2 Location based feature

This feature represents the anticipated location of a node in order to predict the most probable route toward the destination.

Node degree ratio: This represents the number of available chains of nodes for a route. A route with more connected nodes indicates a direction in which it is more probable to find the destination.

$NF = \frac{N_{nb}}{N_{max}}$

where $N_{nb}$ is the number of neighbors and $N_{max}$ is the maximum number of neighbors. In the experiments, we take $N_{max}$ to be the whole set of nodes.

3.3.3 Stability based feature

This feature is realized based on network mobility and reflects dynamic changes in a network.

Mobility ratio: This represents the stability of a route and the dynamicity of a network by considering nodes and their neighbors. A less mobile route is a good indicator that the destination is more likely to be found through a static route rather than through a very dynamic route, as expected in a mobile network.

$MF = 1 - \frac{S_{curr}}{S_{max}}$

where $S_{curr}$ is the current speed and $S_{max}$ is the maximum allowed speed.
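A corresponding sketch for the location and stability features, again with illustrative names:

```python
def node_degree_factor(n_neighbors, n_max):
    """NF = N_nb / N_max: more neighbors suggest more candidate chains
    toward the destination (N_max is the network size in the experiments)."""
    return n_neighbors / n_max

def mobility_factor(speed_curr, speed_max):
    """MF = 1 - S_curr / S_max: slower nodes yield more stable routes."""
    return 1.0 - speed_curr / speed_max
```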

3.3.4 Discount Feature

This feature will not be included in the feature vector; instead, it will serve as the discount factor.

Battery/Power Ratio: This represents the amount of energy available at a node. A node with low available power has a higher chance of becoming unavailable in the near future, and hence will not be considered as an intermediate node. This helps in choosing nodes with higher availability over those with lower availability, and hence plays an important role in distributing energy consumption across nodes. The resulting feature vector and discount factor are:

$f_i = (BF, NF, MF)$

$\gamma = PF$

Feature selection is based on what is believed to be correlated with routing quality in a general MANET structure. In VANETs, as an example, battery is not an issue, so the battery factor can be ignored or discarded. Other features might be added, such as radio signal strength. It is even possible to add features related to the specifications of the device itself, which might be helpful because different devices have different specifications. Some features might receive more attention in the equation and hence dominate other values. It is worth mentioning here that non-linearity is an option if it is believed to enhance the approximation. Based on the above, it is obvious that our algorithm is scalable. In the next section, we will elaborate more on how to integrate the feature vector into Q-learning, and how to use Bellman's equation.

3.4 Q learning Approximation

In this section, we will explain how we have used a linear combination of features to approximate Q-values. First, we will introduce the environment as a Markov Decision Process (MDP). Then we will describe the reward model, and how we have used Bellman's equation to update the weights of the used features. Finally, after providing an overview of the exploration and exploitation phases in the proposed protocol, we will introduce the simulation environment.
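Before turning to the MDP formulation, the sketch below collects the pieces: the feature vector holds BF, NF, and MF, while the power ratio acts as the discount. Note that the thesis states $\gamma = PF$ without spelling the ratio out, so the remaining-power fraction used here is an assumption.

```python
def feature_vector(bf, nf, mf):
    """f_i = (BF, NF, MF); the battery/power ratio is deliberately kept
    out of the vector and used as the discount factor instead."""
    return [bf, nf, mf]

def discount_factor(power_curr, power_max):
    """gamma = PF; a remaining-power fraction is assumed for PF here."""
    return power_curr / power_max
```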

3.4.1 The Markov Decision Process Model

As a reinforcement learning based algorithm, the proposed routing protocol models the network as a Markov Decision Process (MDP) where:

Environment: covers the whole ad hoc network.

States: represented by the set of nodes available in the network.

Actions: the neighbor nodes available for the current node to consider as possible next hops.

Step Size: chosen as a dynamic value related to the iteration number in order to attain faster convergence. It is defined as $\alpha = \frac{0.05}{n+1}$, where $n$ is the iteration number. We will elaborate on our choice of $\alpha$ in Chapter 4.

Discount Factor: related to the power factor, $\gamma = PF$, which gives an indication of the stability of a route in terms of battery life. This ensures that a chosen route is more stable when more energy is available.

The Reward Model: The reward depends on the precursor list maintained by each node in the AODV protocol; specifically, the reward is the size of the precursor list. The precursor list of an AODV node, say $n$, is a list of the nodes which are currently routing through node $n$. In other words, node $n$ saves the addresses of nodes which use it as a next hop to reach a desired destination. This is accepted as the reward for improving the weights of the features. As a result, features which have more effect on finding routes through intermediate nodes will be considered more important than other features in the future. The construction and maintenance of the precursor list in AODV proceed as follows:

- If a packet has successfully reached a desired destination, the destination will unicast back a route reply packet (RREP). Each node receiving a RREP message will update and add the route to its precursor list.

- A route will be deleted from the precursor list when a related route error packet (RERR) is received by the current node.

- A route $e$ will be deleted from the precursor list if the current node does not hear about route $e$ during the ACTIVE_ROUTE_TIMEOUT period; exceeding this time period indicates that the route is not fresh anymore.

Accordingly, the reward evolves as:

$R(s) = \begin{cases} R + 1 & \text{RREP received} \\ R - 1 & \text{RERR received} \\ R - 1 & t > \text{TIMEOUT} \end{cases}$
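A compact restatement of this reward rule as code (the event names and the timeout default are illustrative; AODV's ACTIVE_ROUTE_TIMEOUT is configurable):

```python
def update_reward(reward, event=None, elapsed=0.0, active_route_timeout=3.0):
    """Precursor-list driven reward: grow on RREP, shrink on RERR or
    when a route goes stale without being refreshed."""
    if event == "RREP":
        return reward + 1
    if event == "RERR" or elapsed > active_route_timeout:
        return reward - 1
    return reward
```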

3.4.2 Updating Q-values

In order to apply Q-learning in the routing process, we follow two phases, namely exploration and exploitation, as explained in the sequel.

Exploration

Exploration Distribution: Exploration may be considered the only choice toward convergence. This is true since there is no previous information about the environment because the model is missing. Accordingly, exploring the environment will enhance reward estimation while heading towards the optimal policy. In our case, exploration occurs during the packet receiving stage. In feature approximation, an agent should know the features of the next observable state as well as the maximum Q-value over the actions of the next state. This information is essential to update the Q-value of the current state-action pair. It is possible to map nodes to actions instead of having them as states. This is true because, as illustrated in Figure 3.2, in our case neighbors are actions which are considered as next hops to forward packets to, but they play the states' role as well.

Figure 3.2: Illustrating how State/Action == Action/State

This means the following: every node (action/state) will calculate its Q-value based on the maximum Q-values received from its different neighbor nodes. Actually, a node first updates the weights of its own features and calculates its Q-value Q(*, n). Then it passes its Q-value to its neighbors, which will update their own weights accordingly. As a result, it can easily be seen that the proposed algorithm is distributed. Updating Q-values occurs while receiving HELLO and RREQ messages; both packets carry the Q-values of neighbor nodes, which are used in the update process. As mentioned above, we approximate the Q-learning equation in real time using a linear set of features. Q-values are computed by multiplying weights with features, as shown in the following equation:

$Q(s, a) = w_1 \cdot BF + w_2 \cdot NF + w_3 \cdot MF$
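In code, this linear approximation is a plain dot product; a one-line sketch with made-up weights and feature values:

```python
def q_value(weights, features):
    """Q(s, a) = w1*BF + w2*NF + w3*MF: the dot product of the learned
    weights with the feature vector of the (state, action) pair."""
    return sum(w * f for w, f in zip(weights, features))

# Example: hypothetical learned weights and measured features for one neighbor.
q = q_value([0.4, 0.3, 0.3], [0.8, 0.25, 0.9])   # 0.32 + 0.075 + 0.27 = 0.665
```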

We continue to make use of Bellman's equation in Q-learning for updating the weight $w_i$ of each feature. This reflects the case where we minimize the error in the direction of our estimate of the true value. Updating the weights is actually stochastic gradient descent applied to the squared error of the linear equation proposed for estimating the Q-value. The update may be expressed as follows:

$w_i \leftarrow w_i + \alpha \, [\, \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}} - Q(s, a) \,] \, f_i$

where $w_i$ is the weight of feature $i$. A feature weight is updated by moving toward the expected total reward, minimizing the error rate. This way, only one Q-value is updated per node instead of updating an array of actions (neighbor nodes), where each action has its own Q-value. Each node still needs to hold an array of Q-values corresponding to each action, i.e., neighbor. In the above equation, the term $r + \gamma \max_{a'} Q(s', a')$ represents the target that we plan to achieve. Each time, we move closer to this target by averaging over what we have already learned. This way, the expectation of reward improves as we explore the environment, until we reach convergence. As a result, the weight of every feature identified as unimportant will decrease, while the weights of important features will increase. Finally, the proposed approach is scalable because it is possible to add more features in the future when facing a more complicated system, for example to account for radio noise, signal strength, or the type of the device.

The Q-value of the current action is attached to the HELLO and RREQ messages sent to its neighbors, i.e., states. When a node receives a HELLO/RREQ message, it updates its precursor list as in normal AODV. In addition, the receiving node maintains a Q-values table for its neighbors by adding each Q-value to represent being in a state (current node) and taking an action (next hop). The size of the Q-values table is dynamic and depends on the number of neighbors of the current node. Finally, the exploration process is handled by Algorithm 4, and the Q-values table is illustrated in Figure 3.3.

Figure 3.3: Example Q-values tables

Algorithm 4: Function Approximation - Exploration Algorithm

Inputs: N is the set of neighbor nodes; γ is the discount factor; α is the step size; r is the size of the precursor list; F is the set of features; Q_N is the set of neighbor Q-values.

    Q ← 0; n ← 0
    repeat
        observe reward r
        α ← 0.05 / (n + 1)
        Q_max ← max of Q_N
        Q_temp ← 0
        for all f_i ∈ F do
            w_i ← w_i + α [r + γ Q_max − Q] f_i
            Q_temp ← Q_temp + w_i f_i
        end for
        Q ← Q_temp
        n ← n + 1
    until termination

Exploitation

We have conducted one exploitation scenario in an attempt to enhance AODV; other exploitation scenarios are proposed as part of the future work section in Chapter 5. In our exploitation scenario, the action with the maximum Q-value is chosen greedily as the next hop while balancing between exploration and greedy exploitation. To achieve this, we calculate a new value called Appreciate, which acts as the probability of greedy exploitation. We define the Appreciate function as follows:

$Appreciate = \frac{\sum_{i=1}^{N}[Q_{max} - Q_i]}{(N-1) \cdot Q_{max}}$

where $N$ is the total number of neighbor nodes and $Q_{max}$ is the maximum Q-value. An Appreciate value ranges between 0 and 1: 0 is returned when all Q-values are the same, and the value approaches 1 in case of high variation between the Q-values.
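The following sketch implements the Appreciate computation and the resulting greedy-or-explore choice. The neighbor_qs mapping and the fallback to plain AODV via None are our illustrative choices, not the thesis code.

```python
import random

def appreciate(q_values):
    """Appreciate = sum(Q_max - Q_i) / ((N - 1) * Q_max); returns 0 when
    all neighbor Q-values are equal, approaching 1 under high variation."""
    n = len(q_values)
    q_max = max(q_values)
    if n < 2 or q_max == 0:
        return 0.0          # degenerate cases: nothing to discriminate between
    return sum(q_max - q for q in q_values) / ((n - 1) * q_max)

def choose_next_hop(neighbor_qs):
    """With probability Appreciate, greedily pick the neighbor with the
    maximum Q-value; otherwise return None so plain AODV discovery runs.
    neighbor_qs maps neighbor id -> Q-value (illustrative structure)."""
    if neighbor_qs and random.random() < appreciate(list(neighbor_qs.values())):
        return max(neighbor_qs, key=neighbor_qs.get)   # greedy exploitation
    return None                                        # explore via normal AODV
```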

In the next section, we describe a demo to illustrate the exploration and exploitation phases.

3.4.3 Illustrative Demo

In this section, we present an example to illustrate some aspects of the Q-learning process as described in this chapter. We consider a network which is static and not mobile, in an attempt to make it easier to follow the learning process from a one-dimensional point of view. When the network changes, the same procedure is applied to consecutive snapshots of the new topology by treating each snapshot as a static topology. In other words, we assume each snapshot is a static network which lasts only for a small specific period. Convergence should be achieved for a better estimation of the reward. We will elaborate on convergence in the next chapter; for now, let us assume that convergence will occur at some point during the process.

Figure 3.4: An example MANET network

Consider the MANET network shown in Figure 3.4. This topology was generated randomly using SWANS++.

Here, Q-values will be updated based on the sent RREQ and HELLO messages. Figure 3.5 shows the communication process between nodes in the network during the exploration phase.

Figure 3.5: Illustrating the exploration approach

At the very beginning, and in order to attain convergence, the nodes in the network must explore. When no Q-values are present, and based on the outcome of the Appreciate function, original AODV is invoked in the route request and message delivery process, as shown in Figure 3.5. Exploration is applied for some time, and then the system switches to exploitation.

Figure 3.6: Illustrating the exploitation approach

To understand the exploitation process, assume node 1 wants to send a packet to node 7. Here, the Appreciate greedy approach will be employed. If the value of Appreciate is small
