Novel Function Approximation Techniques for Large-scale Reinforcement Learning


Novel Function Approximation Techniques for Large-scale Reinforcement Learning

A Dissertation by Cheng Wu submitted to the Graduate School of Engineering in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the field of Computer Engineering

Advisor: Prof. Waleed Meleis

Northeastern University
Boston, Massachusetts
April 2010

© Copyright May 2010 by Cheng Wu. All Rights Reserved.

NORTHEASTERN UNIVERSITY
Graduate School of Engineering

Thesis Title: Novel Function Approximation Techniques for Large-scale Reinforcement Learning
Author: Cheng Wu
Program: Computer Engineering

Approved for Dissertation Requirements of the Doctor of Philosophy Degree:

Thesis Advisor: Waleed Meleis    Date
Thesis Reader: Jennifer Dy    Date
Thesis Reader: Javed A. Aslam    Date
Chairman of Department:    Date

Graduate School Notified of Acceptance:
Director of the Graduate School:    Date

Contents

1 Introduction
  1.1 Reinforcement Learning
  1.2 Function Approximation
    1.2.1 Function Approximation Using Natural Features
    1.2.2 Function Approximation Using Basis Functions
    1.2.3 Function Approximation Using SDM
  1.3 Our Application Domain
  1.4 Dissertation Outline
2 Adaptive Function Approximation
  2.1 Experimental Evaluation: Traditional Function Approximation
    2.1.1 Application Instances
    2.1.2 Performance Evaluation of Traditional Tile Coding
    2.1.3 Performance Evaluation of Traditional Kanerva Coding
  2.2 Visit Frequency and Feature Distribution
  2.3 Adaptive Mechanism in Kanerva-Based Function Approximation
    2.3.1 Prototype Deletion and Generation
    2.3.2 Performance Evaluation of Adaptive Kanerva-Based Function Approximation
  2.4 Summary
3 Fuzzy Logic-based Function Approximation
  3.1 Experimental Evaluation: Kanerva Coding Applied to Hard Instances
  3.2 Prototype Collisions in Kanerva Coding
  3.3 Adaptive Fuzzy Kanerva Coding
    3.3.1 Fuzzy and Adaptive Mechanism
    3.3.2 Adaptive Fuzzy Kanerva Coding Algorithm
    3.3.3 Performance Evaluation of Adaptive Fuzzy Kanerva-Based Function Approximation
  3.4 Prototype Tuning
    3.4.1 Experimental Evaluation: Similarity Analysis of Membership Vectors
    3.4.2 Tuning Mechanism
    3.4.3 Performance Evaluation of Tuning Mechanism
  3.5 Summary
4 Rough Sets-based Function Approximation
  4.1 Experimental Evaluation: Effect of Varying Number of Prototypes
  4.2 Rough Sets and Kanerva Coding
  4.3 Rough Sets-based Kanerva Coding
    4.3.1 Prototype Deletion and Generation
    4.3.2 Rough Sets-based Kanerva Coding Algorithm
    4.3.3 Performance Evaluation of Rough Sets-based Kanerva Coding
    4.3.4 Effect of Varying the Number of Initial Prototypes
  4.4 Summary
5 Real-world Application: Cognitive Radio Network
  5.1 Introduction
  5.2 Reinforcement Learning-Based Cognitive Radio
    5.2.1 Problem Formulation
    5.2.2 Application to cognitive radio
  5.3 Experimental Simulation
    5.3.1 Simulation Setup
    5.3.2 Simulation Evaluation
  5.4 Function Approximation for RL-based Cognitive Radio
  5.5 Summary
6 Conclusion
Bibliography

List of Figures

2.1 The grid world of size 32 x 32
2.2 The implementation of the tiling
2.3 The fraction of test instances solved by Q-Learning with traditional Tile Coding with 2000 tiles
2.4 The implementation of Kanerva Coding
2.5 The fraction of test instances solved by Q-Learning with traditional Kanerva Coding with 2000 prototypes
2.6 The frequency distribution of visits to tiles over a sample run using Q-learning with Tile Coding
2.7 The frequency distribution of visits to prototypes over a sample run using Q-learning with Kanerva Coding
2.8 The fraction of test instances solved by Q-Learning with adaptive Kanerva Coding with 2000 prototypes
2.9 The frequency distribution of visits to prototypes over a sample run using Q-learning with adaptive Kanerva Coding
3.1 The fraction of easy and hard test instances solved by Q-learning with adaptive Kanerva Coding with 2000 prototypes
3.2 The illustration of prototype collision: (a) adjacent to no prototype; (b) adjacent to an identical prototype set; (c) adjacent to unique prototype vectors
3.3 Prototype collisions using traditional and adaptive Kanerva-based function approximation with 2000 prototypes
3.4 Average fraction of test instances solved (solution rate) ((a) 8x8 grid; (c) 16x16 grid; (e) 32x32 grid), and the fraction of state-action pairs that are adjacent to no prototypes or adjacent to identical prototype vectors (collision rate) ((b) 8x8 grid; (d) 16x16 grid; (f) 32x32 grid), for traditional and adaptive Kanerva-based function approximation as the number of prototypes varies from 300 to 2500
3.5 Sample membership function for traditional Kanerva Coding
3.6 Sample membership function for fuzzy Kanerva Coding
3.7 Average solution rate for adaptive fuzzy Kanerva Coding with 2000 prototypes
3.8 (a) Distribution of membership grades and (b) prototype similarity across sorted prototypes
3.9 Illustration of the similarity of membership vectors across sparse and dense prototype regions
3.10 Average solution rate for adaptive fuzzy Kanerva Coding with Tuning using 2000 prototypes
3.11 The four-room gridworld
3.12 Average solution rate for adaptive Fuzzy Kanerva Coding with Tuning in the four-room gridworld of size 32x32
3.13 The fraction of hard test instances solved by Q-learning with adaptive Kanerva Coding as the number of prototypes decreases
4.1 Illustration of equivalence classes of the sample
4.2 The fraction of equivalence classes that contain two or more state-action pairs over all equivalence classes (the conflict rate), and its corresponding solution rate and collision rate, using traditional Kanerva and adaptive Kanerva with frequency-based prototype optimization across all sizes of grids
4.3 The fraction of prototypes remaining after performing a prototype reduct using traditional and optimized Kanerva-based function approximation with 2000 prototypes. The original and final number of prototypes is shown on each bar
4.4 Average solution rate for traditional Kanerva, adaptive Kanerva and rough sets-based Kanerva: (a) 8x8 grid; (b) 16x16 grid; (c) 32x32 grid
4.5 Effect of rough sets-based Kanerva on the number of prototypes and the fraction of equivalence classes: (a) 8x8 grid; (b) 16x16 grid; (c) 32x32 grid
4.6 Variation in the number of prototypes with different numbers of initial prototypes with rough sets-based Kanerva in a 16x16 grid
5.1 The CR ad hoc architecture
5.2 The cognitive radio cycle for the CR ad hoc architecture
5.3 Multi-agent reinforcement learning based cognitive radio
5.4 Comparative reward levels for different observed scenarios
5.5 Block diagram of the implemented simulator tool for reinforcement learning based cognitive radio
5.6 The performance of the small topology
5.7 The performance of the real-world topology with five different node densities
5.8 Average probability of successful transmission for the real-world topology with 500 nodes

List of Tables

2.1 The average fraction of test instances solved by Q-learning with traditional Tile Coding
2.2 The average fraction of test instances solved by Q-learning with traditional Kanerva Coding
2.3 The average fraction of test instances solved by Q-learning with adaptive Kanerva Coding
3.1 The average fraction of test instances solved by Q-learning with adaptive Kanerva Coding
3.2 The average fraction of test instances solved by Q-Learning with adaptive fuzzy Kanerva Coding
4.1 The average fraction of test instances solved by Q-Learning with adaptive Kanerva Coding
4.2 Sample of adjacency between state-action pairs and prototypes
4.3 Percentage improved performance of rough sets-based Kanerva over adaptive Kanerva

Abstract

Function approximation can be used to improve the performance of reinforcement learners. Traditional techniques, including Tile Coding and Kanerva Coding, can give poor performance when applied to large-scale problems. In our preliminary work, we show that this poor performance is caused by prototype collisions and uneven prototype visit frequency distributions. We describe our adaptive Kanerva-based function approximation algorithm, based on dynamic prototype allocation and adaptation. We show that probabilistic prototype deletion with prototype splitting can make the distribution of visit frequencies more uniform, and that dynamic prototype allocation and adaptation can reduce prototype collisions. This approach can significantly improve the performance of a reinforcement learner.

We then show that fuzzy Kanerva-based function approximation can reduce the similarity between the membership vectors of state-action pairs, giving even better results. We use Maximum Likelihood Estimation to adjust the variances of basis functions and tune the receptive fields of prototypes. This approach completely eliminates prototype collisions, and greatly improves the ability of a Kanerva-based reinforcement learner to solve large-scale problems.

Since the number of prototypes remains hard to select, we describe a more effective approach for adaptively selecting the number of prototypes. Our new rough sets-based Kanerva function approximation uses rough sets theory to explain how prototype

collisions occur. Our algorithm eliminates unnecessary prototypes by replacing the original prototype set with its reduct, and reduces prototype collisions by splitting equivalence classes with two or more state-action pairs. The approach can adaptively select an effective number of prototypes and greatly improve a Kanerva-based reinforcement learner's ability to solve large-scale problems.

Finally, we apply function approximation techniques to scale up the ability of reinforcement learners to solve a real-world application: spectrum management in cognitive radio networks. We show that a multi-agent reinforcement learning approach with decentralized control can be used to select transmission parameters and enable efficient assignment of spectrum and transmit powers. However, the requirement of RL-based approaches that an estimated value be stored for every state greatly limits the size and complexity of the CR networks that can be solved. We show that function approximation can reduce the memory used for large networks with little loss of performance. We conclude that our spectrum management approach based on reinforcement learning with Kanerva-based function approximation can significantly reduce interference to licensed users, while maintaining a high probability of successful transmission in a cognitive radio ad hoc network.

Chapter 1

Introduction

Machine learning, a field of artificial intelligence, can be used to solve search problems using prior knowledge, known experience and data. Many powerful computational and statistical paradigms have been developed, including supervised learning, unsupervised learning, trial-and-error learning and reinforcement learning. However, machine learning techniques can struggle to solve large-scale problems with huge state and action spaces [12]. Various solutions to this problem have been studied, such as dimensionality reduction [2], principal component analysis [22], support vector machines [14], and function approximation [17].

Reinforcement learning, one of the most successful machine learning paradigms, enables learning from feedback received through interactions with an external environment. Like the other paradigms for machine learning, a key drawback of reinforcement learning is that it only works well for small problems, and performs poorly for large-scale problems [43].

Function approximation [17, 40] is a technique for resolving this problem within the context of reinforcement learning. Instead of using a look-up table to directly store the values of points within the state and action space, it uses examples of the desired value function to reconstruct an approximation of this function, and computes an estimate of the desired value from the approximation function. In many function approximation techniques, a complex parametric approximation architecture is used to compute good estimates of the desired value function [34]. An approximation architecture is a computational structure that uses parametric functions to approximate the value of a state or state-action pair. Using a simple approximation architecture design often makes estimates diverge from the desired value function, and makes agents perform inefficiently [10]. Unfortunately, a complex parametric architecture may also greatly increase the computational complexity of the function approximator itself [11]. Furthermore, large-scale problems can remain hard to solve in practice, even when a complex architecture is applied.

The key to a successful function approximator is not only the choice of the parametric approximation architecture, but also the choices of various control parameters under this architecture. Until recently, these choices were typically made manually, based only on the designer's intuition [11, 27].

In this dissertation we address the issue of solving large-scale, high-dimension problems using reinforcement learning with function approximation. We propose to develop a novel parametric approximation architecture and corresponding parameter-tuning methods for achieving better learning performance. This framework should satisfy several criteria:

(1) it should give accurate approximation, (2) the approximation should be local, that is, appropriate for a specific learning problem, (3) the parameters should be selected automatically, and (4) it should learn online. We first review related work on reinforcement learning and function approximation, describe their characteristics and limitations, and give examples of their operation.

1.1 Reinforcement Learning

Reinforcement learning is inspired by psychological learning theory from biology [46]. The general idea is that within an environment, a learning agent attempts to perform optimal actions to maximize long-term rewards achieved by interacting with the environment. An environment is a model of a specific problem domain, typically formulated as a Markov Decision Process (MDP) [32]. A state is some information that an agent can perceive within the environment. An action is the behavior of an agent at a specific time in a specific state. A reward is a measure of the desirability of an agent's action at a specific state within the environment.

The classic reinforcement learning algorithm is formulated as follows. At each time $t$, the agent perceives its current state $s_t \in S$ and the set of possible actions $A_{s_t}$. The agent chooses an action $a \in A_{s_t}$ and receives from the environment a new state $s_{t+1}$ and a reward $r_{t+1}$. Based on these interactions, the reinforcement learning agent must develop a policy $\pi : S \rightarrow A$ which maximizes the long-term reward $R = \sum_t \gamma^t r_t$ for MDPs, where $0 \leq \gamma \leq 1$ is a discounting factor for subsequent rewards. The long-term reward is the expected accumulated reward for the policy.
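To make this interaction loop concrete, the following minimal sketch shows an agent accumulating the discounted long-term reward $R = \sum_t \gamma^t r_t$ over one episode. The `env` and `policy` interfaces are hypothetical stand-ins for an MDP environment and a learned policy, not part of any implementation described in this dissertation.

```python
def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Run one episode and return the discounted return R = sum_t gamma^t * r_t.

    Assumed (hypothetical) interfaces: env.reset() returns an initial state,
    env.actions(s) returns the action set A_s, env.step(s, a) returns
    (next_state, reward, done), and policy(s, actions) picks an action.
    """
    state = env.reset()
    discounted_return = 0.0
    discount = 1.0                 # gamma^t, starting at t = 0
    for _ in range(max_steps):
        action = policy(state, env.actions(state))
        state, reward, done = env.step(state, action)
        discounted_return += discount * reward
        discount *= gamma          # each later reward is discounted more
        if done:
            break
    return discounted_return
```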

This implementation of reinforcement learning embodies three important characteristics: a human-like learning framework, the concept of a value function, and online learning. These three characteristics distinguish reinforcement learning from other machine learning paradigms, but they can also limit its effectiveness.

The human-like framework defines the interaction between agents and the external environment in terms of states, actions and rewards, which allows reinforcement learning to solve the types of problems solved by humans. These types of problems tend to involve a very large number of states and actions. Unfortunately, the performance of reinforcement learning is very sensitive to the number of states and actions.

A value function is a function which specifies the accumulated reward that an agent expects to receive in the future. While a reward determines the immediate, short-term value of an action in the current state, a value function gives the expected accumulated, long-term value of an action under subsequent states. The concept of a value function distinguishes reinforcement learning from evolutionary methods [9, 7, 15]. Instead of directly searching the entire policy space by evolutionary evaluation, a value function evaluates an action's desirability at the current state by accumulating delayed rewards. In reinforcement learning, the accuracy and efficiency of the value function is closely related to the performance of the reinforcement learner.

In an online learning system, learning and the evaluation of the learning system occur concurrently. However, in order to maintain this concurrency, a reinforcement learner must

compute the value of a state-action pair as fast as possible. For a large state-action space, storing the state-action values may require a large amount of memory that may not be available. Reducing the size of this table is therefore necessary.

One of the most successful reinforcement learning algorithms is Q-learning [47]. This approach uses a simple value iteration update process. At time $t$, for each state $s_t$ and each action $a_t$, the algorithm calculates an update to its expected discounted reward $Q(s_t, a_t)$ as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t(s_t, a_t)\left[r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right],$$

where $r_t$ is an immediate reward at time $t$, $\alpha_t(s, a)$ is the learning rate such that $0 \leq \alpha_t(s, a) \leq 1$, and $\gamma$ is the discount factor such that $0 \leq \gamma < 1$.

Q-learning stores the state-action values in a table. The requirement that an estimated value be stored for every state-action pair limits the size and complexity of the learning problems that can be solved. The Q-learning table is typically large because of the high dimensionality of the state-action space, or because the state or action space is continuous. Function approximation [10], which stores an approximation of the entire table, is one way to solve this problem.
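As an illustration, the sketch below implements the tabular Q-learning backup above; the default step size and discount factor are illustrative choices. The dictionary-based table makes the scaling problem visible: one entry must exist for every distinct state-action pair.

```python
from collections import defaultdict

# One entry per state-action pair; this table is exactly what becomes
# infeasible for large or continuous state-action spaces.
Q = defaultdict(float)

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.8, gamma=0.9):
    """One backup: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```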

1.2 Function Approximation

Most reinforcement learners use a tabular representation of value functions, where the value of each state or each state-action pair is stored in a table. However, for many practical applications that have a continuous state space, or very large, high-dimension discrete state and action spaces, this approach is not feasible. There are two explanations for this infeasibility. First, a tabular representation can only be used to solve tasks with a small number of states and actions. The difficulty derives both from the memory needed to store large tables, and the time and data needed to fill the tables accurately [40]. Second, most state-action pairs encountered may not have been previously encountered. Since there are often no stored state-action values that can be used to distinguish actions, the only way to learn in these problems is to generalize from previously encountered state-action pairs to pairs that have never been visited before. We must consider how to use a limited state-action subspace to approximate a large state-action space.

Function approximation has been widely used to solve reinforcement learning problems with large state and action spaces [20, 17, 34]. In general, function approximation defines an approximation method which interpolates the values of unvisited points in the search space using known values at neighboring points. Within a reinforcement learner, function approximation generalizes the function values of state-action pairs that have not been previously visited from the known function values of neighboring state-action pairs.

A typical implementation of function approximation uses linear gradient descent [45]. In this method, the approximate value function of state-action pair $sa$, denoted $V(sa)$, is a linear function of the parameter vector, denoted $\theta$. The approximate value function is then

$$V(sa) = \theta^T \phi_{sa} = \sum_{i=1}^{n} \theta(i)\phi_{sa}(i),$$

where $\phi_{sa} = (\phi_{sa}(1), \phi_{sa}(2), \ldots, \phi_{sa}(n))$ is a vector of features with the same number of elements as $\theta$. This approximation can also be seen as a projection of the multidimensional

state-action space to a feature space with few dimensions. The parameter vector is a vector with real-valued elements, $\theta = (\theta(1), \theta(2), \ldots, \theta(n))$, and $V(sa)$ is a smooth differentiable function of $\theta$ for all state-action pairs $sa \in SA$. We assume that at each step $t$, we observe a new state-action pair $sa_t$ with reward $v_t$. The parameter vector is adjusted by a small amount in the direction that would most reduce the mean squared error for that state-action pair:

$$\theta_{t+1} = \theta_t + \alpha[v_t - V(sa_t)]\nabla_{\theta_t} V(sa_t),$$

where $\alpha$ is a positive step-size parameter, and $\nabla_{\theta_t} V(sa_t)$ is the vector of partial derivatives $\left(\frac{\partial V(sa_t)}{\partial \theta_t(1)}, \frac{\partial V(sa_t)}{\partial \theta_t(2)}, \ldots, \frac{\partial V(sa_t)}{\partial \theta_t(n)}\right)$. This derivative vector is the gradient of $V(sa_t)$ with respect to $\theta_t$. An advantage of this approach is that the change in $\theta_t$ is proportional to the gradient of the squared error of the encountered state-action pair, the direction in which the error decreases most rapidly.
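The sketch below illustrates this update for the linear case. Because $V(sa) = \theta^T \phi_{sa}$ is linear in $\theta$, the gradient $\nabla_{\theta} V(sa)$ is simply the feature vector $\phi_{sa}$, so the update reduces to adding a scaled copy of the features to the parameter vector. The feature values and step size here are illustrative placeholders.

```python
import numpy as np

def linear_fa_update(theta, phi_sa, v_target, alpha=0.1):
    """theta <- theta + alpha * [v - V(sa)] * grad V(sa); for a linear
    approximator V(sa) = theta . phi_sa, the gradient is just phi_sa."""
    v_hat = float(np.dot(theta, phi_sa))      # current estimate V(sa)
    theta += alpha * (v_target - v_hat) * phi_sa
    return theta

# Example with n = 5 features and one observed pair with target value 1.0.
theta = np.zeros(5)
phi = np.array([1.0, 0.0, 1.0, 0.0, 0.0])     # binary feature vector phi_sa
theta = linear_fa_update(theta, phi, v_target=1.0)
```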

This implementation of function approximation has two important characteristics that affect its behavior. First, the approximate value function is a linear function of the features, and the choice of features has a direct effect on the accuracy of the approximate representation. Within the context of reinforcement learning, the value of a state-action pair that has not been previously encountered can be generalized from these pre-selected features. However, the great diversity of potential types of features can make feature selection difficult. Second, the approximate value function is actually a projection from the large target space to a limited feature space, and the completeness of the projection depends on the shape and size of the receptive regions of the features. Within the context of reinforcement learning, a large state-action space can be spanned by the receptive regions of a set of features. Features with large regions can give wide generalization, but might make the representation of the approximation function coarser and perform only rough discrimination. Features with small regions can give narrow generalization, but might cause many states to be outside the receptive regions of all features. Selecting the shape and size of the receptive regions is often difficult for particular application domains.

A range of function approximation techniques has been studied in recent years. These techniques can be partitioned into three types, according to the two characteristics described above: function approximation using natural features, function approximation using basis functions, and function approximation using Sparse Distributed Memory (SDM).

1.2.1 Function Approximation Using Natural Features

For each application domain, there are natural features that can describe a state. For example, in some pursuit problems in the grid world, we might have features for location, vision scale, memory size, communication mechanisms, etc. Choosing such natural features as the components of the feature vector is an important way to add prior knowledge to a function approximator.

In function approximation using natural features, the θ-value of a feature indicates whether the feature is present. The θ-value is constant across the feature's receptive region and falls sharply to zero at the boundary. These receptive regions may overlap. A large region may give a wide but coarse generalization, while a small region may give a narrow but fine generalization.

The advantage of function approximation using natural features is that the representation of the approximate function is simple and easy to understand. The natural features can be selected manually, and their receptive regions can be adjusted based on the designer's intuition. A limitation of this technique is that it cannot easily handle continuous state-action spaces or state-action spaces with high dimensionality. For natural-feature-based function approximation techniques, the number of features has the largest effect on the discrimination ability of the approximate function. Increasing the number of features gives finer discrimination of the state-action space, but may also increase the computational complexity of the algorithm. In general, more features are needed to accurately approximate continuous state-action spaces and state-action spaces with high dimensionality, and the number of needed features grows exponentially with the number of dimensions in the state-action space [34].

A typical function approximation technique using natural features is Tile Coding [6]. This approach, which is an extension of coarse coding [20], is also known as the Cerebellar Model Articulator Controller, or CMAC [6]. In Tile Coding, $k$ tilings are selected, each of which partitions the state-action space into tiles. The receptive field of each feature corresponds to a tile, and a θ-value is maintained for each tile. A state-action pair $p$ is adjacent to a tile if the receptive field of the tile includes $p$. The Q-value of a state-action pair is equal to the sum of the θ-values of all adjacent tiles. In binary Tile Coding, which is used when the state-action space consists of discrete values, each tiling corresponds to a subset of the bit positions in the state-action space and each tile corresponds to an assignment of binary values to the selected bit positions.
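A minimal sketch of binary Tile Coding as just described: each tiling is a random tuple of bit positions, each tile is an assignment of values to those positions, and the Q-value of a state-action pair is the sum of the θ-values of its active tiles. The data layout (a dictionary keyed by tiling index and bit values) is one plausible choice, not the dissertation's exact implementation.

```python
import random

def make_tilings(num_bits, num_tilings, k=3, seed=0):
    """Each tiling is a k-tuple of bit positions selected from the binary
    representation of a state-action pair."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(num_bits), k)) for _ in range(num_tilings)]

def active_tiles(sa_bits, tilings):
    """One tile per tiling is active: the assignment of the selected bits."""
    return [(i, tuple(sa_bits[p] for p in positions))
            for i, positions in enumerate(tilings)]

def q_value(sa_bits, tilings, theta):
    """Q-value = sum of theta-values of all adjacent (active) tiles."""
    return sum(theta.get(tile, 0.0) for tile in active_tiles(sa_bits, tilings))

# Example: a 12-bit state-action encoding, 4 tilings, all theta-values zero.
tilings = make_tilings(num_bits=12, num_tilings=4)
print(q_value([0, 1] * 6, tilings, theta={}))
```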

1.2.2 Function Approximation Using Basis Functions

For certain problems, a more accurate approximation is obtained if θ-values can vary continuously and represent the degree to which a feature is present. A basis function can be used to compute such continuously varying θ-values. Basis functions can be designed manually, and the approximate value function is a function of these basis functions. In this case, function approximation uses basis functions to evaluate the presence of every feature, then linearly weights these values.

In function approximation with basis functions, the receptive region of a feature depends on the parameters of the basis function of that feature. These parameters can control the size, shape and intensity of the receptive region. In general, the θ-value of a feature can vary across the feature's receptive region.

An advantage of function approximation with basis functions is that the approximated function is continuous and flexible. The basis functions, each with its own parameters, give a more precise representation of the value function across the entire state-action space. However, there are two limitations of this technique. The first is that selecting the parameters of the basis functions is difficult in general [38, 17, 34]. The coefficients of the function combination are often learned by training the solver on test instances, while the parameters of the basis functions themselves are tuned manually [34]. When the number of dimensions of the state and action space is very large, such manual tuning can

be difficult. The second limitation is that basis functions cannot easily handle state-action spaces with high dimensionality. They have been found to be hard to apply to continuous problems with more than a few dimensions because of the difficulty of manually tuning the basis functions [31, 25]. Also, the number of basis functions needed to approximate a state-action space can be exponential in the number of dimensions, causing the number of basis functions needed to be very large for a state-action space with high dimensionality.

A typical function approximation technique using basis functions is Radial Basis Function Networks (RBFNs) [38]. In an RBFN, a sequence of Gaussian curves is selected as the basis functions of the features. Each basis function $\phi_i$ for a feature $i$ has a center $c_i$ and width $\sigma_i$. Given an arbitrary state-action pair $s$, the Q-value of the state-action pair with respect to feature $i$ is:

$$\phi_i(s) = e^{-\frac{\|s - c_i\|^2}{2\sigma_i^2}}.$$

The total Q-value of the state-action pair with respect to all features is the sum of the values of $\phi_i(s)$ across all features.

A radial basis function is a real-valued function whose value depends only on the distance from its center. It can also be considered a fuzzy membership function, and in this sense RBFNs represent a fuzzy function approximation technique. But RBFNs are the natural generalization of coarse coding with binary features to continuous features. A typical RBF feature unavoidably represents information about some, but not all, dimensions of the state-action space. This limits RBFNs from approximating large-scale, high-dimension state-action spaces efficiently.
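A small sketch of the RBFN evaluation above. The text takes the total Q-value to be the sum of the basis responses; the sketch includes θ-weights as in the general linear scheme of Section 1.2, with $\theta(i) = 1$ recovering the unweighted form. The centers, widths and θ-values here are illustrative placeholders.

```python
import numpy as np

def rbf_features(sa, centers, widths):
    """phi_i(sa) = exp(-||sa - c_i||^2 / (2 * sigma_i^2)) for each feature i."""
    sa = np.asarray(sa, dtype=float)
    dist_sq = np.sum((centers - sa) ** 2, axis=1)
    return np.exp(-dist_sq / (2.0 * widths ** 2))

def rbf_q_value(sa, centers, widths, theta):
    """Total Q-value: the theta-weighted sum of the Gaussian responses."""
    return float(np.dot(theta, rbf_features(sa, centers, widths)))

# Example: three features over a 2-D encoding of a state-action pair.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])  # c_i
widths = np.array([0.5, 0.5, 1.0])                        # sigma_i
theta = np.array([0.2, -0.1, 0.4])
print(rbf_q_value([1.0, 0.5], centers, widths, theta))
```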

1.2.3 Function Approximation Using SDM

Function approximation using either natural features or basis functions is known not to scale well for large problem domains, or to require prior knowledge [39, 38, 25]. Neither approach is well-suited to problem domains with high dimensionality. We instead seek a class of features that can construct approximation functions without restricting the dimensionality of the state and action space. The theory of Sparse Distributed Memory (SDM) [23] gives such a class of features. These features are often not natural features. They are typically a set of state-action pairs chosen from the entire state and action space.

In function approximation using SDM, each receptive region is typically defined using a distance threshold with respect to the location of the feature in the state-action space. The θ-value of a state-action pair with respect to a feature is constant within the feature's receptive region, and is zero outside of this region.

An advantage of function approximation using an SDM structure is that it is particularly well-suited to problem domains with high dimensionality. Its computational complexity depends entirely on the number of prototypes, which is not a function of the number of dimensions of the state-action space. A limitation of this technique is that more prototypes are needed to approximate state-action spaces for complex problem domains, and the efficiency of function approximation using SDM is sensitive to the number of prototypes [25]. Even when enough prototypes

are used, the performance of a reinforcement learner with SDM is often poor and unstable [38, 26, 34], and there is no known mechanism to guarantee the convergence of the algorithm.

Kanerva Coding [24] is the implementation of SDM in function approximation for reinforcement learning. Here, a collection of prototype state-action pairs (prototypes) is selected, each of which corresponds to a binary feature. A state-action pair and a prototype are said to be adjacent if their bit-wise representations differ by no more than a threshold number of bits. A state-action pair is represented as a collection of binary features, each of which equals 1 if and only if the corresponding prototype is adjacent. A value $\theta(i)$ is maintained for each prototype, and the approximate value of a state-action pair is then the sum of the θ-values of the adjacent prototypes. In this way, Kanerva Coding can greatly reduce the size of the value table that needs to be stored.
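The following sketch shows the core of Kanerva Coding as just described: adjacency is a Hamming-distance test against each prototype, the membership vector is binary, and the approximate Q-value is the sum of the θ-values of adjacent prototypes. The threshold value is an illustrative choice.

```python
def hamming(u, v):
    """Number of bit positions in which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def membership(sa_bits, prototypes, threshold=2):
    """Binary feature vector: feature i equals 1 iff the state-action pair
    differs from prototype i in at most `threshold` bits (adjacency)."""
    return [1 if hamming(sa_bits, p) <= threshold else 0 for p in prototypes]

def q_value(sa_bits, prototypes, theta, threshold=2):
    """Approximate Q-value: the sum of theta-values of adjacent prototypes."""
    m = membership(sa_bits, prototypes, threshold)
    return sum(t for t, adj in zip(theta, m) if adj)
```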

1.3 Our Application Domain

In this dissertation, we apply our techniques to instances from two application domains: the predator-prey pursuit domain and the cognitive radio network domain.

The predator-prey pursuit domain [28], introduced in 1986, is a classic example of a multi-agent system. Problems based on this domain have been solved using a wide variety of approaches [19, 42, 3, 21], and the domain has many different versions that can be used to illustrate different multi-agent scenarios [34, 36, 37].

A general version of the predator-prey pursuit domain takes place on a rectangular grid with one or more predator agents and one or more prey agents. Each grid cell is either open or closed, and an agent can only occupy open cells. Each agent has an initial position. The problem is played in a sequence of time periods. In each time period, each agent can move to a neighboring open cell one horizontal or vertical step from its current location, or it can remain in its current cell. All moves are assumed to occur simultaneously, and more than one predator agent may not occupy the same cell at the same time. The goal of the predator agents is to capture the prey agents in the shortest time.

The domain can be fully specified by selecting different numbers of predators and prey, defining capture in different ways, and setting each agent's visible range. The pursuit domain is usually studied with two or more predators and one prey; capture occurs when a predator agent is in the same cell as a prey agent or all predator agents surround a prey agent; and the agents' visible range may be global or local (limited). Pursuit problems are difficult to solve in general, and problems similar to ours have been proven to be NP-Complete [8, 33]. Researchers have used approaches such as genetic algorithms [19] and reinforcement learning [42] to develop solutions. Closed-form solutions to restricted versions of the problem have been found [3, 21], but most such problems remain open.

The cognitive radio network domain [30], introduced in 1999, is a novel paradigm of wireless communication. The basic idea is that unlicensed devices (also called cognitive radio users) must vacate a band once licensed devices (also known as primary users) are detected. CR networks pose a great challenge due to the high fluctuation in the available spectrum as well as diverse quality-of-service (QoS) requirements. Specifically

in cognitive radio ad-hoc networks, the distributed multi-hop architecture, the dynamic network topology, and the time- and location-varying spectrum availability are some of the key distinguishing factors. Because a CR network must appropriately choose its transmission parameters based on limited environmental information, it must be able to learn from its experience and adapt its functioning. This challenge necessitates novel design techniques that simultaneously integrate theoretical research on reinforcement learning and multi-agent interaction with systems-level network design.

1.4 Dissertation Outline

In Chapter 2, we discuss the effectiveness of common function approximation techniques for large-scale problems. In particular, we first show empirically that the performance of reinforcement learners with traditional function approximation techniques over the predator-prey pursuit domain is poor. We then demonstrate that uneven feature distribution can cause poor performance, and describe a class of adaptive mechanisms that dynamically delete and generate features to reduce the uneven feature distribution. Finally, we propose our adaptive Kanerva-based function approximation, which uses a form of probabilistic prototype deletion plus prototype splitting, and show that using adaptive function approximation results in better learning performance compared to traditional function approximation.

In Chapter 3, we evaluate a class of hard instances of the predator-prey pursuit problem. We show that the performance using adaptive function approximation is still poor, and

argue that this poor performance is a result of frequent prototype collisions. We show that dynamic prototype allocation and adaptation can partially reduce these collisions and give better results than traditional function approximation. To completely eliminate prototype collisions, we describe a novel fuzzy approach to Kanerva-based function approximation which uses a fine-grained fuzzy membership grade to describe a state-action pair's adjacency with respect to each prototype. This approach, coupled with adaptive prototype allocation, allows the solver to distinguish membership vectors and reduce the collision rate. We also show that reducing the similarity between the membership vectors of state-action pairs can give better results. We use Maximum Likelihood Estimation to adjust the variances of basis functions and tune the receptive fields of prototypes. Finally, we conclude that our adaptive fuzzy Kanerva approach with prototype tuning gives better performance than the pure adaptive Kanerva algorithm.

In Chapter 4, we observe that an inappropriate number of prototypes may cause unstable and poor performance of the solver on the hardest class of pursuit instances, and show that choosing an optimal number of prototypes can improve the efficiency of function approximation. We use the theory of rough sets to measure how closely an approximate value function approximates the true value function and to determine whether or not more prototypes are required. We show that the structure of the equivalence classes induced by the prototypes is the key indicator of the effectiveness of a Kanerva-based reinforcement learner. We then describe a rough sets-based approach to selecting prototypes. This approach eliminates unnecessary prototypes by replacing the original prototype set with its reduct, and reduces prototype

collisions by splitting equivalence classes with two or more state-action pairs. Finally, we conclude that rough sets-based Kanerva Coding can adaptively select an effective number of prototypes and greatly improve a Kanerva-based reinforcement learner's ability to solve large-scale problems.

In Chapter 5, we apply reinforcement learning with Kanerva-based function approximation to solve a real-world application: wireless cognitive radio (CR). Wireless cognitive radio is a newly emerging paradigm that attempts to transmit opportunistically in licensed frequencies without affecting the pre-assigned users of these bands. To enable this functionality, such a radio must predict its operational parameters, such as transmit power and spectrum. These tasks, collectively called spectrum management, are difficult to achieve in a dynamic distributed environment, in which CR users may only make local decisions and react to environmental changes. In order to evaluate the efficiency of multi-agent reinforcement learning-based spectrum management, we first investigate various real-world scenarios and compare the communication performance using different sets of learning parameters. Our results indicate that the requirement of RL-based approaches that an estimated value be stored for every state greatly limits the size and complexity of the CR networks that can be solved. We therefore apply Kanerva-based function approximation to improve our approach's ability to handle large cognitive radio networks, and evaluate its effect on communication performance. We conclude that spectrum management based on reinforcement learning with function approximation can significantly reduce the interference to licensed users, while maintaining a high probability of successful transmission in a cognitive radio ad hoc network.

Chapter 2

Adaptive Function Approximation

Learning problems with large state spaces, such as multi-agent problems, can be difficult to solve. When applying reinforcement learning to such problems, the size of the table needed to store the state-action values can limit the complexity of the problems that can be solved. Function approximation can reduce the size of the table by storing an approximation of the entire table. However, most reinforcement learners behave poorly when used with function approximation in domains that are very large, have high dimension, or have a continuous state-action space [11, 27, 34].

In this chapter, we discuss the effectiveness of common function approximation techniques for large-scale problems. We first describe the performance of reinforcement learners with Tile Coding and Kanerva Coding over the predator-prey pursuit domain. We then show that uneven feature distribution can cause poor performance. We describe a class of adaptive mechanisms that dynamically delete and generate features based on feature visit frequencies.

Finally, we demonstrate that using adaptive function approximation results in better learning performance compared to traditional function approximation.

2.1 Experimental Evaluation: Traditional Function Approximation

Tile Coding and Kanerva Coding are two typical function approximation techniques that have been widely studied in various application domains [6, 24, 25, 38]. Both techniques give good learning performance and fast convergence for some instances with small state-action spaces. However, empirical results also indicate that reinforcement learners with Tile Coding or Kanerva Coding may still perform poorly as the size of the state-action space grows, or when applied to hard instances [26, 34]. We therefore investigate the efficiency of traditional function approximation as the size of the state-action space increases.

2.1.1 Application Instances

We evaluate the efficiency of traditional function approximation techniques by applying them to the predator-prey pursuit domain. This domain was selected because it is a well-known reinforcement learning problem; there is a class of instances with varying levels of difficulty; and, most importantly, the size of the state-action space for instances in this domain can be easily extended.

Figure 2.1: The grid world of size 32 x 32.

The predator-prey pursuit problem is challenging to solve because the size of its state-action space can be very large. A general version of the problem is described in Chapter 1. In our experiments, pursuit takes place on an n x n rectangular grid with open cells and n randomly selected closed blocks. Each open cell in the grid represents a state that an agent may occupy. Each predator agent is randomly placed in a starting cell. Figure 2.1 shows an example of our grid world of size 32 x 32.

We investigate three classes of instances with different levels of difficulty. The easy class of instances uses direct rewards and a fixed prey. That is, the predator agent receives a reward that is proportional to the predator's distance from the prey, and the prey does not move. The hard class of instances uses indirect rewards and a randomly moving prey. That

is, the predator agent receives a reward of 1 when it reaches the cell the prey is occupying, and receives a reward of 0 in every other cell. The predator attempts to catch a prey that is moving randomly.

We use Q-learning with traditional Tile Coding and Kanerva Coding to solve the three classes of pursuit instances on an n x n grid. The size of the grid varies from 8x8 to 32x32. In each epoch, we apply each learning algorithm to 40 random training instances followed by 40 random test instances. The exploration rate ε is set to 0.3, which we found experimentally to give the best results in our experiments. The initial learning rate α is set to 0.8, and it is decreased by a fixed factor after each epoch. For every 40 epochs, we record the average fraction of test instances solved during those epochs within 2n moves. Each experiment is performed 3 times, and we report the means and standard deviations of the recorded values. In our experiments, all runs were found to converge within 2000 epochs.

2.1.2 Performance Evaluation of Traditional Tile Coding

In Tile Coding, each state-action pair is represented as a binary vector, and a tiling is constructed by selecting three bit positions from the vector. That is, each tiling corresponds to a 3-tuple of bit positions. A tile within a tiling corresponds to an assignment of values to each selected bit position [48]. Figure 2.2 shows the implementation of a tiling. All tiles are selected randomly. As the dimension of the state-action space increases, we vary the number of tiles over the following values: 300, 400, 600, 700, 1000, 1500, 2000 and 2500.

We apply Tile Coding to solve the easy class of pursuit instances. Table 2.1 shows the

average fraction of the instances solved by Q-learning with traditional Tile Coding as the number of tiles and the size of the grid vary. The values shown represent the final converged values of the solution rates.

Figure 2.2: The implementation of the tiling. (A tiling is constructed by selecting three bit positions of the binary state-action vector; a tile is an assignment of values to those bit positions; each tiling partitions the state-action space.)

The results indicate that the fraction of test instances solved increases from 67.8% to 98.2% for the 8x8 grid, from 30.1% to 84.6% for the 16x16 grid, and from 6.0% to 38.6% for the 32x32 grid, as the number of tiles increases.

Figure 2.3 shows the average fraction of test instances solved by Q-learning with traditional Tile Coding with 2000 tiles as the size of the grid varies from 8x8 to 32x32. The graph shows how the solvers converge as the number of epochs increases. The fraction of test instances solved decreases from 97.1% to 33.6% as the grid size increases.

Table 2.1: The average fraction of test instances solved by Q-learning with traditional Tile Coding.

# of Tiles    8x8      16x16    32x32
300           67.8%    30.1%     6.0%
400              -     47.6%     9.9%
600              -     51.7%    17.2%
700              -     56.5%    20.1%
1000             -     64.4%    24.7%
1500             -     71.1%    29.3%
2000          97.1%    80.9%    33.6%
2500          98.2%    84.6%    38.6%

Figure 2.3: The fraction of test instances solved by Q-Learning with traditional Tile Coding with 2000 tiles.

These results show that as the size of the grid varies from 8x8 to 32x32, the fraction of test instances solved decreases sharply using traditional Tile Coding across all numbers of tiles. The number of tiles used also has a large effect on the fraction of test instances solved across all sizes of grids.

Figure 2.4: The implementation of Kanerva Coding. (Each prototype has its own receptive region, and each receptive region partitions the state-action space.)

2.1.3 Performance Evaluation of Traditional Kanerva Coding

We evaluate traditional Kanerva Coding by varying the number of prototypes and the size of the grid. We implement Kanerva Coding by representing each state-action pair as a binary vector. Each entry in the binary vector equals 1 if and only if the corresponding prototype is adjacent. Every prototype is a randomly selected state-action pair. Figure 2.4 shows the implementation of Kanerva Coding. As the dimension of the state-action space increases, we vary the number of prototypes from 300 to 2500.

We compare Kanerva Coding to Tile Coding when the number of prototypes is the same as the number of tiles. Table 2.2 shows the average fraction of test instances solved by

Table 2.2: The average fraction of test instances solved by Q-learning with traditional Kanerva Coding.

# of           Traditional Tile             Traditional Kanerva
Prototypes     8x8      16x16    32x32      8x8      16x16    32x32
300            67.8%    30.1%     6.0%      57.2%    28.5%     7.9%
400               -     47.6%     9.9%      63.5%    36.7%    13.2%
600               -     51.7%    17.2%      75.0%    42.3%    22.3%
700               -     56.5%    20.1%      79.2%    47.2%    28.0%
1000              -     64.4%    24.7%      90.9%    50.3%    32.1%
1500              -     71.1%    29.3%      91.4%    59.1%    36.6%
2000           97.1%    80.9%    33.6%      93.1%    75.4%    40.6%
2500           98.2%    84.6%    38.6%      93.5%    82.3%    43.2%

Q-learning with traditional Kanerva Coding as the number of prototypes varies from 300 to 2500, and the size of the grid varies from 8x8 to 32x32. The values shown represent the final converged values of the solution rate. The results indicate that the fraction of test instances solved increases from 57.2% to 93.5% for the 8x8 grid, from 28.5% to 82.3% for the 16x16 grid, and from 7.9% to 43.2% for the 32x32 grid, as the number of prototypes increases.

Figure 2.5 shows the average fraction of test instances solved by Q-learning with traditional Kanerva Coding with 2000 prototypes as the size of the grid varies from 8x8 to 32x32. The graph shows how the solvers converge as the number of epochs increases. The fraction of test instances solved decreases from 93.1% to 40.6% as the grid size increases.

These results show that as the size of the grid increases, the fraction of test instances solved decreases sharply using traditional Kanerva Coding for all numbers of prototypes. The fraction of test instances solved depends largely on the number of prototypes used across all sizes of grids.

Figure 2.5: The fraction of test instances solved by Q-Learning with traditional Kanerva Coding with 2000 prototypes.

With a grid size of 8x8, Tile Coding solves 98.2% of the test instances while Kanerva Coding solves only 93.5% of the test instances, after 2000 epochs. However, for a grid size of 32x32, Tile Coding solves 33.6% of the test instances while Kanerva Coding solves 43.2% of the test instances, after 2000 epochs. These results show that when the number of dimensions is small, traditional Tile Coding outperforms traditional Kanerva Coding. However, as the number of dimensions increases, Tile Coding's performance degrades faster than that of Kanerva Coding when the number of features is fixed. We conclude that Kanerva Coding performs better relative to Tile Coding when the dimension of the state-action space is large, and for this reason we choose Kanerva Coding as the starting point for our research.

2.2 Visit Frequency and Feature Distribution

The performance evaluation in the previous section showed that the efficiency of traditional function approximation techniques decreases sharply as the size of the state-action space increases. Our evaluation also showed that the performance of reinforcement learners with Tile Coding and Kanerva Coding is sensitive to the number of features, that is, the number of tiles in Tile Coding or the number of prototypes in Kanerva Coding. If the number of features is small relative to the number of state-action pairs, or if the features themselves are not well chosen, the approximate values will not be similar to the true values and the reinforcement learner will give poor results. If the number of features is very large relative to the number of state-action pairs, each feature may be adjacent to a small number of state-action pairs. In this case, the approximate state-action values will tend to be close to the true values, and the reinforcement learner will operate as usual. Unfortunately, we often do not have enough memory to store a large number of features, so we consider how to produce the smallest set of features which can span the entire state space.

It is difficult to generate such an optimal set of features for several reasons: the space of possible subsets is very large, and the state-action pairs encountered by the solver depend on the specific problem instance being solved. We therefore investigate several heuristic solutions to the feature optimization problem.

We say that a feature is visited during Q-learning if it is adjacent to the current state-action pair. Intuitively speaking, if a specific feature is rarely visited, few state-action pairs are adjacent to the feature. This suggests that the feature is inappropriate

for the particular application. In contrast, if a specific feature is visited frequently, many state-action pairs are adjacent to the feature. This suggests that the feature may not distinguish many distinct state-action pairs. Therefore, prototypes that are rarely visited do not contribute to the solution of instances, and prototypes that are visited very frequently are likely to decrease the distinguishability of state-action pairs. Removing the rarely-visited and heavily-visited prototypes may reduce the number of inappropriate prototypes and improve the efficiency of Kanerva Coding. Our goal is therefore to generate a set of features where each feature is visited an average number of times.

We define a feature's visit frequency as the number of visits to the feature during a learning process. In particular, we refer to a tile's visit frequency in Tile Coding and a prototype's visit frequency in Kanerva Coding. We observe the distribution of visit frequencies across all tiles or prototypes over a converged learning process. The frequency distribution of visits to tiles over three sample runs using Q-learning with Tile Coding is shown in Figure 2.6. The example uses direct rewards, a fixed prey, and 2000 tiles. Similarly, the frequency distribution of visits to prototypes over three sample runs using Q-learning with Kanerva Coding and 2000 prototypes is shown in Figure 2.7. The non-uniform distribution of visit frequencies across all tiles or prototypes indicates that most prototypes are either frequently visited or rarely visited. In the next section, we describe ways to generate sets of features with visit frequencies that are more uniform.

Figure 2.6: The frequency distribution of visits to tiles over a sample run using Q-learning with Tile Coding.

Figure 2.7: The frequency distribution of visits to prototypes over a sample run using Q-learning with Kanerva Coding.

2.3 Adaptive Mechanism in Kanerva-Based Function Approximation

The goal of feature optimization for function approximation is to produce a set of features whose visit frequencies are relatively uniform. The visit frequency of a feature is equal to the number of adjacent state-action pairs encountered during a learning process. The specific state-action pairs encountered by the solver depend on the specific problem instance being solved. Therefore, adaptively choosing features appropriate to the particular application is an important way to implement feature optimization for function approximation. Feature adaptation uses prior knowledge and online experience to improve a reinforcement learner. There have been few published attempts to explore this type of algorithm [34], and no known attempts to evaluate and improve the quality of feature adaptation for function approximation.

We optimize features using visit frequencies. We divide the original features into three categories: features with a low visit frequency, features with a high visit frequency, and the rest of the features. We describe and evaluate four optimization mechanisms to optimize the set of features. Since Kanerva Coding outperforms Tile Coding when the state-action space is high-dimensional, we base our optimization mechanisms on Kanerva Coding. Initial prototypes are selected randomly from the entire space of possible state-action pairs. Q-learning with

Kanerva Coding is used to develop policies for the predator agents, while keeping track of the number of visits to each prototype. After a fixed number of iterations, we update the prototypes using the mechanisms described below.

2.3.1 Prototype Deletion and Generation

Prototypes that are rarely visited do not contribute to the solution of instances. Similarly, prototypes that are visited very frequently are likely to decrease the distinguishability of state-action pairs. It makes sense to delete both types of prototypes and replace them with new prototypes whose visit frequencies are closer to an average value.

In our implementation, we periodically delete a fraction of the prototypes whose visit frequencies are lowest, and a fraction of the prototypes whose visit frequencies are highest. The fraction of prototypes that is deleted slowly decreases as the algorithm runs. The θ-value and visit frequency of each new prototype are initially set to zero. We refer to this approach as deterministic prototype deletion. An advantage of this approach is that it is easy to implement, and it uses application- and instance-specific information to guide the deletion of rarely or frequently visited prototypes. However, this approach deletes prototypes deterministically, which does not give the solver the flexibility to keep some prototypes that are rarely or frequently visited. For example, if the number of prototypes is very large, some prototypes that might become useful will not be visited in an early epoch and will be deleted.

In order to overcome this disadvantage, we delete prototypes with a probability equal

to an exponential function of the number of visits. That is, the probability $p_{del}$ of deleting a prototype whose visit frequency is $v$ is

$$p_{del} = \lambda e^{-\lambda v},$$

where $\lambda$ is a parameter that can vary from 0 to 1. In this approach, prototypes that are rarely visited tend to be deleted with a high probability, while prototypes that are frequently visited are rarely deleted. We refer to this approach as probabilistic prototype deletion.

We attempt to replace prototypes that have been deleted with new prototypes that will tend to improve the behavior of the function approximation. One approach is to generate new prototypes randomly from the entire state space. While this approach aggressively searches the state space for useful prototypes, it does not use domain- or instance-specific information. We instead create new prototypes by splitting heavily-visited prototypes. A prototype $s_1$ that has been visited the most times is selected, and a new prototype $s_2$ that is a neighbor of $s_1$ is created by inverting a fixed number of bits in $s_1$. The θ-value and visit frequency of the new prototype are initially set to zero, and the prototype $s_1$ remains unchanged. In this approach, new prototypes are created near the prototypes with the highest visit frequencies. These prototypes are similar but distinct, which tends to reduce the number of visits to nearby prototypes and therefore increase the distinguishability of these prototypes. We refer to this approach as prototype splitting.

Our adaptive Kanerva-based function approximation uses probabilistic prototype deletion with prototype splitting. The approach makes the distribution of feature visit frequencies more uniform. We therefore refer to this approach as frequency-based prototype optimization.
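The sketch below combines probabilistic prototype deletion with prototype splitting into one update round, in the spirit of frequency-based prototype optimization. The bookkeeping choices (parallel lists, the number of bits flipped per split, and replacing each deleted prototype with one split) are illustrative assumptions rather than the exact implementation.

```python
import math
import random

def adapt_prototypes(prototypes, theta, visits, lam=0.1, flip_bits=2, rng=random):
    """One round of frequency-based prototype optimization (a sketch):
    delete each prototype with probability lam * exp(-lam * visits), then
    restore the population size by splitting the most-visited prototypes."""
    survivors = [i for i in range(len(prototypes))
                 if rng.random() >= lam * math.exp(-lam * visits[i])]
    n_deleted = len(prototypes) - len(survivors)

    prototypes = [prototypes[i] for i in survivors]
    theta = [theta[i] for i in survivors]
    visits = [visits[i] for i in survivors]

    # Split: copy a heavily visited prototype and invert a few random bits;
    # the new prototype starts with zero theta-value and zero visit count.
    by_visits = sorted(range(len(prototypes)), key=lambda i: -visits[i])
    for i in by_visits[:n_deleted]:
        child = list(prototypes[i])
        for b in rng.sample(range(len(child)), flip_bits):
            child[b] ^= 1
        prototypes.append(child)
        theta.append(0.0)
        visits.append(0)
    return prototypes, theta, visits
```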

Table 2.3: The average fraction of test instances solved by Q-learning with adaptive Kanerva Coding.

  # of prototypes   8x8      16x16    32x32
  300               81.3%    49.6%    23.3%
  400               -        52.3%    28.3%
  600               -        62.4%    37.0%
  700               -        70.4%    41.7%
  1000              -        84.5%    62.8%
  1500              -        95.7%    77.6%
  2000              -        95.9%    90.5%
  2500              99.5%    96.1%    92.4%

2.3.2 Performance Evaluation of Adaptive Kanerva-Based Function Approximation

We evaluate our prototype optimization algorithm by applying Q-learning with adaptive Kanerva Coding to solve the easy class of predator-prey pursuit instances described in Section 2.1 on an n x n grid. Prototype optimization is applied after every 20 epochs. The size of the grid varies from 8x8 to 32x32. All other experimental parameters are unchanged.

Table 2.3 shows the average fraction of test instances solved by Q-Learning with adaptive Kanerva Coding as the number of prototypes varies from 300 to 2500 and the size of the grid varies from 8x8 to 32x32. The values shown represent the final converged values of the solution rates. The results indicate that as the number of prototypes increases, the fraction of test instances solved increases from 81.3% to 99.5% for the 8x8 grid, from 49.6% to 96.1% for the 16x16 grid, and from 23.3% to 92.4% for the 32x32 grid.

Figure 2.8: The fraction of test instances solved by Q-Learning with adaptive Kanerva Coding with 2000 prototypes. (Average solution rate vs. epoch for the adaptive and traditional algorithms on the 8x8, 16x16, and 32x32 grids.)

Figure 2.8 shows the average fraction of test instances solved by Q-Learning with adaptive and traditional Kanerva Coding with 2000 prototypes as the size of the grid varies from 8x8 to 32x32. The graph shows how the solvers converge as the number of epochs increases. The traditional Kanerva algorithm solves approximately 93.1% of the test instances with a grid size of 8x8, 75.4% with a grid size of 16x16, and 40.6% with a grid size of 32x32.

Figure 2.9: The frequency distribution of visits to prototypes over a sample run using Q-learning with adaptive Kanerva Coding. (Number of prototypes vs. number of visits for the 8x8, 16x16, and 32x32 grids.)

The adaptive Kanerva algorithm solves approximately 99.5% of the test instances with a grid size of 8x8, 95.9% with a grid size of 16x16, and 90.5% with a grid size of 32x32. These results indicate that adaptive Kanerva Coding outperforms traditional Kanerva Coding and that probabilistic prototype deletion with prototype splitting can significantly increase the efficiency of Kanerva-based function approximation.

We also observe the distribution of visit frequencies across all prototypes after optimization. Figure 2.9 shows these frequency distributions over the same instances used in Section 2.2. The graph shows that most prototypes are visited a near-average number of times. These results indicate that the optimized prototypes correctly span the state-action space of a particular instance. They suggest that the improved performance of the adaptive Kanerva algorithm over the traditional algorithm is due to the more uniform frequency distribution of visits to prototypes.

2.4 Summary

In this chapter, we evaluated and compared the behavior of two typical function approximation techniques, Tile Coding and Kanerva Coding, on the predator-prey pursuit domain. We showed that traditional function approximation techniques applied within a reinforcement learner do not give good learning performance. By examining the features' visit frequencies, we showed that a non-uniform distribution of visits across features is a key cause of this poor performance.

We then described our new adaptive Kanerva-based function approximation algorithm, based on prototype deletion and generation. We showed that probabilistic prototype deletion with prototype splitting increases the fraction of test instances solved. These results demonstrate that our approach can dramatically improve the quality of the results obtained and reduce the number of prototypes required. We conclude that adaptive Kanerva Coding using frequency-based prototype optimization can greatly improve a Kanerva-based reinforcement learner's ability to solve large-scale multi-agent problems.

Chapter 3

Fuzzy Logic-based Function Approximation

Feature optimization can improve the efficiency of traditional function approximation within reinforcement learners to a certain extent. This approach can produce a more uniform frequency distribution of visits across features by deleting features that are not necessary and splitting important features. In Chapter 2, we described our implementation of this approach, adaptive Kanerva Coding. However, this approach still gives poor performance, and the improvement over traditional Kanerva Coding is small when applied to hard instances of large-scale multi-agent systems. We therefore must consider what other factors might be causing this poor performance.

In this chapter, we attempt to solve a class of hard instances in the predator-prey pursuit domain and argue that the poor performance that we observe is caused by frequent prototype collisions.

We show that feature optimization can give better results by partially reducing these collisions. We then describe our novel approach, fuzzy Kanerva-based function approximation, which uses a fine-grained fuzzy membership grade to describe a state-action pair's adjacency with respect to each prototype. This approach can completely eliminate prototype collisions.

3.1 Experimental Evaluation: Kanerva Coding Applied to Hard Instances

In Chapter 2, we described three classes of pursuit instances that ranged in difficulty. Adaptive Kanerva Coding, which outperforms traditional Kanerva Coding, gave good learning performance and fast convergence on the easy class of instances. We now evaluate a reinforcement learner with adaptive Kanerva Coding on a collection of hard instances.

We evaluate traditional and adaptive Kanerva Coding by applying them to pursuit instances with indirect rewards and a randomly moving prey. The state-action pairs are represented as binary vectors and all prototypes are selected randomly. Probabilistic prototype deletion with prototype splitting is used as feature optimization for adaptive Kanerva Coding. The number of prototypes varies over the following values: 300, 400, 600, 700, 1000, 1500, 2000 and 2500. The size of the grid varies from 8x8 to 32x32.

Table 3.1 shows the average fraction of hard test instances solved by Q-learning with adaptive Kanerva Coding as the number of prototypes and the size of the grid vary.

Table 3.1: The average fraction of hard test instances solved by Q-learning with adaptive Kanerva Coding.

  # of prototypes   8x8      16x16    32x32
  300               73.3%    32.3%    20.8%
  400               -        38.1%    24.9%
  600               -        50.5%    36.1%
  700               -        57.3%    39.7%
  1000              -        65.3%    55.2%
  1500              -        78.2%    60.7%
  2000              -        83.4%    67.9%
  2500              96.4%    88.8%    76.4%

The values shown represent the final converged value of the solution rate. The results indicate that as the number of prototypes increases, the fraction of test instances solved increases from 73.3% to 96.4% for the 8x8 grid, from 32.3% to 88.8% for the 16x16 grid, and from 20.8% to 76.4% for the 32x32 grid. Comparing with Table 2.2, we see that adaptive Kanerva Coding achieves a lower average solution rate when solving hard test instances than when solving easy test instances when the number of prototypes and the size of the grid are held constant.

Figure 3.1 shows the average fraction of hard test instances solved by Q-learning with adaptive Kanerva Coding with 2000 prototypes as the size of the grid varies from 8x8 to 32x32. The graph shows how the solvers converge as the number of epochs increases. The results show that when using adaptive Kanerva-based function approximation with 2000 prototypes, the fraction of test instances solved decreases from 94.9% to 67.9% as the grid size increases. These results indicate that although it improves on traditional Kanerva Coding, the fraction of test instances solved using adaptive Kanerva Coding still decreases sharply as the size of the grid increases when applied to hard test instances.

Figure 3.1: The fraction of easy and hard test instances solved by Q-learning with adaptive Kanerva Coding with 2000 prototypes. (Average solution rate vs. epoch for the 8x8, 16x16, and 32x32 grids.)

Feature optimization only improves the efficiency of function approximation to a certain extent, and cannot solve hard instances of large-scale systems. We need to further explore other factors that may be causing poor performance.

3.2 Prototype Collisions in Kanerva Coding

Kanerva Coding is an implementation of SDM for reinforcement learning. A collection of k prototypes is selected, each of which corresponds to a binary feature. A state-action pair sa and a prototype p_i are adjacent if their bit-wise representations differ by no more than a threshold number of bits. The threshold is typically set to 1 bit. We define the adjacency grade adj_i(sa) of sa with respect to p_i to be equal to 1 if sa is adjacent to p_i, and equal to 0 otherwise. A state-action pair's prototype vector consists of its adjacency grades with respect to all prototypes.

Figure 3.2: Illustration of prototype collisions: (a) state-action pairs adjacent to no prototype; (b) state-action pairs adjacent to an identical prototype set; (c) state-action pairs adjacent to unique prototype vectors.

A value θ(i) is maintained for the ith prototype, and Q(sa), an approximation of the value of a state-action pair sa, is then the sum of the θ-values of the adjacent prototypes; that is,

Q(sa) = Σ_i θ(i) · adj_i(sa).

A prototype collision is said to have taken place between two distinct state-action pairs, sa_i and sa_j, if and only if sa_i and sa_j have the same prototype vector, that is, the same adjacency grades over all prototypes. In Kanerva Coding, for two arbitrary state-action pairs, there are three possible cases: the state-action pairs are both adjacent to no prototypes, the state-action pairs have identical prototype vectors, or the state-action pairs have distinct prototype vectors, as shown in Figure 3.2. Kanerva Coding works best when each state-action pair has a unique prototype vector, so that no prototype collision takes place.
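The prototype vector, the approximate value Q(sa), and the fraction of colliding pairs (the collision rate discussed below) can be computed directly from the bit-vector representations. The following is a minimal sketch, assuming state-action pairs and prototypes are bit-tuples of equal length; the function names prototype_vector, q_value, and collision_rate are ours.

def prototype_vector(sa, prototypes, threshold=1):
    # adj_i(sa) = 1 if sa differs from prototype p_i in at most
    # `threshold` bit positions, and 0 otherwise.
    return tuple(int(sum(a != b for a, b in zip(sa, p)) <= threshold)
                 for p in prototypes)

def q_value(sa, prototypes, theta, threshold=1):
    # Q(sa) is the sum of the theta-values of the adjacent prototypes.
    adj = prototype_vector(sa, prototypes, threshold)
    return sum(t * a for t, a in zip(theta, adj))

def collision_rate(pairs, prototypes):
    # A state-action pair collides if it is adjacent to no prototype, or
    # if some other pair has exactly the same prototype vector.
    vectors = {sa: prototype_vector(sa, prototypes) for sa in pairs}
    counts = {}
    for v in vectors.values():
        counts[v] = counts.get(v, 0) + 1
    zero = (0,) * len(prototypes)
    colliding = [sa for sa, v in vectors.items()
                 if v == zero or counts[v] > 1]
    return len(colliding) / len(pairs)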

If prototypes are not well distributed across the state-action space, many state-action pairs will either not be adjacent to any prototypes, or will be adjacent to identical sets of prototypes, corresponding to identical prototype vectors. If two similar state-action pairs are adjacent to the same set of prototypes, their state-action values are always the same during the learning process. Typically, the solver needs to distinguish such state-action pairs, which is not possible in this case. Such prototype collisions reduce the quality of the results, since the estimates of the Q-values of such state-action pairs will be equal [49].

The collision rate in Kanerva Coding is the fraction of state-action pairs that are either adjacent to no prototypes, or adjacent to the same set of prototypes as some other state-action pair. The larger the collision rate, the more frequently prototype collisions occur during Kanerva-based function approximation. The collision rate is therefore inversely related to the learning performance of a reinforcement learner with Kanerva-based function approximation. Selecting a set of prototypes that distinguishes frequently-visited distinct state-action pairs can improve the solver's ability to solve the problem. However, it is difficult to generate such a set of prototypes for several reasons: the space of possible subsets is very large, and the state-action pairs encountered by the solver depend on the specific problem instance being solved. Dynamic prototype allocation and adaptation removes unnecessary prototypes and adds new prototypes that cover parts of the state-action space that are frequently visited during instance-based learning. In this way, prototypes can be adaptively adjusted to minimize prototype collisions for the specific problem domain.

Figure 3.3: Prototype collisions using traditional and adaptive Kanerva-based function approximation with 2000 prototypes. (For each grid size, the bars show the percentage of state-action pairs adjacent to a unique prototype set, adjacent to a non-unique prototype set, and adjacent to no prototypes.)

In order to evaluate the negative effect of prototype collisions, we observe the collision rates produced when using traditional Kanerva Coding and adaptive Kanerva Coding as the size of the grid varies. Figure 3.3 shows the fraction of state-action pairs that are adjacent to no prototypes, adjacent to identical sets of prototypes, and adjacent to a unique set of prototypes when traditional Kanerva Coding and adaptive Kanerva Coding with 2000 prototypes are applied to easy predator-prey instances of varying sizes. Here, the collision rate is the sum of the fraction of state-action pairs that are adjacent to no prototypes and the fraction of state-action pairs that are adjacent to identical sets of prototypes. These results show that for the traditional algorithm, the collision rate increases from 25.0% to 71.5% as the size of the grid increases. For the adaptive algorithm, the collision rate increases from 8.5% to 29.5% as the size of the grid increases.

The results also suggest that the improved performance of the adaptive Kanerva algorithm over the traditional algorithm coincides with the reduction in prototype collisions. For example, the adaptive Kanerva algorithm reduces the collision rate from 71.5% to 29.5% while the average solution rate increases for a grid size of 32x32. However, the results also indicate that while the adaptive mechanism successfully reduces the number of collisions caused by state-action pairs that are adjacent to no prototypes, it is less successful at reducing the number of collisions caused by state-action pairs that are adjacent to identical sets of prototypes. For example, the adaptive algorithm reduces the number of collisions caused by state-action pairs that are adjacent to no prototypes by 91.7% in the 8x8 grid, by 90.0% in the 16x16 grid, and by 78.8% in the 32x32 grid. But it reduces the number of collisions caused by state-action pairs that are adjacent to identical sets of prototypes by only 49.4% in the 8x8 grid, 45.6% in the 16x16 grid, and 28.0% in the 32x32 grid.

To further clarify the effect of prototype collisions on the efficiency of Kanerva-based function approximation, we evaluate the performance of traditional and adaptive Kanerva-based function approximation, and their corresponding collision rates, using different numbers of prototypes and different sizes of grids. Figure 3.4 shows the fraction of test instances solved (the solution rate) and the fraction of state-action pairs that are adjacent to no prototypes or adjacent to identical prototype vectors (the collision rate) for traditional and adaptive Kanerva-based function approximation as the number of prototypes varies from 300 to 2500 on grids of sizes varying from 8x8 to 32x32.

Figure 3.4: Average fraction of test instances solved (solution rate) on (a) the 8x8 grid, (c) the 16x16 grid, and (e) the 32x32 grid, and the fraction of state-action pairs that are adjacent to no prototypes or adjacent to identical prototype vectors (collision rate) on (b) the 8x8 grid, (d) the 16x16 grid, and (f) the 32x32 grid, for traditional and adaptive Kanerva-based function approximation as the number of prototypes varies from 300 to 2500.

The values shown represent the final converged value of the solution rate. The results show that, when using traditional Kanerva Coding, as the number of prototypes increases the solution rate increases from 57.2% to 93.5% while the collision rate decreases from 83.7% to 22.9% for the 8x8 grid; the solution rate increases from 28.5% to 82.3% while the collision rate decreases from 65.7% to 48.4% for the 16x16 grid; and the solution rate increases from 7.9% to 43.2% while the collision rate decreases from 89.7% to 70.8% for the 32x32 grid. As a comparison, when using adaptive Kanerva Coding, as the number of prototypes increases the solution rate increases from 81.3% to 99.5% while the collision rate decreases from 50.0% to 8.8% for the 8x8 grid; the solution rate increases from 49.6% to 96.1% while the collision rate decreases from 65.7% to 16.2% for the 16x16 grid; and the solution rate increases from 23.3% to 92.4% while the collision rate decreases from 84.9% to 25.4% for the 32x32 grid.

These results indicate that for both traditional and adaptive Kanerva-based function approximation, across all grid sizes, the fraction of test instances solved decreases sharply as the collision rate increases sharply. The results also indicate that adaptive Kanerva Coding has better learning performance and causes fewer prototype collisions than traditional Kanerva Coding, and that this advantage is magnified as the size of the grid increases. However, the performance of the adaptive algorithm on large instances is still poor as the number of prototypes decreases, as shown in Figure 3.4. It is therefore necessary to consider a more effective approach for reducing the collision rate as the dimension of the state-action space increases.

3.3 Adaptive Fuzzy Kanerva Coding

A more flexible and powerful approach to function approximation is to allow a state-action pair to update the θ-values of all prototypes, instead of only a subset of neighboring prototypes. Instead of using binary adjacency values, we use fuzzy membership grades that vary continuously between 0 and 1 across all prototypes. These fuzzy membership grades are larger for closer prototypes and smaller for more distant prototypes. Since prototype collisions occur only when two state-action pairs have the same real values in all elements of their membership vectors, collisions are much less likely.

In traditional Kanerva Coding, a collection of k prototypes is selected. A state-action pair sa and a prototype p_i are said to be adjacent if their bit-wise representations differ by no more than a threshold number of bits. To introduce fuzzy membership grades, we reformulate this definition of traditional Kanerva Coding using fuzzy logic [16, 13, 44]. We define the membership grade µ_i(sa) of sa with respect to p_i as

µ_i(sa) = 1 if sa is adjacent to p_i, and 0 otherwise.

A state-action pair's membership vector consists of its membership grades with respect to all prototypes. A value θ(i) is maintained for the ith feature, and Q̂(sa), an approximation of the value of a state-action pair sa, is then the sum of the θ-values of the adjacent prototypes. That is,

Q̂(sa) = Σ_i θ(i) µ_i(sa).

Figure 3.5: Sample membership function for traditional Kanerva Coding.

Because only k θ-values must be stored, Kanerva Coding greatly reduces the size of the value table that needs to be stored. Figure 3.5 gives an abstract description of the distribution of a state-action pair's membership grade with respect to each element of a set of prototypes. The figure shows the regions of the state-action space where prototype collisions take place. Note that receptive fields with crisp boundaries can cause frequent collisions.

3.3.1 Fuzzy and Adaptive Mechanism

In our fuzzy approach to Kanerva Coding, the membership grade is defined as follows.

Given a state-action pair sa, the ith prototype p_i, and a constant variance σ², the membership grade of sa with respect to p_i is

µ_i(sa) = e^(−‖sa − p_i‖² / 2σ²),

where ‖sa − p_i‖ denotes the bit difference between sa and p_i. Note that the membership grade of a prototype with respect to an identical state-action pair is 1, and the membership grade of a state-action pair with respect to a completely different prototype approaches 0. The effect of an update ∆θ to a prototype's θ-value is now a continuous function of the bit difference ‖sa − p_i‖ between the state-action pair sa and the prototype p_i. The update can have a large effect on immediately adjacent prototypes, and a smaller effect on more distant prototypes.

Figure 3.6: Sample membership function for fuzzy Kanerva Coding.

Figure 3.6 gives an abstract description of the distribution of a state-action pair's fuzzy membership grade with respect to each member of a set of prototypes.
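A minimal sketch of these fuzzy membership grades and the resulting value estimate follows, assuming bit-tuple representations as before; sigma2 is the constant variance σ², and the function names bit_difference, membership, and q_hat are ours.

import math

def bit_difference(sa, p):
    # ||sa - p||: the number of bit positions in which sa and p differ.
    return sum(a != b for a, b in zip(sa, p))

def membership(sa, prototypes, sigma2):
    # mu_i(sa) = exp(-||sa - p_i||^2 / (2 sigma^2)): grade 1 for an
    # identical prototype, approaching 0 as the bit difference grows.
    return [math.exp(-bit_difference(sa, p) ** 2 / (2.0 * sigma2))
            for p in prototypes]

def q_hat(sa, prototypes, theta, sigma2):
    # Q(sa) is the membership-weighted sum of all theta-values, so an
    # update to theta touches every prototype in proportion to its grade.
    mu = membership(sa, prototypes, sigma2)
    return sum(t * m for t, m in zip(theta, mu))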

In the adaptive Kanerva Coding algorithm described above, prototypes are updated based on their visit frequencies. In fuzzy Kanerva Coding the visit frequency of each prototype is identical, so we instead use membership grades, which vary continuously from 0 to 1. If the membership grade of a state-action pair with respect to a prototype tends to 1, we say that the prototype is strongly adjacent to the state-action pair; otherwise, the prototype is said to be weakly adjacent to the state-action pair. The probability p_update(sa) that a state-action pair sa is chosen as a prototype is

p_update(sa) = λe^(−λ m(sa)),

where λ is a parameter that can vary from 0 to 1, and where m(sa) is the accumulated sum of the membership grades of state-action pair sa with respect to all prototypes. Under this mechanism, prototypes that are weakly adjacent to frequently-visited state-action pairs tend to be probabilistically replaced by prototypes that are strongly adjacent to frequently-visited state-action pairs.
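The replacement step can be sketched as follows. This is an illustrative reading of the rule above, assuming m is a dictionary of accumulated membership grades over the encountered state-action pairs and that enough candidates are available to fill the set; the name select_prototypes is ours.

import math
import random

def select_prototypes(candidates, m, size, lam=0.1):
    # Rebuild the prototype set: each encountered state-action pair sa is
    # admitted with probability p_update(sa) = lam * exp(-lam * m[sa]),
    # repeating passes over the candidates until `size` prototypes are chosen.
    chosen = set()
    while len(chosen) < size:
        for sa in candidates:
            if len(chosen) >= size:
                break
            if random.random() < lam * math.exp(-lam * m[sa]):
                chosen.add(sa)
    return list(chosen)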

3.3.2 Adaptive Fuzzy Kanerva Coding Algorithm

Algorithm 1 describes our adaptive fuzzy Kanerva Coding algorithm. The algorithm begins by initializing parameters and repeatedly executes Q-learning with fuzzy Kanerva Coding. Prototypes are adaptively updated periodically. The algorithm computes fuzzy membership grades for all encountered state-action pairs with respect to all prototypes. Current prototypes are then periodically and probabilistically replaced with encountered state-action pairs, selected according to their accumulated membership grades.

Algorithm 1 Pseudocode of adaptive fuzzy Kanerva Coding

Main()
  choose a set of prototypes p and initialize their θ-values
  repeat
    generate an initial state-action pair s from initial state ς and action a
    Q-with-Kanerva(s, a, p, θ)
    Update-prototypes(p, θ)
  until all episodes are traversed

Q-with-Kanerva(s, a, p, θ)
  repeat
    take action a, observe reward r, get next state ς'
    µ(s) = e^(−‖s − p‖² / 2σ²); Q̂(s) = µ(s) · θ
    for all actions a* available in state ς' do
      generate the state-action pair s' from state ς' and action a*
      µ(s') = e^(−‖s' − p‖² / 2σ²); Q̂(s') = µ(s') · θ
    end for
    δ = r + γ max_{s'} Q̂(s') − Q̂(s)
    ∆θ = α δ µ(s); θ = θ + ∆θ
    m(s) = m(s) + µ(s)
    if a random number ≥ ε then
      a = argmax_{a*} Q̂(s'), the greedy action in the new state
    else
      a = a random action
    end if
  until s is terminal

Update-prototypes(p, θ)
  p' = ∅
  repeat
    for all encountered state-action pairs s do
      with probability λe^(−λ m(s)), p' = p' ∪ {s}
    end for
  until p' is full

Table 3.2: The average fraction of test instances solved by Q-Learning with adaptive fuzzy Kanerva Coding.

  # of prototypes   8x8      16x16    32x32
  300               80.9%    42.8%    20.9%
  400               -        50.0%    25.5%
  600               -        61.8%    39.0%
  700               -        67.2%    41.2%
  1000              -        71.2%    58.6%
  1500              -        86.7%    78.4%
  2000              -        91.6%    82.8%
  2500              97.5%    92.2%    85.3%

3.3.3 Performance Evaluation of Adaptive Fuzzy Kanerva-Based Function Approximation

We evaluate the performance of adaptive fuzzy Kanerva Coding by applying Q-learning with adaptive Kanerva Coding and adaptive fuzzy Kanerva Coding, with different numbers of prototypes, to hard pursuit instances on grids of various sizes. Table 3.2 shows the average fraction of hard test instances solved by Q-learning with fuzzy Kanerva Coding as the number of prototypes and the size of the grid vary. The values shown represent the final converged value of the solution rate. The results indicate that as the number of prototypes increases, the fraction of test instances solved increases from 80.9% to 97.5% for the 8x8 grid, from 42.8% to 92.2% for the 16x16 grid, and from 20.9% to 85.3% for the 32x32 grid. Comparing with Table 3.1, we see that fuzzy Kanerva Coding increases the average solution rate over adaptive Kanerva Coding when the number of prototypes and the size of the grid are held constant.

Figure 3.7: Average solution rate for adaptive fuzzy Kanerva Coding with 2000 prototypes. (Average solution rate vs. epoch for the fuzzy and adaptive algorithms on the 8x8, 16x16, and 32x32 grids.)

Figure 3.7 shows the average fraction of test instances solved when adaptive Kanerva and adaptive fuzzy Kanerva-based function approximation are applied to our instances as the size of the grid varies. The results show that with 2000 prototypes, the fuzzy algorithm increases the fraction of the test instances solved over the adaptive algorithm from 83.4% to 91.6% in the 16x16 grid and from 67.9% to 82.8% in the 32x32 grid. These results indicate that the fuzzy algorithm increases the fraction of the test instances solved over the adaptive Kanerva algorithm.

3.4 Prototype Tuning

While fuzzy Kanerva Coding can give good results for our instances, the quality of the results is often unstable. That is, the average fraction of test instances solved by the fuzzy approach may still be low. An explanation for these results can be found by considering the similarity of membership vectors across state-action pairs. Intuitively, similarity between the membership vectors of state-action pairs is equivalent to the prototype collisions observed with traditional Kanerva Coding. In both cases, it can reduce the quality of the results.

3.4.1 Experimental Evaluation: Similarity Analysis of Membership Vectors

Figure 3.8(a) shows the average membership grade of each prototype with respect to all other prototypes on a sample run. The prototypes are ordered by decreasing average membership grade. The results show that the prototypes fall into three general regions. On the left, the prototypes have a higher average membership grade, corresponding to prototypes that are closer on average to other prototypes. On the right, prototypes have a lower average membership grade, corresponding to prototypes that are on average farther from other prototypes. The prototypes on the left are in a region of the state-action space where the distribution of prototypes is more dense, and prototypes on the right are in a region where the distribution of prototypes is more sparse. This variation in the distribution of the prototypes causes the receptive fields to be unevenly distributed across the state-action space. State-action pairs in the dense region of the space are near to more prototypes and therefore have large membership grades that are near the top of the Gaussian response function.

Figure 3.8: (a) Distribution of membership grades and (b) prototype similarity across sorted prototypes.

Similarly, state-action pairs in the sparse region of the space are far from more prototypes and therefore have small membership grades that are near the tail of the Gaussian response function. A state-action pair's membership grade is less sensitive to variations when the membership grade is near 1 or 0, as illustrated in Figure 3.9(a). Two state-action pairs in the dense region are therefore more likely to have membership vectors that are similar, and the same is true for two state-action pairs in the sparse region. This similarity between the membership vectors of state-action pairs is equivalent to the prototype collisions observed with traditional Kanerva Coding, and may have a similar negative effect on the quality of the results.

Figure 3.9: Illustration of the similarity of membership vectors across sparse and dense prototype regions, showing a prototype's membership function (a) before prototype tuning and (b) after prototype tuning.

Figure 3.8(b) illustrates how the similarity between prototypes varies across the state-action space. The graph shows the average Euclidean distance between each prototype and every other prototype. Prototypes in the dense and sparse regions have a smaller average Euclidean distance, indicating that they are more similar to one another.

3.4.2 Tuning Mechanism

We can reduce the effect of similar membership vectors by adjusting the variance of the Gaussian response function used to compute membership grades. The variance is decreased in the dense region, which narrows the Gaussian response function, and the variance is increased in the sparse region, which broadens the Gaussian response function. This prototype tuning increases the sensitivity of state-action pairs' membership vectors to variations in the state-action space in these regions, as shown in Figure 3.9(b).

We use Maximum Likelihood Estimation to compute an estimate σ̂²_i of the variance of each prototype's membership function. Given a prototype p_i, we let d_ij be the bit difference between prototype p_i and each other prototype p_j, where j ≠ i, and let d̄_i be the sample mean of the d_ij. The estimate is

σ̂²_i = Σ_j (d_ij − d̄_i)² / n,

where n is the number of prototypes.
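The per-prototype variance estimate can be computed as in the following sketch, which reuses the bit_difference helper sketched in Section 3.3.1 and follows the estimator above; the name tune_variances is ours.

def tune_variances(prototypes):
    # Maximum-likelihood estimate of each prototype's variance from its
    # bit differences to the other prototypes: prototypes in dense regions
    # receive a small variance (a narrow response function), prototypes in
    # sparse regions a large one (a broad response function).
    n = len(prototypes)
    sigma2 = []
    for i, p in enumerate(prototypes):
        d = [bit_difference(p, q)
             for j, q in enumerate(prototypes) if j != i]
        d_bar = sum(d) / len(d)
        sigma2.append(sum((x - d_bar) ** 2 for x in d) / n)
    return sigma2

Each membership grade µ_i is then computed with the prototype's own estimate σ̂²_i in place of the constant σ².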

3.4.3 Performance Evaluation of Tuning Mechanism

We evaluate our implementation of adaptive fuzzy Kanerva Coding with prototype tuning by using it to solve pursuit instances on a grid of size 32x32. As a comparison, the adaptive fuzzy and adaptive approaches are also applied to the same instances.

Figure 3.10: Average solution rate for adaptive fuzzy Kanerva Coding with tuning using 2000 prototypes. (Average solution rate vs. epoch for fuzzy Kanerva with tuning, fuzzy Kanerva, and adaptive Kanerva.)

Figure 3.10 shows the average fraction of test instances solved by adaptive fuzzy Kanerva Coding with prototype tuning. We can see that prototype tuning increases the fraction of the test instances solved over both the adaptive fuzzy algorithm and the adaptive algorithm. For example, with 2000 prototypes, prototype tuning increases the fraction of the test instances solved from 67.9% (adaptive) and 82.8% (adaptive fuzzy) to 97.1%. These results demonstrate that prototype tuning can greatly improve the efficiency of adaptive fuzzy Kanerva Coding.

We further evaluate our adaptive fuzzy Kanerva Coding algorithm with prototype tuning by applying it to the four-room problem employed by Sutton, Precup and Singh [41] and Stone and Veloso [37]. To increase the size of the state space, we extend the grid to size 32x32, as shown in Figure 3.11. Pursuit takes place on a rectangular grid with 4 rooms. The agent can move to a neighboring open cell one horizontal or vertical step from its current location, or it can remain in its current cell. To go to another room, the agent must pass through a door. The agent is placed in a random starting cell and attempts to reach a fixed goal cell. The agent receives a reward of 1 when it reaches the goal cell, and a reward of 0 in every other cell.

Figure 3.11: The four-room gridworld.

Figure 3.12 shows the average fraction of test instances solved by adaptive fuzzy Kanerva Coding with prototype tuning on instances of the four-room problem. The results show that adaptive fuzzy Kanerva Coding with prototype tuning increases the fraction of the test instances solved over the adaptive and adaptive fuzzy approaches. For example, with 2000 prototypes, adaptive fuzzy Kanerva Coding with prototype tuning increases the fraction of the test instances solved from 58.4% (adaptive) and 78.9% (adaptive fuzzy) to 94.9%. These results again demonstrate that adaptive fuzzy Kanerva Coding with prototype tuning can greatly improve the quality of the results obtained.

Figure 3.12: Average solution rate for adaptive fuzzy Kanerva Coding with tuning in the four-room gridworld of size 32x32. (Average solution rate vs. epoch for fuzzy Kanerva with tuning, fuzzy Kanerva, and adaptive Kanerva.)

3.5 Summary

In this chapter, we evaluated a class of hard pursuit instances of the predator-prey problem and argued that the poor performance we observed is caused by frequent prototype collisions. We also showed that dynamic prototype allocation and adaptation can partially reduce these collisions and give better results. However, the collision rate remained quite high and the performance was still poor for large-scale instances. It was therefore necessary to consider a more effective approach for eliminating collisions as the dimension of the state-action space increases.

Our new fuzzy approach to Kanerva-based function approximation uses a fine-grained fuzzy membership grade to describe a state-action pair's adjacency with respect to each prototype. This approach, coupled with adaptive prototype allocation, allows the solver to distinguish membership vectors and reduce the collision rate. Our adaptive fuzzy Kanerva approach gives better performance than the purely adaptive Kanerva algorithm.

We then showed that prototype density varies widely across the state-action space, causing prototypes' receptive fields to be unevenly distributed across that space. State-action pairs in dense or sparse regions of the space are more likely to have similar membership vectors, which limits the performance of a reinforcement learner based on Kanerva Coding. Our fuzzy framework for Kanerva-based function approximation allows us to tune the prototypes' receptive fields to balance the effects of prototype density variations, further increasing the fraction of test instances solved. We conclude that adaptive fuzzy Kanerva Coding with prototype tuning can significantly improve a reinforcement learner's ability to solve large-scale, high-dimensional problems.

Chapter 4

Rough Sets-based Function Approximation

Fuzzy Kanerva-based function approximation can significantly improve the efficiency of function approximation within reinforcement learners. As we described in Chapter 3, this approach distinguishes frequently-visited state-action pairs by using a fine-grained fuzzy membership grade to describe a state-action pair's adjacency with respect to each prototype. In this way, the fuzzy approach eliminates prototype collisions. We have shown that this approach gives a function approximation architecture that outperforms other approaches.

However, our experimental results show that this approach can still give poor performance when solving hard large-scale instances, and shows unstable behavior when changing the number of prototypes.

We therefore extend our work to improve our algorithm. In this chapter, we show that choosing an optimal number of prototypes can improve the efficiency of function approximation. We propose to use the theory of rough sets to measure how closely an approximate value function approximates the true value function, and to determine whether more prototypes are required. Finally, we describe a rough sets-based approach to selecting prototypes for a Kanerva-based reinforcement learner.

4.1 Experimental Evaluation: Effect of Varying Number of Prototypes

The efficiency of Kanerva-based function approximation depends largely on the number of prototypes, and it is clear that this efficiency decreases as the number of prototypes decreases. We therefore investigate the performance of a reinforcement learner with adaptive Kanerva Coding as the number of prototypes decreases.

We evaluate the effect of varying the number of prototypes by applying Q-learning with adaptive Kanerva Coding to the class of predator-prey pursuit instances. The state-action pairs are represented as binary vectors and all prototypes are selected randomly. Probabilistic prototype deletion with prototype splitting is used as feature optimization. The number of prototypes varies from 300 to 2500. The size of the grid varies from 8x8 to 32x32.

Table 4.1 shows the average fraction of test instances solved by Q-learning with adaptive Kanerva Coding as the number of prototypes and the size of the grid vary.

Table 4.1: The average fraction of test instances solved by Q-Learning with adaptive Kanerva Coding.

  # of prototypes   8x8      16x16    32x32
  300               81.3%    49.6%    23.3%
  400               -        52.3%    28.3%
  600               -        82.4%    37.0%
  700               -        90.4%    41.7%
  1000              -        94.5%    62.8%
  1500              -        95.7%    77.6%
  2000              -        95.9%    90.5%
  2500              99.5%    96.1%    92.4%

The values shown represent the final converged value of the solution rate. The results show that the average fraction of test instances solved by adaptive Kanerva Coding decreases as the number of prototypes decreases, similar to the behavior of both traditional and adaptive fuzzy Kanerva Coding.

Figure 4.1 shows the average fraction of hard test instances solved by Q-learning with adaptive Kanerva Coding as the number of prototypes decreases from 2500 to 300. The results show that as the number of prototypes decreases, the fraction of test instances solved decreases from 99.5% to 81.3% in the 8x8 grid, from 96.1% to 49.6% in the 16x16 grid, and from 92.4% to 23.3% in the 32x32 grid. This indicates that the efficiency of adaptive Kanerva-based function approximation does increase as the number of prototypes increases. Unfortunately, we often do not have enough memory to store a large number of prototypes. We must therefore consider how to generate an appropriate number of prototypes that can improve the efficiency of Kanerva-based function approximation.

Figure 4.1: The fraction of hard test instances solved by Q-learning with adaptive Kanerva Coding as the number of prototypes decreases. (Average solution rate vs. number of prototypes for the 8x8, 16x16, and 32x32 grids.)

4.2 Rough Sets and Kanerva Coding

In traditional Kanerva Coding, a set of state-action pairs is selected from the state-action space as prototypes. We assume that P is the set of prototypes, Λ is the set of all possible state-action pairs in the state-action space, and SA is the set of state-action pairs encountered by the solver. For Kanerva-based function approximation, P ⊆ Λ and SA ⊆ Λ. Our goal is to represent a set of observed state-action pairs SA using a set of prototypes P. That is, given an arbitrary set of state-action pairs SA, we wish to express the set using an approximate set induced by the prototype set P.

Assume that the function f_p(sa) represents the adjacency between prototype p and state-action pair sa. That is, if sa is adjacent to p, then f_p(sa) is equal to 1; otherwise it equals 0.

Table 4.2: Sample of adjacency between state-action pairs and prototypes. (The table lists the adjacency values f_p(sa) for the pairs sa_1, ..., sa_10 against the prototypes p_1, ..., p_6.)

The set of adjacency values for a state-action pair with respect to all prototypes is referred to as the state-action pair's prototype vector. On the basis of the prototype set P, we define an indiscernibility relation, denoted IND(P):

IND(P) = {(sa_1, sa_2) ∈ Λ² | ∀p ∈ P, f_p(sa_1) = f_p(sa_2)},

where p is a prototype and sa_1 and sa_2 are two state-action pairs, that is, sa_1 ∈ SA and sa_2 ∈ SA. If any two state-action pairs sa_1 and sa_2 in the set SA are indiscernible by the prototypes in P, there is an associated indiscernibility relation between sa_1 and sa_2. A set of state-action pairs with the same indiscernibility relation is defined as an equivalence class, and the ith such equivalence class is denoted E_i^P. The set of prototypes P therefore partitions the set SA into a collection of equivalence classes, denoted {E^P}.

For example, assume ten state-action pairs, (sa_1, sa_2, sa_3, ..., sa_10), are encountered by a solver, and we have six prototypes, (p_1, p_2, p_3, ..., p_6). We attempt to express each state-action pair using the prototypes. Table 4.2 shows a sample of the adjacencies between the state-action pairs and the prototypes. When the prototypes are considered, we can induce the following equivalence classes:

(E_1^P, E_2^P, E_3^P, E_4^P, E_5^P, E_6^P, E_7^P) = ({sa_1}, {sa_2}, {sa_3}, {sa_4, sa_5}, {sa_9}, {sa_6}, {sa_7, sa_8, sa_10}).

Figure 4.2: Illustration of the equivalence classes of the sample.

Figure 4.2 shows an illustration of the equivalence classes of the sample. The structure of the equivalence classes induced by the prototype set has a significant effect on function approximation. Kanerva Coding works best when each state-action pair has a unique prototype vector. That is, in the ideal set of equivalence classes induced by the prototype set, each class includes no more than one state-action pair. If two or more state-action pairs are in the same equivalence class, these state-action pairs are indiscernible with respect to the prototypes, causing a prototype collision. The definition of prototype collision can be found in Section 3.2.
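The partition induced by a prototype set can be computed by grouping state-action pairs on their prototype vectors, as in the sketch below, which reuses the prototype_vector helper sketched in Section 3.2; the function names equivalence_classes and conflict_rate (the measure used later in this chapter) are ours.

def equivalence_classes(pairs, prototypes, threshold=1):
    # Two state-action pairs are indiscernible under IND(P) exactly when
    # every prototype assigns them the same adjacency value, i.e. when
    # they share a prototype vector; each distinct vector is one class.
    classes = {}
    for sa in pairs:
        key = prototype_vector(sa, prototypes, threshold)
        classes.setdefault(key, []).append(sa)
    return list(classes.values())

def conflict_rate(pairs, prototypes, threshold=1):
    # Fraction of equivalence classes holding two or more pairs.
    classes = equivalence_classes(pairs, prototypes, threshold)
    return sum(1 for c in classes if len(c) > 1) / len(classes)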

Given a set of prototypes, one or more prototypes may not affect the structure of the induced equivalence classes, and therefore do not help differentiate state-action pairs. These prototypes can be replaced with prototypes that are more useful. To do this, we use a reduct of the prototype set. A reduct is a subset of prototypes R ⊆ P such that (1) {E^R} = {E^P}, that is, the equivalence classes induced by the reduced prototype set R are the same as the equivalence classes induced by the original prototype set P; and (2) R is minimal, that is, {E^(R−{p})} ≠ {E^P} for any prototype p ∈ R. Thus, no prototype can be removed from the reduced prototype set R without changing the equivalence classes.

In the above example, the subset (p_2, p_4, p_5) is a reduct of the original prototype set P. This can be shown easily because (1) the equivalence classes induced by (p_2, p_4, p_5) are the same as the equivalence class structure induced by the original prototype set P; and (2) eliminating any of these prototypes alters the induced equivalence class structure.

Replacing a set of prototypes with its reduct eliminates unnecessary prototypes. Adaptive prototype optimization can also eliminate unnecessary prototypes by deleting rarely-visited prototypes, but that approach cannot eliminate prototypes that are heavily-visited but unnecessary, such as prototype p_6 in the above example. Note that all state-action pairs are adjacent to this prototype, but deleting it does not change the structure of the equivalence classes.

We evaluate the structure of the equivalence classes and the reduct of the prototype set under traditional Kanerva Coding and adaptive Kanerva Coding. We apply traditional Kanerva and adaptive Kanerva with 2000 prototypes to sample predator-prey instances of varying sizes.

Figure 4.3: The conflict rate (the fraction of equivalence classes that contain two or more state-action pairs), together with the corresponding solution rate and collision rate, for traditional Kanerva and adaptive Kanerva with frequency-based prototype optimization across all grid sizes.

Figure 4.3 shows the fraction of equivalence classes that contain two or more state-action pairs (the conflict rate), together with the corresponding solution rate and collision rate, for traditional Kanerva and adaptive Kanerva with frequency-based prototype optimization across all grid sizes. These results show that as the fraction of equivalence classes that contain two or more state-action pairs increases, the collision rate increases and the performance of each algorithm decreases. For example, for the traditional algorithm, the collision rate increases from 25.0% to 71.5% and the average solution rate decreases from 93.1% to 40.6%, while the fraction of equivalence classes that contain two or more state-action pairs increases from 27.6% to 79.5% as the size of the grid increases.

For the adaptive algorithm, the collision rate increases from 8.5% to 29.5% and the average solution rate decreases from 99.5% to 90.5%, while the fraction of equivalence classes that contain two or more state-action pairs increases from 8.2% to 35.2% as the size of the grid increases. The results also suggest that the improved performance of the adaptive Kanerva algorithm over the traditional algorithm is due to the reduction in the fraction of equivalence classes that contain two or more state-action pairs. For example, for a grid size of 32x32, the adaptive algorithm reduces the fraction of equivalence classes that contain two or more state-action pairs from 79.5% to 35.2% while the average solution rate increases from 40.6% to 90.5%.

Figure 4.4 shows the fraction of prototypes remaining after performing a prototype reduct using traditional and optimized Kanerva-based function approximation with 2000 prototypes. The original and final number of prototypes is shown on each bar. The results indicate that the structure of the equivalence classes can be maintained using fewer prototypes. For example, for a grid size of 32x32, the equivalence classes induced by the 1821 prototypes remaining for the adaptive algorithm using frequency-based prototype optimization are the same as the equivalence classes induced by the original 2000 prototypes.

4.3 Rough Sets-based Kanerva Coding

A more reliable approach to prototype optimization for function approximation is to apply rough sets theory to reformulate Kanerva-based function approximation. Instead of using visit frequencies for frequency-based prototype optimization, we focus on the structure of the equivalence classes induced by the set of prototypes, a key indicator of the efficiency of function approximation.

Figure 4.4: The fraction of prototypes remaining after performing a prototype reduct using traditional and optimized Kanerva-based function approximation with 2000 prototypes. The original and final number of prototypes is shown on each bar.

When the fraction of equivalence classes that contain two or more state-action pairs increases, the performance of a reinforcement learner based on Kanerva Coding decreases. Since a prototype reduct maintains the equivalence class structure, prototype deletion can be conducted by replacing the set of prototypes with a reduct of the original prototype set. Since prototype collisions occur only when two state-action pairs are in the same equivalence class, prototype generation should reduce the fraction of equivalence classes that contain two or more state-action pairs.

4.3.1 Prototype Deletion and Generation

In rough sets-based Kanerva Coding, if the structure of the equivalence classes remains unchanged, the efficiency of function approximation is also unchanged. Replacing a set of prototypes with its reduct clearly eliminates unnecessary prototypes. We therefore implement prototype deletion by finding a reduct R of the original prototype set P. We refer to this approach as reduct-based prototype deletion. Note that a reduct of a prototype set is not necessarily unique, and there may be many subsets of prototypes which preserve the equivalence-class structure.

The following algorithm finds a reduct of the original prototype set. We consider each prototype in P one by one. For a prototype p ∈ P, if the set of equivalence classes {E^(P−{p})} induced by P − {p} is not identical to the set of equivalence classes {E^P} induced by P, that is, {E^(P−{p})} ≠ {E^P}, then p belongs to the reduct R of the original prototype set P, so p ∈ R; otherwise, p ∉ R, and we delete p from the prototype set P. We then consider the next prototype. The final set R is a reduct of the original prototype set P. We find a series of random reducts of the original prototype set, then select a reduct with the fewest elements to be the replacement of the original prototype set. Reduct-based prototype optimization makes only a few passes through the prototypes and is not time-consuming: with n state-action pairs and p prototypes, the complexity is O(n·p²). Once a prototype is deleted, its θ-value is added to those of the nearest remaining prototypes.
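A sketch of this reduct search follows, reusing equivalence_classes from above. It is a minimal illustration of the procedure just described: the partition is compared as a set of frozensets so that class order is irrelevant, and the smallest of ten randomly-ordered reducts is kept; the name find_reduct is ours.

import random

def find_reduct(prototypes, pairs, tries=10):
    def partition(protos):
        return frozenset(frozenset(c)
                         for c in equivalence_classes(pairs, protos))
    target = partition(prototypes)
    best = list(prototypes)
    for _ in range(tries):
        reduct = list(prototypes)
        # Scan the prototypes in a random order; a prototype whose removal
        # leaves the partition unchanged is unnecessary and is dropped.
        for p in random.sample(list(prototypes), len(prototypes)):
            trial = [q for q in reduct if q != p]
            if partition(trial) == target:
                reduct = trial
        if len(reduct) < len(best):
            best = reduct
    return best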

In rough sets-based Kanerva Coding, if the number of equivalence classes that contain only one state-action pair increases, prototype collisions are less likely and the efficiency of function approximation increases. An equivalence class that contains two or more state-action pairs is likely to be split up by adding a new prototype equal to one of those state-action pairs. We therefore implement prototype generation by adding new prototypes that split equivalence classes containing two or more state-action pairs. We refer to this approach as equivalence class-based prototype generation. For an arbitrary equivalence class that contains n > 1 state-action pairs, we randomly select log(n) state-action pairs to be new prototypes. Note that this value is the smallest number of prototypes needed to distinguish all state-action pairs in an equivalence class that contains n elements. This algorithm does not guarantee that each equivalence class will be split into new classes that contain exactly one state-action pair. For example, this approach cannot split an equivalence class containing two neighboring state-action pairs. In this case, we instead add a new prototype that is a neighbor of one of the state-action pairs, but not a neighbor of the other.
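The generation step can be sketched as follows, operating on the classes returned by equivalence_classes. We read log(n) as a base-2 logarithm rounded up, since that is the smallest count that could give n members distinct vectors; the special case of two neighboring pairs is handled as described in the text and omitted here, and the name generate_prototypes is ours.

import math
import random

def generate_prototypes(classes):
    # Split every equivalence class holding n > 1 state-action pairs by
    # promoting ceil(log2(n)) of its members to new prototypes.
    # (The special case of two neighboring pairs is handled separately.)
    new = []
    for c in classes:
        n = len(c)
        if n > 1:
            k = min(n, max(1, math.ceil(math.log2(n))))
            new.extend(random.sample(c, k))
    return new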

4.3.2 Rough Sets-based Kanerva Coding Algorithm

Algorithm 2 describes our algorithm for implementing Q-learning with adaptive Kanerva Coding using rough sets-based prototype optimization. The algorithm begins by initializing parameters, and repeatedly executes Q-learning with adaptive Kanerva Coding. Prototypes are adaptively updated periodically. In each update period, the encountered state-action pairs are recorded. To update the prototypes, the algorithm first determines the structure of the equivalence classes of the set of encountered state-action pairs with respect to the original prototypes.

Algorithm 2 Pseudocode of Q-learning with rough sets-based Kanerva Coding

Main()
  choose a set of prototypes p and initialize their θ-values
  repeat
    generate an initial state-action pair s from initial state ς and action a
    Q-with-Kanerva(s, a, p, θ)
    Update-prototypes(p, θ)
  until all episodes are traversed

Q-with-Kanerva(s, a, p, θ)
  repeat
    take action a, observe reward r, get next state ς'
    Q̂(s) = adj(s) · θ
    for all actions a* available in state ς' do
      generate the state-action pair s' from state ς' and action a*
      Q̂(s') = adj(s') · θ
    end for
    δ = r + γ max_{s'} Q̂(s') − Q̂(s)
    ∆θ = α δ adj(s); θ = θ + ∆θ
    if a random number ≥ ε then
      a = argmax_{a*} Q̂(s'), the greedy action in the new state
    else
      a = a random action
    end if
  until s is terminal

Update-prototypes(p, θ)
  Prototype-reduct-based-Deletion(p, θ)
  Equivalence-class-based-Generation(p, θ)

Prototype-reduct-based-Deletion(p, θ)
  E(p) = the equivalence classes induced by p
  p_reduct = p
  for i = 1 to 10 do
    p_tmp = p
    repeat
      p̂ = p_tmp − {q} for the next prototype q
      E(p̂) = the equivalence classes induced by p̂
      if E(p̂) = E(p) then
        p_tmp = p̂
      end if
    until all prototypes q ∈ p_tmp are traversed
    if |p_tmp| < |p_reduct| then
      p_reduct = p_tmp
    end if
  end for

Equivalence-class-based-Generation(p, θ)
  repeat
    n = the number of state-action pairs in equivalence class E
    if n > 1 then
      if n = 2 and the two state-action pairs sa_1 and sa_2 are neighbors then
        p = p ∪ {a state-action pair that is a neighbor of sa_1 but not of sa_2}
      else
        repeat
          randomly select a state-action pair sa from E
          p = p ∪ {sa}
        until log(n) new prototypes have been generated
      end if
    end if
  until all equivalence classes E ∈ E(p) are traversed

Unnecessary prototypes are then deleted by replacing the original prototype set with the reduct with the fewest elements among ten randomly-generated reducts. In order to split large equivalence classes, new prototypes are randomly selected from those equivalence classes. For an equivalence class with two neighboring state-action pairs, a new prototype is chosen that is a neighbor of one state-action pair but not a neighbor of the other. The optimized prototype set is constructed by adding the newly generated prototypes to the reduct of the original prototype set.

4.3.3 Performance Evaluation of Rough Sets-based Kanerva Coding

We evaluate the performance of rough sets-based Kanerva Coding by using it to solve pursuit instances on grids of varying sizes. As a comparison, traditional Kanerva Coding and adaptive Kanerva Coding with different numbers of prototypes are also applied to the same instances. Traditional Kanerva Coding follows Sutton [38]. Kanerva Coding with adaptive prototype optimization is implemented using prototype deletion and prototype splitting; a detailed description of prototype deletion and splitting can be found in Section 3.3. While rough sets-based Kanerva Coding runs, we also observe the change in the number of prototypes and the fraction of equivalence classes that contain only one state-action pair.

Figure 4.5 shows the average fraction of test instances solved when traditional Kanerva, adaptive Kanerva, and rough sets-based Kanerva Coding are applied to our instances on grids of size 8x8, 16x16, and 32x32.

Figure 4.5: Average solution rate for traditional Kanerva, adaptive Kanerva and rough sets-based Kanerva: (a) in 8x8 grid; (b) in 16x16 grid; (c) in 32x32 grid. (Each panel plots average solution rate against epoch, comparing rough sets-based optimization, frequency-based optimization with 2000 or a matched number of prototypes, and no optimization with 2000 prototypes.)

Table 4.3: Percentage performance improvement of rough sets-based Kanerva over adaptive Kanerva.

    Size   4x4     8x8     16x16   32x32   64x64
    Gap    13.2%   11.8%   24.6%   11.7%   16.5%

The results show that the rough sets-based algorithm increases the fraction of test instances solved over the adaptive Kanerva algorithm when using the same number of prototypes. For example, after 2000 epochs, the rough sets-based algorithm increases the fraction of test instances solved over the adaptive algorithm from 87.6% to 99.4% in the 8x8 grid, from 73.4% to 98.0% in the 16x16 grid, and from 81.1% to 92.8% in the 32x32 grid. The results across grids of varying sizes indicate that rough sets-based Kanerva coding uses fewer prototypes and achieves higher performance by adaptively changing the number and allocation of prototypes.

Table 4.3 shows the percentage performance improvement of rough sets-based Kanerva over adaptive Kanerva across varying grid sizes. The results show that the rough sets-based approach consistently performs more than 10% better than the adaptive approach across grid sizes, indicating that our rough sets-based approach can reliably improve a Kanerva-based reinforcement learner's ability.

Figure 4.6 shows the effect of our rough sets-based Kanerva coding on the number of prototypes and the corresponding change in the fraction of equivalence classes that contain only one state-action pair in grids of size 8, 16 and 32. The results show that the rough sets-based algorithm reduces the number of prototypes and increases the fraction of equivalence classes with only one state-action pair.

Figure 4.6: Effect of rough sets-based Kanerva on the number of prototypes and the fraction of equivalence classes that contain only one state-action pair: (a) in 8x8 grid; (b) in 16x16 grid; (c) in 32x32 grid.

For example, after 2000 epochs, the rough sets-based algorithm reduces the number of prototypes to 568, 955 and 1968, and increases the fraction of equivalence classes with one state-action pair to 99.5%, 99.8% and 94.9% in the grids of size 8, 16 and 32, respectively. These results also demonstrate that rough sets-based Kanerva can adaptively find an effective number of prototypes and dynamically allocate prototypes toward an optimal structure of equivalence classes in a particular application.

4.4 Effect of Varying the Number of Initial Prototypes

The accuracy of Kanerva-based function approximation is sensitive to the number of prototypes. In general, more prototypes are needed to approximate the state-action spaces of more complex applications. The computational complexity of Kanerva Coding also depends directly on the number of prototypes: larger prototype sets can approximate more complex spaces more accurately, but at greater cost. Neither traditional Kanerva nor adaptive Kanerva can adaptively select the number of prototypes, so this number has a significant effect on the efficiency of both. If the number of prototypes is too large relative to the number of state-action pairs, the implementation of Kanerva coding is unnecessarily time-consuming. If the number of prototypes is too small, even if the prototypes are well chosen, the approximate values will not be close to the true values and the reinforcement learner will give poor results. Selecting the appropriate number of prototypes is difficult for traditional and adaptive Kanerva coding, and in most known applications of these algorithms the number of prototypes is selected manually.

Figure 4.7: Variation in the number of prototypes with different numbers of initial prototypes with rough sets-based Kanerva in a 16x16 grid.

However, for a particular application, the set of observed state-action pairs is limited to a fixed subset of all possible state-action pairs, and the number of prototypes needed to distinguish this set of state-action pairs is also fixed. We are therefore interested in investigating the effect of different numbers of initial prototypes on rough sets-based Kanerva coding. We use our rough sets-based algorithm with 0, 250, 500, 1000, 1500 or 2000 initial prototypes to solve pursuit instances in the 16x16 grid. Figure 4.7 shows the effect of our algorithm on the number of prototypes. The results show that the number of prototypes tends to converge to a fixed number in the range from 922 to 975 after 2000 epochs. These results demonstrate that our rough sets-based Kanerva coding has the ability to adaptively determine an effective number of prototypes during a learning process.

4.5 Summary

Kanerva Coding can be used to improve the performance of function approximation within reinforcement learners, but it often gives poor performance when applied to large-scale systems. We evaluated a collection of pursuit instances of the predator-prey problem and argued that the poor performance is caused by inappropriate selection of prototypes, including both the number and the allocation of these prototypes. We also showed that adaptive Kanerva coding can give better results by dynamically reallocating the prototypes. However, the number of prototypes remains hard to select, and performance still suffers when that number is chosen poorly. It was therefore necessary to consider a more effective approach for adaptively selecting the number of prototypes.

Our new rough sets-based Kanerva function approximation uses rough sets theory to reformulate the prototype set and its implementation in Kanerva Coding. This approach uses the structure of equivalence classes to explain how prototype collisions occur. Our algorithm eliminates unnecessary prototypes by replacing the original prototype set with its reduct, and reduces prototype collisions by splitting equivalence classes that contain two or more state-action pairs. Our results indicate that rough sets-based Kanerva coding can adaptively select an effective number of prototypes and greatly improve a Kanerva-based reinforcement learner's ability to solve large-scale problems.

Chapter 5

Real-world Application: Cognitive Radio Network

5.1 Introduction

Radio frequency spectrum is a scarce resource. In many countries, governmental agencies, e.g. the Federal Communications Commission (FCC) in the United States, assign spectrum bands to specific operators or devices to prevent them from being used by unlicensed users. However, the utilization of these assigned bands depends strongly on time and place, and the bands are often rarely used. Recent studies have demonstrated that much of the radio frequency spectrum is inefficiently utilized [35, 5]. To address this issue, the FCC has recently begun to allow unlicensed users to utilize licensed bands whenever doing so would not cause any interference

[1]. Therefore, dynamic spectrum management techniques are needed to improve the efficiency of spectrum utilization [5, 18, 29]. The development of these techniques motivates the novel research area of cognitive radio (CR) networks.

The key idea of CR networks is that unlicensed devices (also called cognitive radio users) detect vacant spectrum and utilize it without harmful interference to licensed devices (also known as primary users). This approach requires that CR networks have the ability to sense spectrum holes and capture the best transmission parameters to meet quality-of-service (QoS) requirements. However, in real-world ad hoc networks, dynamic network topology and spectrum availability that varies across time slots and locations pose a critical challenge for CR networks.

Recent studies have shown that applying theoretical research on multi-agent reinforcement learning to spectrum management in CR networks is a feasible approach to meeting this challenge [50]. Since a CR network must have sufficient computational intelligence to choose its appropriate transmission parameters based on the external network environment, it must be capable of learning from its historical experience and adapting its behavior to the current context. This approach works well for networks with small topologies. However, it often gives poor performance when applied to large-scale networks, which typically have a very large number of unlicensed and licensed users and a wide range of possible transmission parameters. Experimental results have shown that the performance of CR networks decreases sharply as the size of the network increases [50]. There is therefore a need for algorithms that apply function approximation techniques to scale up reinforcement learning

for large-scale cognitive radio networks.

Figure 5.1: The CR ad hoc architecture.

Our work focuses on cognitive radio ad hoc networks with decentralized control [4]. The architecture of a CR ad hoc network, shown in Figure 5.1 [50], can be partitioned into two groups of users: the primary network and the CR network components. The primary network is composed of primary users (PUs) that have a license to operate in a certain spectrum band. The CR network is composed of cognitive radio users (CR users) that share wireless channels with licensed users that already have an assigned spectrum. Under this architecture, the CR users need to continuously monitor the spectrum for the presence of primary users and reconfigure the radio front-end according to the demands and requirements of the higher layers. This capability can be realized, as shown in Figure 5.2 [50], by the cognitive cycle

composed of the following spectrum functions: (1) determining the portions of the spectrum currently available (Spectrum sensing), (2) selecting the best available channel (Spectrum decision), (3) coordinating access to this channel with other users (Spectrum sharing), and (4) effectively vacating the channel when a licensed user is detected (Spectrum mobility).

Figure 5.2: The cognitive radio cycle for the CR ad hoc architecture.

In this chapter, we describe a reinforcement learning-based solution that allows each sender-receiver pair to locally adjust its choice of spectrum and transmit power, subject to connectivity and interference constraints. We model this as a multi-agent learning system in which each action, i.e., a choice of power level and spectrum, earns a reward based on the utility being maximized. We first evaluate the reinforcement learning-based approach and show that it works well for small network topologies but performs poorly for large ones. We argue that large-scale cognitive radio wireless networks are typically difficult to

solve using reinforcement learning because of the huge state-action space. Thus, using a smaller approximate value table instead of the original state-action value table is necessary for a real cognitive radio wireless network. We then apply function approximation techniques to reduce the size of the state-action value table. We conclude that our function approximation technique can scale up the ability of the reinforcement learning-based cognitive radio approach.

5.2 Reinforcement Learning-Based Cognitive Radio

5.2.1 Problem Formulation

In this chapter, we assume that our network consists of a collection of PUs and CR users, each of which is paired with another user to form transmitter-receiver pairs. The PUs exist in a spatially overlapped region with the nodes of the wireless network. The CR users make decisions on choosing the spectrum and transmission power independently of the others in the neighborhood. We also assume perfect sensing, in which the CR user correctly infers the presence of the PU if the former lies within the PU's transmission range. Moreover, the CR users can also detect, in the case of a collision, whether the colliding node is a PU transmitter or another CR user. We model this by keeping the PU transmit power an order of magnitude higher than the CR user's power, which is realistic in contexts such as the use of TV transmitters. If the receiver, while performing energy detection, observes the received signal energy at a level several multiples greater than in the CR user-only case, it identifies a

collision with the PU, and relays this condition back to the sender via an out-of-band control channel. Since the PU receiver's location is unknown (and hence it cannot be determined whether the concurrent sensor transmission caused a collision at the PU receiver), all such cases are flagged as PU interference. Thus, our approach is conservative: it overestimates the effect of interference on the PU to safeguard the PU's performance.

A choice of spectrum by CR user i is essentially the choice of the frequency represented by F_i ∈ F, the set of available frequencies. The CR users continuously monitor the spectrum that they choose in each time slot. The channels chosen are discrete, and a jump from any channel to another is possible in consecutive time slots. The transmit power chosen by CR user i is given by P_tx^i. The transmission range and interference range are represented by R_t and R_i, respectively. Our simulator uses the free-space path loss equation to calculate the attenuated power incident at receiver j, denoted P_rx^j. Thus,

    P_rx^j = \alpha \, P_tx^i \, (1 / D_i)^\beta,

where D_i is the transmitter-receiver distance, the path loss exponent is β = 2, and the constant α folds in the carrier-frequency terms, which involve the speed of light c = 3 × 10^8 m/s. The power values chosen are discrete, and a jump from any given value to another is possible in consecutive time slots.
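As a concrete illustration, the following snippet computes received power under the free-space model just described. It is a minimal sketch: the numeric values are hypothetical examples, not parameters from our simulator, and α is left as a single lumped constant.

```python
def received_power(p_tx, distance, alpha=1.0, beta=2):
    # Free-space attenuation: P_rx = alpha * P_tx * (1 / D)**beta,
    # with path loss exponent beta = 2; alpha lumps together the antenna
    # and carrier-frequency constants (which involve the speed of light).
    return alpha * p_tx * (1.0 / distance) ** beta

# Example: doubling the distance cuts the received power by a factor of 4.
print(received_power(p_tx=100.0, distance=10.0))   # 1.0
print(received_power(p_tx=100.0, distance=20.0))   # 0.25
```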

5.2.2 Application to cognitive radio

In a cognitive radio network, if we consider each cognitive user to be an agent and the wireless network to be the external environment, cognitive radio can be formulated as a system in which communicating agents sense their environment, learn, and adjust their transmission parameters to maximize their communication performance. This formulation fits well within the context of reinforcement learning.

Figure 5.3: Multi-agent reinforcement learning-based cognitive radio.

Figure 5.3 gives an overview of how we apply reinforcement learning to cognitive radio. Each cognitive user acts as an agent using reinforcement learning. These agents perform spectrum sensing and perceive their current states, i.e., their spectra and transmission powers. They then make spectrum decisions and use spectrum mobility to choose actions, i.e., to switch channels or change their power values. Finally, the agents use spectrum sharing to transmit signals.
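A minimal sketch of this agent formulation, assuming small discrete channel and power sets and leaving the reward (the transmission utility, penalized for PU interference) to the environment, might look like the following; the class, the parameter values, and the simple running-average update are illustrative, not our exact system.

```python
import random

CHANNELS = [1, 2, 3]            # hypothetical discrete frequency set F
POWER_LEVELS = [0.1, 0.5, 1.0]  # hypothetical discrete transmit powers

class CRAgent:
    """One CR user: its state is its current (channel, power) choice,
    and its actions switch channels or change power levels."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.q = {}  # value table over (state, action) pairs
        self.state = (random.choice(CHANNELS), random.choice(POWER_LEVELS))

    def actions(self):
        # Any channel/power combination is reachable in the next time slot.
        return [(c, p) for c in CHANNELS for p in POWER_LEVELS]

    def choose(self):
        # Epsilon-greedy spectrum decision.
        if random.random() < self.epsilon:
            return random.choice(self.actions())
        return max(self.actions(),
                   key=lambda a: self.q.get((self.state, a), 0.0))

    def learn(self, action, reward, alpha=0.1):
        # Reward reflects the utility of the transmission, e.g. penalized
        # on PU interference; here a simple running estimate is kept.
        key = (self.state, action)
        self.q[key] = self.q.get(key, 0.0) + alpha * (reward - self.q.get(key, 0.0))
        self.state = action
```

In a large network, the q dictionary above is exactly the state-action value table that becomes too large to maintain, which is where the Kanerva-based function approximation of the earlier chapters comes in.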
