
ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA

BY

BHARAT SIGINAM

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

RICHARD MACLIN

MARCH 2012

Abstract

The performance of a Reinforcement Learning (RL) agent depends on the accuracy of the approximated state value functions. Tile coding (Sutton and Barto, 1998), a function approximation method, generalizes the approximated state value functions for the entire state space using a set of tile features (discrete features based on continuous features). The shape and size of the tiles in this method are decided manually. In this work, we propose various adaptive tile coding methods to automate the decision of the shape and size of the tiles. The proposed adaptive tile coding methods use a random tile generator, the number of states represented by features, the frequencies of observed features, and the difference between the deviations of predicted value functions from Monte Carlo estimates to select the split points. The RL agents developed using these methods are evaluated in three different RL environments: the puddle world problem, the mountain car problem and the cart pole balance problem. The results obtained are used to evaluate the efficiencies of the proposed adaptive tile coding methods.

Acknowledgements

I am indebted to my advisor Dr. Richard Maclin for giving me an opportunity to work under him and for patiently guiding me during the course of my thesis work. I would like to thank my committee members Dr. Hudson Turner and Dr. Marshall Hampton for considering my request to evaluate the thesis. I would also like to thank all the faculty and staff at the Computer Science Department and Dr. Joseph Gallian for helping me during my course work and TA work.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
   1.1 Thesis Statement
   1.2 Overview
2 Background
   2.1 Reinforcement Learning
   2.2 Generalization Methods
      2.2.1 Basic Idea of Generalization Methods
      2.2.2 Function Approximation
      2.2.3 Gradient-Descent Methods
      2.2.4 Coarse Coding
      2.2.5 Radial Basis Functions
   2.3 Tile Coding
      2.3.1 Single Tiling
      2.3.2 Multiple Tilings
   2.4 Review of Problem
3 Proposed Solutions
   3.1 Tile Coding a State Space
   3.2 Adaptive Tile Coding Methods
      3.2.1 Adaptive Tile Coding Implementation Mechanism
      3.2.2 Randomly Splitting Tiles
      3.2.3 Splitting Based on the Size of Features
      3.2.4 Splitting Based on the Observed Frequency of Input Value
      3.2.5 Smart Tile Coding
      3.2.6 Hand Coding
4 Experimental Setup
   RL-Glue
      The Experiment Module
      The Environment Module
      Agent
   Environments
      The Puddle World Problem (Actions, Goal State, Rewards, Objective)
      The Mountain Car Problem (Actions, Goal State, Rewards, Objective)
      The Cart Pole Balance Problem (Actions, Goal State, Rewards, Objective)
5 Experiments
   Methodology
   Baseline Results
   The Puddle World Problem (Random Tile Coding, Feature-based Tile Coding, Value-based Tile Coding, Smart Tile Coding, Claim Validation)
   The Mountain Car Problem (Random Tile Coding, Feature-based Tile Coding, Value-based Tile Coding, Smart Tile Coding, Claim Validation)
   The Cart Pole Balance Problem (Random Tile Coding, Feature-based Tile Coding, Value-based Tile Coding, Smart Tile Coding, Claim Validation)
   Summary
6 Related and Future Work
7 Summary and Conclusions
Bibliography

List of Figures

1.1 Tile representation in single tiling and multiple tilings methods
2.1 Reinforcement Learning framework
2.2 The Grid-world problem represented as an RL problem
2.3 The state values in the Grid-world problem after following a policy with equal probability for all state-action pairs
2.4 The state-action values in the Grid-world problem after following a policy with equal probability for all state-action pairs
2.5 An outline of the two-step backup in the n-step method
2.6 The estimated Q-values by the Sarsa algorithm after the first episode for the Grid-world problem
2.7 The estimated Q-values by the Sarsa(λ) algorithm after the first episode for the Grid-world problem
2.8 The graphical representation of the Q-values from Figures 2.6 and 2.7
2.9 Coarse coding using rectangles
2.10 Partition of two different state spaces S1 and S2 using tile coding
2.11 The features of the simple Grid-world problem represented using single tiling
2.12 The features of a simple Grid-world problem represented using multiple tilings
3.1 A generalized two-dimensional state space using adaptive tile coding
3.2 Example for random tile coding
3.3 Example for feature size based tile coding
3.4 Example for value-based tile coding
3.5 Example for smart tile coding
4.1 RL-Glue framework
4.2 The standard puddle world problem with two puddles
4.3 The standard view of the mountain car environment
4.4 The standard view of the cart pole balance environment
5.1 The graphical representation of the performance of the random tile coding agent vs the multiple tilings agents in the puddle world environment
5.2 The graphical representation of the performance of the feature-based tile coding agent vs the multiple tilings agents in the puddle world environment
5.3 The graphical representation of the performance of the value-based tile coding agent vs the multiple tilings agents in the puddle world environment
5.4 The graphical representation of the performance of the smart tile coding agent vs the multiple tilings agents in the puddle world environment
5.5 The graphical representation of the performance of the various adaptive tile coding methods, the hand-coding method and the multiple tilings method in the puddle world environment
5.6 The graphical representation of the performance of the random tile coding agent vs the multiple tilings agents in the mountain car environment
5.7 The graphical representation of the performance of the feature-based tile coding agent vs the multiple tilings agents in the mountain car environment
5.8 The graphical representation of the performance of the value-based tile coding agent vs the multiple tilings agents in the mountain car environment
5.9 The graphical representation of the performance of the smart tile coding agent vs the multiple tilings agents in the mountain car environment
5.10 The graphical representation of the performance of the various adaptive tile coding methods, the hand-coding method and the multiple tilings method in the mountain car environment
5.11 The graphical representation of the performance of the random tile coding agent vs the multiple tilings agents in the cart pole balance environment
5.12 The graphical representation of the performance of the feature-based tile coding agent vs the multiple tilings agents in the cart pole balance environment
5.13 The graphical representation of the performance of the value-based tile coding agent vs the multiple tilings agents in the cart pole balance environment
5.14 The graphical representation of the performance of the smart tile coding agent vs the multiple tilings agents in the cart pole balance environment
5.15 The graphical representation of the performance of the various adaptive tile coding methods, the hand-coding method and the multiple tilings method in the cart pole balance environment

List of Tables

2.1 The steps followed in the basic Sarsa algorithm
2.2 The steps followed in the Sarsa(λ) algorithm
3.1 General steps followed in an adaptive tile coding method
3.2 General algorithm used to implement an adaptive tile coding method
3.3 A modified form of the Sarsa algorithm to generalize a state space using adaptive tile coding
3.4 Algorithm to implement the ε-greedy policy
3.5 Algorithm to implement random tile coding
3.6 Algorithm to implement feature size based tile coding
3.7 Algorithm to implement value-based tile coding
3.8 Additional steps required in the Sarsa(λ) algorithm to approximate the estimated Q-value deviations during a testing phase
3.9 Algorithm to implement smart tile coding
4.1 General sequence of steps followed in the experiment module
4.2 The sequence of steps followed by RL-Glue during a call
4.3 The common RL-Glue functions used in the environment module
4.4 The common RL-Glue functions used in the agent program
4.5 The possible actions and the corresponding integers used in the puddle world environment
4.6 The possible actions and the corresponding numbers in the mountain car environment
4.7 Possible actions and corresponding integers used in a cart pole balance environment
5.1 The general methodology used to perform the experiments on the testbed
5.2 The average rewards received after every 5 splits in the various adaptive tile coding methods in the puddle world environment
5.3 The average rewards received after every 5 splits for the various adaptive tile coding methods in the mountain car environment
5.4 The average rewards received after every 5 splits in the various adaptive tile coding methods in the cart pole balance environment
5.5 Observed attribute values for random tile coding
5.6 Observed attribute values for feature-based tile coding
5.7 Observed attribute values for value-based tile coding
5.8 Observed attribute values for smart tile coding

CHAPTER 1 INTRODUCTION

Artificial Intelligence (AI) is a branch of computer science that studies and develops intelligent agents that learn to act in new environments. Machine Learning (ML) is a branch of AI that designs algorithms to generalize the observed data of an environment, automating the estimation of patterns, and to use these patterns to maximize the performance of an agent in the environment. Multiple approaches such as Supervised Learning, Unsupervised Learning, Semi-supervised Learning, Reinforcement Learning, etc., are available to implement ML algorithms. In this work, I used algorithms developed using Reinforcement Learning methods. Learning by interaction is one of the most effective and natural methods of learning. It is also one of the important inherent attributes of humans and animals. Making a move in chess, adjusting a thermostat and increasing or decreasing the speed of a car are all examples of interaction-based activities. These interactions can be used by an agent to learn about the environment and improve its performance in the environment. For example, a player making a move in chess tries to use all the experience gained from playing chess previously. This interaction-based learning can be applied to machines by asking an environment to reward an agent for key actions performed by the agent (often just at goals). This approach is implemented as Reinforcement Learning (RL), a goal-directed and interactive learning mechanism wherein an agent interacts with an environment to learn what action to take given a state of the environment (Sutton and Barto, 1998). The agent perceives the state of the environment and takes one of the available actions, which results in a new state and a possible reward.

The objective of the agent is to map states to actions in a way that maximizes the discounted future reward. A value function is assigned to each state to store the estimated discounted cumulative reward of the state. Given a state s and the set of possible actions from s, s is mapped to an action which leads to a state with maximum future reward (as discussed in Chapter 2). Estimating the value function for RL environments with a discrete and small set of states is trivial; a table with each cell representing a state of the environment can be used to store the value functions of all the states. In the case of complex RL environments with continuous states, value functions cannot be learned for every state, as only a few states are visited during training and there are too many states to assign a separate cell for each state. In these environments, a function approximation method can be used to estimate the value functions by splitting the state space into groups (features) of states (the number of groups is considerably less than the number of states) and using a common value function for all of the states in the same group. To update the value function of an observed state s, the update rule is applied to all the features that contain s. As a result, the other states which are present in these features are also affected by the update. The range of this effect is proportional to the number of features a state has in common with s. Different kinds of function approximation methods have been proposed to generalize a state space; one of them is tile coding (Sutton and Barto, 1998), a function approximation method that uses grid-like structures to represent the features of a state space. The grid is chosen such that these features, when grouped, represent the entire state space effectively. Each group of features is known as a tiling and each feature in a tiling is known as a tile (Sutton and Barto, 1998). In single tiling, only a single group (tiling) of features is used to represent the entire state space, whereas in multiple tilings more than one group (tiling) of features is used to represent the entire state space. In Figure 1.1, a two-dimensional state space is represented using the single tiling method and the multiple tilings method, and the tiles which contain the state (4,4) are highlighted in red. Sutton and Barto, in their work, showed that the multiple tilings method is an efficient function approximator.

Though the tiling method proposed by Sutton and Barto has worked well for small numbers of dimensions, where tiling is applied to each dimension individually, it was not fully automated. In this work, we implemented different adaptive tile coding methods for single tiling in an RL environment.

Figure 1.1: Tile representation in the single tiling and multiple tilings methods for a two-dimensional state space. The range of X is -1 to 11 and the range of Y is -5 to 5. The number of a tile is shown inside the corresponding cell and the range of the tile is shown below the corresponding cell. The tiles containing the state (4,4) are highlighted in red. (a) Representation of the states along the X and Y dimensions using single tiling. (b) Representation of the states along the X and Y dimensions using two tilings.
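To make the tile lookup of Figure 1.1 concrete, the following sketch (not taken from the thesis; the function name, grid sizes and offsets are illustrative assumptions) shows how a continuous two-dimensional state can be mapped to the index of the tile that contains it, for a single tiling and for two offset tilings:

def tile_index(x, y, x_range, y_range, n_x, n_y, offset=(0.0, 0.0)):
    """Return the index of the tile containing (x, y) in one uniform grid
    tiling of n_x by n_y tiles over x_range and y_range, shifted by offset."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    col = int((x - x_lo + offset[0]) / (x_hi - x_lo) * n_x)
    row = int((y - y_lo + offset[1]) / (y_hi - y_lo) * n_y)
    col = min(max(col, 0), n_x - 1)   # keep edge states inside the grid
    row = min(max(row, 0), n_y - 1)
    return row * n_x + col

# Single tiling: exactly one tile is active for the state (4, 4).
single = tile_index(4.0, 4.0, (-1.0, 11.0), (-5.0, 5.0), 4, 4)

# Multiple tilings: one active tile per tiling; each tiling is slightly offset,
# so nearby states share some, but usually not all, of their active tiles.
offsets = [(0.0, 0.0), (1.5, 1.25)]
active = [tile_index(4.0, 4.0, (-1.0, 11.0), (-5.0, 5.0), 4, 4, off)
          for off in offsets]

The estimated value of a state is then read from the parameter components of its active tiles, as described in Chapter 2.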

1.1 Thesis Statement

In this thesis we propose to evaluate different tiling methods for generalizing the value functions of a state space. In each of these tiling methods, we start with a small number of tiles and gradually divide the tiles based on how well the learner is performing in the environment. Each tiling method follows a unique policy for selecting the target points to split. The policies used are based on a random tile generator, the number of states represented by features, the frequencies of observed features, and the difference between the deviations of predicted value functions from Monte Carlo estimates for the observed states in each tile. We used these methods to train and test RL agents in different RL environments using the RL-Glue experiment tool (discussed in Chapter 4). The performance of each RL agent is evaluated and then compared with the RL agents developed using the other adaptive tile coding methods and multiple tilings.

1.2 Overview

The details related to this thesis are presented in the following order. Chapter 2 presents the background for this thesis with a detailed description of Reinforcement Learning, tile coding and other generalization methods. In Chapter 3, a detailed description of the algorithms used for implementing all the proposed adaptive tile coding methods is provided. Chapter 4 presents the methodology and the environments used to conduct the experiments. In Chapter 5, the results obtained from the experiments are provided, and the efficiencies of the proposed tile coding methods are evaluated using these results. Chapter 6 discusses the related work and the possible future work in adaptive tile coding methods. In Chapter 7, a summary of the thesis and the conclusions are provided.

CHAPTER 2 Background

This chapter provides background material for this thesis. The first section gives an introduction to Reinforcement Learning, an interaction-based machine learning technique. This technique guides an agent to learn what actions to take at each perceived state in an environment to maximize the cumulative reward. The second section discusses different generalization methods that are used in complex RL environments to generalize the value functions. The third section describes in detail a generalization method called tile coding.

2.1 Reinforcement Learning

Reinforcement Learning is a goal-directed and interactive learning mechanism wherein an agent interacts with an environment to learn what action to take given a state of the environment (Sutton and Barto, 1998). The agent perceives the state of the environment and takes one of the available actions, which results in a new state and a possible reward. The objective of the agent is to map states to actions in a way that maximizes the discounted future reward. The process of mapping states to actions is called learning. The learning mechanism for an agent in RL is different from other computational approaches such as supervised learning. A supervised learning agent uses a set of examples provided by the supervisor to learn an optimal policy, whereas an RL agent uses the rewards obtained to learn an optimal policy, which is based on future rewards rather than immediate rewards (Sutton and Barto, 1998).

RL uses a framework of states, actions and rewards to depict the interactions between an agent and an environment. Figure 2.1 represents the RL framework, which contains an agent and an environment. The learner, which gets to select an action, is called the agent. Everything in the system except the learner is called the environment (Sutton and Barto, 1998). The role of the agent is to observe the current state at the current time step and react accordingly by selecting an action. The role of the environment is to update the state of the system using the selected action and to return a reward, which may be 0, to the agent for selecting the action. In general, this process continues until the agent reaches a goal state.

Figure 2.1: Reinforcement Learning framework: the agent perceives the current state (s_t) of the environment and performs an action (a_t). The selected action is returned to the environment, which generates a new state s_{t+1} using a_t and returns a reward (r_{t+1}) to the agent.

In RL it is the responsibility of the agent to determine how to reach the goal state in an efficient way. The future states that an agent encounters depend on the actions previously taken by the agent. To help the agent choose the correct actions, the environment gives feedback for each selected action in the form of a reward. If a chosen action can help in reaching the goal state in an efficient way, then the environment will return high rewards to the agent. Otherwise, it will punish the agent with low or negative rewards.

To understand more about how rewards are associated with actions, a sample Grid-world problem is presented in Figure 2.2. The entire Grid-world environment is represented using a 3x3 table where each cell represents a state within the environment. The goal of an agent is to learn how to reach the goal state from the start state in an efficient way. The start state can be any state other than a mine state and the goal state. In a given state, each possible action is represented using an arrow pointing towards the resulting state for that action. The reward for each action is given next to the arrow. The value of the reward for an action depends on the resulting state. The actions which lead to the goal state receive 10 as the reward, whereas the actions which lead to a mine state receive -10 as the reward; the rest of the actions receive 0 as the reward.

Figure 2.2: The Grid-world problem represented as an RL problem: Each cell in the table represents a state in the Grid-world. Arrows represent the possible actions and the number next to them represents the rewards. Black cells represent mines.
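The interaction loop described above can be summarized in a few lines of code. The sketch below is not the RL-Glue interface used later in this thesis; the env and agent objects and their methods are hypothetical placeholders for any episodic environment such as the Grid-world:

def run_episode(env, agent, max_steps=1000):
    """One episode of the generic RL loop: observe a state, select an action,
    receive the next state and a reward, and let the agent learn from it."""
    state = env.reset()                      # start state chosen by the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)  # e.g., greedy or epsilon-greedy on Q
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:                             # goal state (or mine) reached
            break
    return total_reward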

A policy (π) is used to define the behavior of an agent (Sutton and Barto, 1998). It is a function that maps each state s and action a to the probability π(s, a) of taking a when in s, where S contains all the states in an environment and A(s) contains all possible actions in s. In general, multiple policies are possible in an environment and it is the responsibility of the agent to find a policy which is optimal in the sense of maximizing the future cumulative reward. An agent, while choosing an action, has to consider not only the immediate rewards but also the future rewards to maximize the cumulative return (Sutton and Barto, 1998):

R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T

where t is the current time step and T is the final time step (when the agent reaches the goal state or the maximum number of steps allowed). The above function works for episodic tasks but not for continuing tasks, as the value of T is infinite in the latter case. A discounted cumulative return (Sutton and Barto, 1998), defined as

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},

is used to generalize the cumulative return for all tasks by introducing a constant called the discount factor (γ). The value of γ decides whether the future rewards are to be considered or not. If the value of γ is set to 0, future rewards are ignored while choosing an action. If the value of γ is set to 1, future rewards are as important as the immediate reward while choosing an action. Generally in RL we choose a value of γ between 0 and 1 to incorporate future rewards and to push the agent to achieve those rewards sooner. The value of γ cannot be set to 1 in continuing tasks as there is no final step in continuing tasks.
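As a worked numerical example of the discount factor (the numbers here are chosen only for illustration and do not come from the thesis): suppose an episode produces the rewards r_{t+1} = 0, r_{t+2} = 0 and r_{t+3} = 10, and then ends. Then

with γ = 0.9:  R_t = 0 + 0.9 * 0 + 0.9^2 * 10 = 8.1
with γ = 0.5:  R_t = 0 + 0.5 * 0 + 0.5^2 * 10 = 2.5

so a smaller discount factor makes the delayed goal reward contribute less to the return, which pushes the agent toward policies that reach rewarding states in fewer steps.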

The value function (Sutton and Barto, 1998) of a state is defined as the expected cumulative reward from starting at the state and following a particular policy in choosing the future actions. The value function for a state s, if an agent follows a policy π, is:

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

A policy π is considered better than or equal to a policy π' if the value function of π is better than or equal to that of π' for all states:

π ≥ π'  if and only if  V^π(s) ≥ V^{π'}(s) for all states s

Figure 2.3: The state values in the Grid-world problem after following a policy with equal probability for all state-action pairs: V is the state value for fixed trials of a policy where each action in a state has the same probability. V* is the state value for an optimal policy.
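The values labeled V in Figure 2.3 are obtained by averaging returns over a fixed set of trials. In the standard Monte Carlo form (the notation here is mine, not the thesis's): if a state s is the starting point of N sampled episodes under π and R^{(i)} is the return observed in the i-th episode, then

V(s) ≈ (1/N) Σ_{i=1}^{N} R^{(i)},

an estimate that approaches V^π(s) as N grows.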

A policy whose value function is better than or equal to the value functions of all other policies at all possible states is known as an optimal policy (π*) (Sutton and Barto, 1998):

V*(s) = max_π V^π(s) for all states s

A policy with the same probability for all actions was used for the above-described Grid-world problem and the results obtained are shown in Figure 2.3. The value of each state is obtained by taking an average of the state values obtained from following a fixed set of paths. The values of V* represent the optimal state values in each state. An action-value function (Sutton and Barto, 1998) for an action a taken in a state s while following a policy π is:

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]

The values of the action-value function are also known as Q-values. An optimal action-value function (Q*) (Sutton and Barto, 1998) is defined as:

Q*(s, a) = max_π Q^π(s, a)

RL tasks are mainly divided into two types: episodic tasks and continuing tasks (Sutton and Barto, 1998). If the agent-environment interactions in an RL task naturally consist of sub-sequences, as in a maze or the mountain car problem, the RL task is considered episodic. If the agent-environment interactions in an RL task do not consist of sub-sequences, as in a real-time robot application, the RL task is considered continuing. According to the above equations, to update the value functions V^π and Q^π for a policy π, an agent has to wait until the end of an episode and has to keep a record of all the rewards obtained and all the states visited. In the case of episodic tasks, updating the value functions using the above equations is complex but possible, as the length of an episode is finite. But in the case of continuing tasks, updating the value functions using the above equations is not feasible, as an episode never ends in continuing tasks.

An alternate update rule which works fine with both continuing and episodic tasks was proposed by Bellman. According to the Bellman update rule, the value function for an observed state-action pair (s_t, a_t), following policy π, is updated in the same step by using the value function of the following state-action pair:

Q^π(s_t, a_t) ← r_{t+1} + γ Q^π(s_{t+1}, a_{t+1})

Figure 2.4: The state-action values in the Grid-world problem: Q is the state-action value for fixed trials of a policy where each action in a state has the same probability. Q* is the state-action value for an optimal policy.

In the Bellman update rule, the state value function of an observed state is updated towards the sum of the immediate reward and the value function of the following state. Similarly, the action-value function of an observed state-action pair is updated towards the sum of the immediate reward and the action-value function of the following state-action pair.

A policy with equal probability for all state-action pairs was used to estimate the action-value functions for the above-described Grid-world problem. The obtained estimated values are shown in Figure 2.4. The Q-value in each state is obtained by taking an average of the state-action values obtained from following a fixed set of paths. The Q* in each state represents the optimal state-action values in that state.

Table 2.1: The steps followed in the basic Sarsa algorithm.

An agent can use the above-described optimal state-action values to find an optimal policy. Different algorithms are available to find the optimal Q-values for the state-action pairs in an environment. In this work I have used a modified form of the Sarsa algorithm (Sutton and Barto, 1998) to find the optimal Q-values. The Sarsa algorithm is a temporal difference method, used to approximate the value functions of a state space using the immediate reward and the current approximated value function. It initializes all the Q-values to a user-defined value and repeatedly updates the Q-values until a reasonable estimate of the Q-values is reached (a point where all the Q-values are near optimal). A brief description of the steps followed in implementing the basic form of the Sarsa algorithm is given in Table 2.1.

At the start of an experiment, the Q-values for all the state-action pairs are initialized to some user-defined value. At the start of each episode, the state s and action a are initialized randomly to help the agent visit different state-action pairs. The agent perceives the current state (initialized by the environment) and takes an action either stochastically or using a Q-policy:

a_t = argmax_a Q(s_t, a)

A constant ε is used to decide if an action has to be taken stochastically or using the Q-policy. If the value of ε is zero, an action is taken using the Q-policy. If the value of ε is non-zero, an action is taken stochastically with a probability ε and using the Q-policy with a probability 1 - ε. After taking the action, the environment updates the current state to the resulting state obtained by taking the action. The agent again has to choose an action in the current state. The agent has to update the values of observed state-action pairs after each step, to improve the accuracy of the estimated value function. As discussed before, the Bellman update rule is used to update the value function of an observed state-action pair. The agent updates Q(s_t, a_t) to minimize the difference between the Q-value of the resulting state-action pair and Q(s_t, a_t):

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]

where α is a learning parameter and γ is a discount factor. The steps in an episode are repeated until a terminal state is reached, at which point a new episode is started at step 2. The Sarsa algorithm only considers the reward obtained and the discounted Q-value of the current state to assess the Q-value of the previous state. This is also known as 1-step backup. In this way, an agent does not have to wait until the end of an episode to update the Q-values of visited states.

The efficiency of this approach depends on the accuracy of the Q-values. At the start of training, the Q-values for all the state-action pairs in a state space are filled arbitrarily (generally with 0); updates based on these Q-values are not reliable and will result in poor training. The learning rate of 1-step backup is also known to be slow. An alternate solution is to use a Monte Carlo method (Sutton and Barto, 1998) at the start of training. A Monte Carlo method waits until it receives the final cumulative reward before updating the Q-values. The discounted value of this cumulative reward is used to update the Q-values of all the state-action pairs visited in that episode. This way the updates to the Q-values are reliable, as the cumulative reward is an observed result. Though the estimates of this method are more reliable, implementing these methods for continuing tasks is not feasible for the reasons explained before.

Figure 2.5: An outline of the two-step backup. In the state-action pairs column, the observed state-action pairs at time steps 1, ..., t-1 and t are shown. The state-action pairs which correspond to the green colored ellipses are eligible for a Q-value update at time step t. Under the time line column, the sequence of time step values until t is shown.

It has been shown that the most accurate Q-values are approximated not by using either the TD method or the Monte Carlo method alone, but by using a generalized form of these two methods called an n-step method (Sutton and Barto, 1998). In the n-step method, not only the value function for the current observed state-action pair is adjusted towards the target, but also the value functions for the previous n state-action pairs are adjusted towards the target; hence it is also known as n-step backup.

In Figure 2.5, the outline of the two-step backup at a time step t is given. The eligible state-action pairs for which the value functions are updated at time step t are the two most recently observed pairs, (s_{t-1}, a_{t-1}) and (s_t, a_t). In the case of 1-step backup, only the most recent pair is updated at time step t. The value of n at which the approximated value functions are most accurate is not the same for all environments. Instead of using an environment-specific n value, a better alternative is to take the average of all possible n-step returns, by assigning weights to each n-step return and making sure that the sum of the weights is equal to 1. Here, the value functions of all the state-action pairs are adjusted towards the target by variable amounts. The amount of adjustment depends on how close a state-action pair is temporally to the current state-action pair. This method can be implemented by introducing eligibility traces (Sutton and Barto, 1998) in the Sarsa algorithm; the resulting algorithm is called the Sarsa(λ) algorithm. The Sarsa(λ) algorithm combines the efficiency of TD learning and the reliability of the Monte Carlo method. In this method, an observed reward is used not only to update the Q-value of the current state but also to update the Q-values of the states which lead to the current state. From an implementation point of view, at any time step, the Q-values of all the states which were most recently observed are also updated along with the Q-value of the current state. A variable associated with each state is used to store the temporal closeness of the state to the current state. Eligibility traces are an array consisting of the above-defined variables for all the states in a state space. At the start of training, the eligibility trace value for each state-action pair is initialized to 0:

e(s, a) = 0 for all s, a

At the start of an episode, the value of the eligibility trace for the most recently observed state-action pair is incremented by 1, as it is temporally closest to the current state and is most responsible for reaching the current state:

e(s_t, a_t) ← e(s_t, a_t) + 1

where (s_t, a_t) is the most recently observed state-action pair. At the end of each step during an episode, the values of the eligibility traces for all the state-action pairs are reduced by a constant factor to indicate that the credit awarded to previously visited state-action pairs for reaching the current state decreases with every time step:

e(s, a) ← γ λ e(s, a) for all s, a

where λ is a trace-decay parameter and γ is a discount factor. The update rule for a Q-value using eligibility traces is given below:

Q(s, a) ← Q(s, a) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) ] e(s, a) for all s, a

where max_{a'} Q(s_{t+1}, a') is the maximum estimated action-value function in the current state and Q(s_t, a_t) is the value function of the previous state-action pair that the agent has chosen. In the above equation, using the eligibility traces to update the Q-values of the state-action pairs allows the agent to reward all those state-action pairs which lead to the current state. The amount of credit given to a state-action pair depends on how close it is temporally to the current state. This general method is called the Sarsa(λ) algorithm and the steps followed in this algorithm are given in Table 2.2. It is evident from the table that most of the steps in Sarsa(λ) are the same as in the Sarsa algorithm except for updating the Q-values of the state-action pairs.

Here, not just the value function of the previously observed state-action pair, but the value functions of all the state-action pairs which were observed during the previous steps of the episode are adjusted towards the current target by variable amounts. The amount by which the value function of a state-action pair is adjusted depends on the corresponding eligibility trace associated with the state-action pair. The values of the eligibility traces for recently visited state-action pairs are relatively higher than those of the rest of the state-action pairs.

Table 2.2: The steps followed in the Sarsa(λ) algorithm.

The learning rate of an agent which uses eligibility traces to estimate the value functions is higher than the learning rate of an agent which does not use eligibility traces to estimate the value functions. To understand more about how eligibility traces affect the learning rate of an agent, the Grid-world problem discussed in Section 2.1 is considered here. Initially all the Q-values of the state space are set to 0. The estimated Q-values of the state space by the Sarsa and Sarsa(λ) algorithms, after an agent completed the path consisting of states 1, 2, 3, 5, 6 and Goal, are given in Figure 2.6 and Figure 2.7.

Figure 2.6: The estimated Q-values by the Sarsa algorithm after the first episode for the Grid-world problem discussed in Section 2.1.

Figure 2.7: The estimated Q-values by the Sarsa(λ) algorithm after the first episode for the Grid-world problem discussed in Section 2.1.

The numbers at the center of each cell in both tables represent the states. In Figure 2.8, the effect of the completed episode on the state space is shown for both algorithms, and it is clear from the figure that by using eligibility traces the learning rate is higher for the Sarsa(λ) algorithm when compared to the Sarsa algorithm. In this work, all the RL environments that are used to evaluate the performance of the various agents are episodic; hence details about applying Monte Carlo methods to continuing tasks are not provided here. In all the experiments, during training, for the first few episodes the Q-values are updated using the Monte Carlo method and for the rest of the episodes the Q-values are updated using the Bellman update rule and eligibility traces.

Figure 2.8: The graphical representation of the Q-values from Figures 2.6 and 2.7. (a) Representation of the knowledge gained using the Sarsa algorithm. (b) Representation of the knowledge gained using the Sarsa(λ) algorithm.

The above-described algorithm works fine for environments with a small and discrete state space, but if the state space is large, a table cannot be used to represent the state values and the state-action values. In such cases, generalization methods are followed to represent the Q-values in a state space. These generalization methods are discussed in detail in the next section.
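The tabular form of the Sarsa(λ) update described above can be sketched as follows. This is a generic, minimal sketch rather than the thesis's implementation; the env interface, the default parameter values and the trace handling are assumptions chosen for illustration:

from collections import defaultdict
import random

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1,
                         max_steps=1000):
    """Run one episode of tabular Sarsa(lambda), updating Q in place."""
    e = defaultdict(float)                 # eligibility traces, reset each episode

    def choose(state):
        # epsilon-greedy: explore with probability epsilon, otherwise be greedy
        actions = env.actions(state)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    s = env.reset()
    a = choose(s)
    for _ in range(max_steps):
        s2, r, done = env.step(a)
        a2 = None if done else choose(s2)
        target = r if done else r + gamma * Q[(s2, a2)]
        delta = target - Q[(s, a)]
        e[(s, a)] += 1.0                   # credit the pair that was just visited
        for key in list(e):                # update every traced pair, then decay it
            Q[key] += alpha * delta * e[key]
            e[key] *= gamma * lam
        if done:
            break
        s, a = s2, a2
    return Q

Here Q is expected to behave like a defaultdict(float), so that unseen state-action pairs start at 0, matching the initialization described above.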

2.2 Generalization Methods

The representation of the value functions for an RL environment comprising a discrete and limited number of states is simple and straightforward; a table can be used to represent the values with a row for each state and a column for each action. But in general, an RL environment contains a large state space. It is not feasible to represent the value functions for these state spaces using a table. Even if a large table is used to represent the state space, the value functions in most of the cells are not updated, for the reason that the number of states visited is far less than the total number of states present in the environment; hence an agent has to generalize the value functions obtained from a limited set of visited states to the entire state space. Several generalization methods are available to approximate the value functions for a large state space. In this thesis I have used a function approximation method (Sutton and Barto, 1998) to generalize the value functions of a state space. The function approximation method is used to generalize a target function from a set of examples. More about the implementation of this method will be discussed in the following sections.

2.2.1 Basic Idea of Generalization Methods

In general, the difference in feature values of states which are close enough in a large state space is minimal. A generalization method exploits this fact and uses the feature values of visited states to approximate the feature values of nearby states. This allows an RL agent to train efficiently on a small set of states and apply this knowledge to a larger set. The problem with this approach is the boundary of generalization. If the boundary is too small then only the value functions of those states which are really close to a visited state are approximated. This improves the accuracy of the results, but it affects the extent of generalization.

If the boundary is too big then the extent of generalization is improved, but states which are not close enough might end up in the same group. Different generalization methods use different approaches to solve the boundary problem. In the following sections, function approximation, a common generalization approach, and various other generalization methods which are obtained by extending the function approximation method are discussed in detail.

2.2.2 Function Approximation

Function approximation is a supervised learning technique which approximates a target function from a set of examples (Sutton and Barto, 1998). For example, given a set of examples of the form (x, f(x)), function approximation has to approximate the target function f. In this thesis, I am using the function approximation method to approximate a parameterized Q-value function for the state-action pairs in a state space using a limited set of Q-values estimated for observed state-action pairs. In this context, (s, a) is an observed state-action pair and Q(s, a) is the Q-value for action a taken in state s. The generalized function has a parameter θ, a variable vector with each component representing a group of states in the state space. The values of the components in the variable vector are used to compute the Q-values of all the state-action pairs in a state space. The above-specified parameter vector can be implemented in different ways. In this thesis I have used a linear model of the gradient-descent method to implement the parameter vector. In an environment with a discrete and small state space, a table is used to represent the entire state space. Each cell of the table represents a unique state in the state space, and it is easy to keep track of the value function for each state. If a state space is continuous then a table cannot be used to represent the states of the state space, as it is not feasible to allocate a separate cell for each state in the state space. Even assuming a very large table is allocated for the entire state space, the value functions for most of the cells in the table are not estimated, as the number of states visited during training is far less than the total number of states in the state space.

A function approximation method can solve the above two problems of representing a continuous state space. Since a parameter vector θ represents an entire state space using a far smaller number of vector components, the representation of the state space becomes simple. Since each component of θ represents a group of states in the state space, to estimate the value functions for the entire state space we just have to make sure that a reasonable number of states in each component of θ is visited during training. The value function of a state is obtained using the values of the vector components in which the state is present. As the number of components in the vector is far less than the actual count of states, it is possible to visit a reasonable number of states in each component of θ. To understand more about how a parameter vector generalizes value functions, the state space of the mountain car is represented in the form of a generalized function below. The state space of the mountain car consists of two dimensions: the position of the car (X) and the velocity of the car (Y). The range of X is from -1.2 to 0.6 and the range of Y is from -0.07 to 0.07; both X and Y are continuous. Each component in θ is called a feature and each feature contains a set of states from the state space. As θ represents all the states of an environment, all values of X and Y must be present in at least one component/feature of θ. Assuming that θ_i represents the i-th component of θ, one way to generalize the features X and Y is given below:

θ_1 = [-1.21, -0.9)
θ_2 = [-1.0, -0.3)
θ_3 = [-0.5, 0.1)
...

For the sake of simplicity, in the above vector, only the values of a single dimension are placed in each feature. Some components of θ are used to represent the features comprising X and the remaining components are used to represent the features comprising Y. If an observed state has x = 0.3 and y = 0.01, then only the Q-values of the components whose ranges contain these values are adjusted towards a target; the Q-values of the rest of the components are not disturbed. Updating the Q-values of these components affects the value functions of the states in [0.3, 0.61) for X and [0.0, 0.05) for Y. It is important to make sure that the states grouped into each feature of θ are closely related. Otherwise, the accuracy of the estimated value function is affected. Various methods are available to group the states into the different components of θ.

2.2.3 Gradient-Descent Methods

Gradient descent is a form of the function approximation method which uses a column vector θ filled with real-valued components as the parameter (Sutton and Barto, 1998). The number of elements in θ is fixed at the start of training and is a lot less than the number of states in a state space. At each time step, the difference between a target and the estimated Q-value of an observed state-action pair is used to update θ to reduce the mean squared error in the approximation:

θ_{t+1} = θ_t + α [ v_t - Q_t(s_t, a_t) ] e_t

where α is a learning rate parameter, v_t is the target and e_t is an eligibility trace vector at time t. The value of v_t depends on the learning mechanism used; in a Monte Carlo method, v_t is the discounted cumulative reward, whereas in a Temporal Difference method, the target is the sum of the observed reward and the best Q-value for the next state. It is also required to update e_t at each time step to adjust the weights of θ:

e_t = γ λ e_{t-1} + ∇_θ Q_t(s_t, a_t)

where γ is a discount factor, e_{t-1} is the eligibility trace component from the previous time step, and ∇_θ Q_t(s_t, a_t) is the vector derivative of Q_t with respect to θ. In this method, the Q-value function is obtained by applying a differentiable function to the parameter vector θ. Different types of gradient-descent methods apply various differentiable functions to θ.

2.2.4 Coarse Coding

Coarse coding is a linear form of the gradient-descent function approximation method (Sutton and Barto, 1998). It uses several instances of a geometrical shape G to represent the features in a state space S, where each feature consists of a subset of the states in S. Each instance of G represents a feature in the state space and a component in the feature vector. If a set of states have some particular property in common, they are all grouped into one feature. If a state s lies within the boundaries of an instance of G then s has the feature corresponding to that instance. The presence of a feature in a state can be represented in different ways, but in general a binary digit is used. The number of features shared by two different states is equal to the number of instances common to both states. The scope of generalization depends on the shape and size of an instance. For example, in Figure 2.9 generalization along the X-axis is more prominent when compared to the generalization along the Y-axis; hence rectangles elongated along the X-axis are used to represent the features. If A is observed during training, then generalization from A to C is better than generalization from A to D, as A has three features in common with C but only two with D.

Moreover, there will be no generalization from A to E, as there are no common features between A and E. With large instances, generalization is quick and broad but not accurate. With small instances, generalization is slow and narrow but the accuracy is much better. Using large numbers of moderately sized instances allows generalization to be quick, broad and near optimal.

Figure 2.9: Coarse coding using rectangles: A, B, C and D are different states in a state space. The number of features shared by any two states in the state space is equal to the number of rectangles which contain both states.

The use of binary numbers to represent the presence of a feature in a state is not an efficient way of generalizing. If an observed state x and two other states y and z each have only one instance in common with x, then the generalization is the same for y and z, as the feature value is set to one for both y and z. A better generalization method should consider how y and z are related to x with respect to those features before generalizing from x to y and z.

For example, in Figure 2.9, assume state D was observed during training. If binary numbers are used to represent feature presence, then, as D has only one feature in common with both B and F, while adjusting the values of the features both B and F are adjusted by the same amount. It is clear from the figure that state D lies somewhere near the corner of the feature which contains both D and F (feature 1), whereas it lies near the middle of the feature which contains both D and B (feature 2). This means the commonality between D and B is greater than that between D and F. B should actually be adjusted more towards the target than F, but this is not possible with binary features. The following method resolves this problem by considering not only the presence of a state in a feature but also the distance between the observed state and that state while adjusting the value function of the state towards the target.

2.2.5 Radial Basis Functions

A function whose values depend only on the distance from a particular point is known as a radial basis function. Radial basis functions are used in the coarse coding method to estimate the degree of presence of a feature in a state. The degree of presence of a feature in a state is represented using a value between 0 and 1. If an instance with center c_i and width σ_i is used to represent feature i, then the feature value φ_s(i) depends on how far the state s is from the center c_i relative to the width σ_i (Sutton and Barto, 1998):

φ_s(i) = exp( - |s - c_i|^2 / (2 σ_i^2) )

For example, in Figure 2.9, if we use radial basis functions instead of binary numbers to estimate the presence of a feature in state D, and assume that the width of feature 1 is σ_1, the distance between state D and the center of feature 1 is d_1, the width of feature 2 is σ_2 and the distance between state D and the center of feature 2 is d_2, the resultant radial basis functions are:

φ_D(1) = exp( - d_1^2 / (2 σ_1^2) )
φ_D(2) = exp( - d_2^2 / (2 σ_2^2) )

It is evident from the figure that d_1 is large relative to σ_1 while d_2 is small relative to σ_2, which means the value inside the exponential is more negative for feature 1 than for feature 2. As a result, the value of φ_D(1) is less than φ_D(2), from which we can deduce that state D is more closely associated with feature 2 than with feature 1. Since D is the observed state, the adjustment of the value functions towards the target for states in feature 2 is higher than that of the value functions for states in feature 1.

2.3 Tile Coding

Tile coding is an extension of the coarse coding generalization method (Sutton and Barto, 1998; Stone, Sutton, and Kuhlmann, 2005). In tile coding, a state space is generalized using a set of features which represent a partition of the state space. The set of features is grouped in such a way that each single group contains non-overlapping features that represent the entire state space. A group of non-overlapping features that represents the entire state space is also known as a tiling. The accuracy of the generalization in tile coding improves with the number of tilings. Each feature in a tiling is called a tile, and each tile has an associated binary value to indicate whether the corresponding feature is present in the current state or not. The direction and accuracy of a generalized state space depend primarily on the shape and size of the tiles. In general, the shape of a tile affects the direction of generalization whereas the size of a tile affects the range of generalization. In Figure 2.10, single tiling is applied to generalize two different state spaces. Tiles with different shapes and sizes are used to represent the features of these state spaces. It is evident from the figure that the direction of generalization is along Y for S1, and along X for S2.

In the following sections, these two tile coding methods are discussed in detail.

Figure 2.10: Partition of two different state spaces S1 and S2 using tile coding. (a) S1 is generalized along the Y direction. (b) S2 is generalized along the X direction.

The update rule used by a linear generalization method to estimate the value functions is based on the estimate

Q_t(s, a) = Σ_i θ_t(i) φ_s(i)

where Q_t is the estimated value function, θ_t is the parameter vector and φ_s is the component vector. In a linear generalization method which uses gradient-descent approximation to estimate the value function, the gradient of Q_t with respect to θ_t is used to update the value function:

θ_{t+1} = θ_t + α [ v_t - Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t),  with  ∇_θ Q_t(s_t, a_t) = φ_{s_t}

Tile coding uses an extended form of the above-specified gradient-descent update rule to estimate the value functions:

θ_{t+1}(i) = θ_t(i) + α [ v_t - Q_t(s_t, a_t) ] for each tile i that is active in s_t,

since the presence of a feature is represented using a binary value associated with the corresponding component of the feature vector. Unlike in coarse coding, where the number of features observed at a point depends on the current observed state, tile coding always has the same number of features for any state. This simplifies the update rule, as the number of active features can be set to a constant value. The number of features present in an observed state always depends on the number of tilings used to represent the features of a state space. In the following sections, the implementation of single tiling and multiple tilings is discussed in detail. In this thesis, I have used only single tiling for implementing adaptive tile coding.

2.3.1 Single Tiling

In single tiling, all the features of a state space are represented using one tiling. Any state in the state space must correspond to only one of the tiles in the tiling. During an experiment, at each step, the feature that contains the observed state is selected and the value function of the corresponding tile is updated according to the rewards obtained in the following steps. When the value function of a tile is updated, the value functions of all the states within the tile are affected equally with the observed state. Hence, while choosing features for a state space, states with identical value functions are placed in the same tile. Since every state in the state space corresponds to only one of the features of the tiling and only a single tiling is available, the number of features selected at any step is constant and is equal to one. A binary variable is associated with each tile to indicate the presence of the corresponding feature at each step. The binary variable of the tile corresponding to the observed feature is set to 1 and the binary variables of the rest of the tiles are set to 0. In the case of single tiling, the above-specified tile coding update rule can update only one feature at each step, as the number of tiles selected is restricted to one per tiling and the number of available tilings is one.

Figure 2.11: The features of the simple Grid-world problem represented using single tiling. A two-dimensional grid is used to represent the non-overlapping features, with the x-axis representing X values and the y-axis representing Y values. Each cell represents a tile with width and height equal to 0.25. The number inside each cell indicates the tile number.

To understand more about how to generalize a state space using single tiling, the simple Grid-world problem discussed earlier is revisited. The Grid-world problem contains two dimensions X and Y with values in the range 0 to 1. The state space is partitioned into multiple non-overlapping features using single tiling as shown in Figure 2.11. Each cell (t1, t2, ..., t16) of the table represents a tile in the tiling. In this case, the parameter vector θ has one component per tile, and the value of the component vector φ_s is 1 for the tile that corresponds to the feature which contains s and 0 for the rest of the tiles. At any step, an observed state corresponds to only one of the 16 available tiles. For example, assume that the current observed state (represented using a dot in the above figure) s has an X value equal to 0.59 and a Y value between 0.25 and 0.5. To update the value function of s, the value function of t7, which contains s, has to be updated. As a result of the update, the value functions of all those states which are present in t7 are also affected.
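A minimal sketch of this single-tiling lookup and update for the 4x4 grid of Figure 2.11 is shown below. It is illustrative only: the tile numbering, the step size and the example Y value of 0.30 (any value in the second row would do) are assumptions, not values taken from the thesis:

def active_tile(x, y, tiles_per_dim=4, low=0.0, high=1.0):
    """Index (0..15) of the single tile containing (x, y) in a uniform 4x4
    tiling of [0, 1] x [0, 1], numbered left to right, bottom row first."""
    width = (high - low) / tiles_per_dim
    col = min(int((x - low) / width), tiles_per_dim - 1)
    row = min(int((y - low) / width), tiles_per_dim - 1)
    return row * tiles_per_dim + col

theta = [0.0] * 16                  # one shared value estimate per tile

def update(x, y, target, alpha=0.1):
    """Move the value of the tile containing (x, y) towards the target; every
    state that falls in the same tile shares the adjusted value."""
    i = active_tile(x, y)
    theta[i] += alpha * (target - theta[i])

# The state s from Figure 2.11 activates index 6, i.e., tile t7 in the
# figure's one-based numbering, and no other tile is touched.
update(0.59, 0.30, target=1.0)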

To improve the accuracy of a generalization, a state space has to be partitioned in such a way that the resulting tiles contain only closely associated states, which in turn limits the range of generalization. As an alternative, another tile coding technique called multiple tilings is used to increase the range of generalization without affecting the accuracy of the approximation.

2.3.2 Multiple Tilings

In multiple tilings, the features of a state space are represented using more than one tiling. Any state in the state space must correspond to only one of the tiles in each tiling. During an experiment, at each step, the tile that contains the observed state is selected in each tiling and the value functions of those tiles are updated according to the rewards obtained in the following steps. Unlike in single tiling, updating the value functions of these tiles does not equally affect the value functions of the other states which correspond to the same tiles. The value functions of the states which appear in most of the active tiles are affected more than the value functions of the states which appear in few of the active tiles. This improves the range of generalization without limiting the accuracy of generalization, as updating multiple tiles at the same time affects a larger number of states, and only the states which have more features in common with the observed state are adjusted more towards the target than the other states which have only a few features in common with the observed state. Since every state in the state space corresponds to only one of the features of a tiling, the number of features selected at any step is constant and is equal to the number of tilings. A binary variable is associated with each tile to indicate the presence of the corresponding feature at each step. The binary variables of the tiles corresponding to the observed features are set to 1 and the binary variables of the rest of the tiles are set to 0. To understand more about multiple tilings, the state space of the above-discussed Grid-world problem is partitioned into multiple non-overlapping features using multiple tilings as shown in Figure 2.12. Two tilings are used to represent all the features of the state space.

Each cell in the tables represents a tile in one of the two tilings. In this case, the parameter vector θ has one component per tile in each tiling, and the value of the component vector φ_s is 1 for the tiles that correspond to the features which contain the current observed state and 0 for the rest of the tiles. At any step, an observed state corresponds to one of the tiles in tiling 1 and one in tiling 2. For example, assume that the current observed state (represented using a dot in the above figure) s has an X value equal to 0.59 and the same Y value as in the single tiling example. The two tiles, one from tiling 1 and another from tiling 2, that contain s are highlighted in the figure. The value functions of these two tiles are adjusted towards the target value. In the figure, the states which correspond to both of the highlighted tiles are affected more than the states which correspond to only one of the highlighted tiles.

Figure 2.12: The features of a simple Grid-world problem represented using multiple tilings. Two different two-dimensional grids are used to represent the non-overlapping features, with the x-axis representing X values and the y-axis representing Y values. Each cell in both tables represents a tile with width and height equal to 0.25.

Another fact to notice is the difference in the value of the step-size parameter α needed by single tiling and multiple tilings to update the Q-values at the same step rate. In single tiling, α is simply set to the required step size. In multiple tilings, α is set to the required step size divided by n, where n is the number of tilings.

2.4 Review of Problem

The efficiency of the tile coding method depends on how well the tiles partition the state space into groups of similar states. Sutton and Barto applied both single tiling and multiple tilings to generalize different RL state spaces, and suggested that multiple tilings is an efficient way to generalize lower-dimensional state spaces. They used a manual approach to group the states of an RL state space for generalization. The drawback of this approach is that the user must have a thorough understanding of the environment to partition the state space effectively. An alternative is to give control to the RL agent and automate the process of grouping the states. This approach is known as adaptive tile coding. Whiteson, Taylor and Stone used the Bellman update rule to implement this approach. In this thesis, we propose different adaptive tile coding methods and evaluate their efficiencies.

CHAPTER 3 Proposed Solutions

In this chapter I present several adaptive tile coding methods that I have used as an alternative to multiple tilings for solving RL problems. An adaptive tile coding method is a technique used to automate the feature-defining process of tile coding (Whiteson et al., 2007). The first section describes the tile split concept and its advantages. The second section provides the algorithms and implementation details of the different adaptive tile coding methods: random tile coding, feature-based tile coding, value-based tile coding, smart tile coding and hand-coded tiling. In random tile coding, a tile is selected randomly from the set of available tiles for further division. In feature-based tile coding, tile division is based on the feature's size (the portion of a dimension represented by the feature). In value-based tile coding, tile division is based on the frequency of the observed states. In smart tile coding, split-point selection is based on the estimated deviation of the Q-values. In hand-coded tiling, split points are selected manually.

3.1 Tile Coding a State Space

In tile coding, the direction (the dimension along which generalization is prominent) and the accuracy (how closely related the states in the same tile are) of a generalized state space depend primarily on the shape and size of the tiles representing the state space. Manually partitioning a state space requires significant knowledge of the state space. An alternative is to automate the partitioning by using the knowledge gained during learning. This mechanism is known as adaptive tile coding (Whiteson et al., 2007). The steps followed in adaptive tile coding are given in Table 3.1.

In linear adaptive tile coding, a different set of tiles is used to represent each feature of a state space. At the start, a minimal number of tiles (often two) is used to represent each feature. Using the RL mechanism, the action-value functions of the available tiles are approximated. These approximated action-value functions represent the generalized action-value functions for the entire state space. Because only a minimal number of tiles is used, states with conflicting action-value functions might correspond to the same tile. A policy based on parameters such as the action-value functions and the frequency of the observed states is used to select a point that separates the most conflicting states. The tile containing that point is divided into two separate tiles by splitting at the point. The action-value functions are then approximated again using the available tiles, and again a tile containing a point that separates the most conflicting states is split at that point. The process is repeated until the policy determines that splitting can no longer improve generalization or the splitting count reaches the required value. A code sketch of this loop is given after Table 3.1.

1. Choose a minimal number of tiles for each feature in a state space.
2. Repeat (tile split):
   a. Approximate the action-value functions for all states in the state space using the available tiles.
   b. Use a policy to find a state which divides the states that have conflicting value functions.
   c. Find the tile containing the selected state and split that tile into two tiles at the selected state.
   Until the policy determines splitting can no longer improve generalization.

Table 3.1: General steps followed in an adaptive tile coding method.
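The loop in Table 3.1 can be summarized in a short sketch. The following is a minimal Python rendering of the split-train-test cycle, with the method-specific pieces passed in as callables; the function and parameter names are my own, not the thesis implementation.

```python
# Minimal sketch of the adaptive tile coding loop of Table 3.1.
# train, evaluate, choose_split and apply_split are assumed callables that stand in
# for the method-specific pieces described later in this chapter.
def adaptive_tile_coding(tiles, train, evaluate, choose_split, apply_split, max_splits=20):
    history = []
    for _ in range(max_splits):
        q_values = train(tiles)                        # 2a: approximate Q-values with the current tiles
        stats = evaluate(tiles, q_values)              # exploit the learned values, gather split statistics
        history.append(stats)
        split = choose_split(tiles, q_values, stats)   # 2b: method-specific policy picks a split point
        if split is None:                              # the policy decides splitting no longer helps
            break
        tiles = apply_split(tiles, split)              # 2c: split the selected tile into two tiles
    return tiles, history
```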

Figure 3.1 shows how an imaginary two-dimensional state space might be tiled using adaptive tile coding. Initially, each feature in the state space is represented using two tiles. Assuming that the adaptive tile coding policy selects one of these tiles for further division, that tile is split at the selected point into two new tiles. In the second pass, the tile containing the next selected point is divided at that point into two tiles. Similarly, in the last pass, the selected tile is divided at 1.2 into two tiles.

Figure 3.1: A generalized two-dimensional state space using adaptive tile coding. On the left, each cell represents a tile that generalizes the group of states specified at the bottom of the cell, for features x and y. On the right, the number of splits performed on the state space.

In adaptive tile coding, generalization at the start of learning is broad and unreliable, as only a few tiles are used to partition the state space. During the course of learning, splitting tiles makes the generalization narrower, but this does not guarantee an improvement in its accuracy. The accuracy of a generalization depends on how effectively the policy partitions the state space. Different policies are used to split tiles; most of them use the approximated value functions.

In the paper Adaptive Tile Coding Using Value Function Approximation (Whiteson et al., 2007), a policy based on the Bellman error was used to select the split points. In the next section, I describe the different policies I used to implement adaptive tile coding.

3.2 Adaptive Tile Coding Methods

Different methods use different approaches to select a split point in adaptive tile coding. In this work, I implemented methods based on random selection, feature size, frequency of the observed states, the difference between the estimated Q-value deviations of the states in a tile, and hand-coding. The algorithms used to implement these methods are described in the following sections. All of these algorithms are derived from the Sarsa(λ) algorithm. The next chapter reports the efficiencies of these methods, obtained by evaluating the performance of RL agents developed with them.

3.2.1 Adaptive Tile Coding Implementation Mechanism

The implementation of the following adaptive tile coding methods is a repeated process of training (exploring) an agent using the available tiles, testing (exploiting) the performance of the agent, and splitting a tile selected by the specific adaptive tile coding method. The training and testing phases are implemented in the same way for all adaptive tile coding methods, using a modified form of the Sarsa(λ) algorithm. A generalized algorithm that controls these three phases is provided in Table 3.2. All steps of the algorithm except splitting_method() are the same for all of the following adaptive tile coding methods. The implementation of splitting_method() is unique to each method; it defines the approach the method uses to partition the tiles. In feature-size-based tile coding and smart tile coding, additional steps are required in the Sarsa(λ) algorithm at the end of each pass during the testing phase.

The additional steps required for these adaptive tile coding methods are discussed in separate sections. The input to the generalized algorithm is the number of episodes for training, the number of episodes for testing, the maximum number of splits to perform, the minimum number of tiles to start with, the maximum number of steps to perform, the initial Q-values, the function to retrieve the tiles corresponding to the current state, and the values for the different RL parameters. The algorithm has an exploration/training phase, an exploitation/testing phase and a splitting phase. For each split, the exploration phase runs for N episodes and the exploitation phase runs for M episodes. The exploration and exploitation phases use the modified form of the Sarsa(λ) algorithm provided in Table 3.3.

where
    P is the number of splits to use
    N is the number of training episodes
    M is the number of testing episodes
    the number of tiles representing each dimension at the start
    the parameters for runEpisode (see Table 3.3)

repeat P times:
    run N training episodes with runEpisode     // Training
    run M testing episodes with runEpisode      // Testing
    splitting_method()                          // Depends on the adaptive tile coding method

Table 3.2: General algorithm used to implement an adaptive tile coding method.

49 where, -greedy learning // Current state vector repeat calculate ++ else fi,, -greedypolicy() // Number of steps so far // Used in value-based tile coding only // t is the tile vector for the current state ++ // value-based tile coding only action with max Q-value if (!= null) fi until // Update for final action Table 3.3: A modified form of the Sarsa( ) algorithm to generalize a state space using tile coding. 39
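As a companion to Tables 3.3 and 3.4, the following is a minimal runnable Python sketch of one episode of Sarsa(λ) over binary tile features with an ε-greedy policy. The environment interface (reset/step), the dictionary-based weight and trace storage, and all identifiers are my own assumptions, not the thesis code; the per-tile visit counter is included only because, as the listing notes, it is used by value-based tile coding.

```python
# Hedged sketch of one Sarsa(lambda) episode over binary tile features.
import random
from collections import defaultdict

def run_episode(env, get_tiles, q=None, z=None, actions=(0, 1, 2, 3),
                alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1, max_steps=1000, visits=None):
    """q maps (tile, action) -> weight; z holds eligibility traces; visits counts
    tile visits (used by value-based tile coding only). alpha may need to be
    scaled down when several tiles are active at once."""
    q = defaultdict(float) if q is None else q
    z = defaultdict(float) if z is None else z

    def q_value(tiles, a):
        return sum(q[(t, a)] for t in tiles)            # value = sum over active tiles

    def e_greedy(tiles):
        if random.random() < epsilon:                   # explore with probability epsilon
            return random.choice(actions)
        return max(actions, key=lambda a: q_value(tiles, a))   # otherwise pick the greedy action

    s = env.reset()
    tiles = get_tiles(s)
    a = e_greedy(tiles)
    for _ in range(max_steps):
        s_next, reward, done = env.step(a)
        if visits is not None:                          # value-based tile coding only
            for t in tiles:
                visits[t] = visits.get(t, 0) + 1
        delta = reward - q_value(tiles, a)
        for t in tiles:
            z[(t, a)] += 1.0                            # accumulate traces for the active tiles
        if done:
            for key in list(z):
                q[key] += alpha * delta * z[key]        # update for the final action
            break
        tiles_next = get_tiles(s_next)
        a_next = e_greedy(tiles_next)
        delta += gamma * q_value(tiles_next, a_next)
        for key in list(z):
            q[key] += alpha * delta * z[key]
            z[key] *= gamma * lam                       # decay all traces
        tiles, a = tiles_next, a_next
    return q
```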

ε-greedyPolicy()
    if random() < ε          // random() returns a random value between 0 and 1
        return a random action
    else
        return the action with the maximum Q-value
    fi

Table 3.4: Algorithm to implement the ε-greedy policy. It is used to select an action during training.

The implementation of the Sarsa(λ) algorithm is discussed in Chapter 2. At the start of an experiment, the starting state of the agent is initialized randomly, with every possible state having non-zero probability. The tiles representing the current observed state are returned by the tile-retrieval function. During the exploration phase, an ε-greedy policy (Table 3.4) is used to train the agent and approximate the Q-values of the tiles. The Q-value of a tile is approximated only if the tile contains at least one point representing a feature of an observed state. Hence, the value of ε has to be chosen carefully, so that at least a few points of every available tile are visited during training. During the exploitation phase, the Q-policy is used to exploit the knowledge gained during training and to evaluate the performance of the agent. In some adaptive tile coding methods, the data required for splitting a tile is also gathered during this phase. During the splitting phase, a tile is selected from the set of current tiles following the policy of the specific adaptive tile coding method, and the selected tile is partitioned into two tiles. A variable stores the cumulative reward received during the exploitation phase of each split. The sequence of exploration, exploitation and splitting is repeated for P times.

3.2.2 Randomly Splitting Tiles

In the random tile coding method, the tiles are chosen randomly for further partition. At the start, each feature in the state space is represented using the minimal number of tiles (two in this case).

At the end of each pass, a tile is selected randomly and is divided into two new tiles of equal size. The algorithm used for implementing the RL mechanism of random tile coding is given in Table 3.2. It includes a modified form of the Sarsa(λ) algorithm (explained in section 2.3) given in Table 3.3. The implementation of splitting_method() is provided in Table 3.5. At the start of the algorithm, two variables are defined to store the point and the tile at which a split will occur; both variables are initialized to 0. The split tile is set to a tile selected randomly from the set of current tiles, and the split point is set to the mid-point of that tile. At the end, the selected tile is split at the split point into two new, separate tiles.

splitting_method()
    select a tile randomly from the current tilings
    set the split point to the mid-point of the selected tile
    split the selected tile at the split point

Table 3.5: Algorithm to implement splitting_method() in random tile coding.

To understand how random tile coding is performed, an example is provided in Figure 3.2. The state space consists of two features X and Y, with X ranging from -4 to 4 and Y ranging from 0 to 2. At the start, all the features in the state space are partitioned into two tiles of equal size. At the end of the first pass, tile 4 is selected randomly and is partitioned into tile 4 and tile 5. At the end of the second pass, tile 1 is selected randomly and partitioned into tile 1 and tile 6. At the end of the third pass, tile 1 is selected randomly and is partitioned into tile 1 and tile 7.
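A minimal Python sketch of the random splitting policy of Table 3.5 is shown below. The representation of a tiling as a list of (low, high) intervals per dimension is an illustrative assumption; the example ranges match the Figure 3.2 state space (X from -4 to 4, Y from 0 to 2).

```python
# Minimal sketch of random splitting: pick any tile, split it at its mid-point.
import random

def random_split(tilings):
    """tilings: one list of (low, high) tiles per input dimension."""
    dim = random.randrange(len(tilings))            # choose a dimension/tiling at random
    tiles = tilings[dim]
    i = random.randrange(len(tiles))                # choose a tile at random
    low, high = tiles[i]
    split_point = (low + high) / 2.0                # always split at the mid-point
    tiles[i] = (low, split_point)
    tiles.insert(i + 1, (split_point, high))
    return dim, split_point

# Example: X in [-4, 4] and Y in [0, 2], each initially covered by two tiles.
tilings = [[(-4.0, 0.0), (0.0, 4.0)], [(0.0, 1.0), (1.0, 2.0)]]
print(random_split(tilings), tilings)
```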

Figure 3.2: Example for random tile coding. The state space contains features X and Y. Each row represents the state of the tiles and the splits performed so far. Each cell represents a tile whose number is shown at the top of the cell, with the tile's range shown at the bottom of the cell. During each pass, a randomly chosen split is performed and is represented using a red line.

3.2.3 Splitting Based on the Size of Features

In this method, splitting of tiles is based on the size of the features in a state space (a feature's extent, measured from the lowest to the highest value it can represent). At the start, each dimension is represented using a minimal number of tiles. During the course of the experiment, a single tile is split into two new tiles of equal size in each pass. In feature-based tile coding, only the tiles for which the ratio of the tile size to the corresponding dimension size is maximum are eligible for splitting. If there is only one eligible tile, it is

53 split into two new tiles of equal size. If there are multiple eligible tiles, one of the eligible tiles is selected randomly and split into two new tiles of equal size. For implementation, all the tiles in a state space are organized into groups of different levels, with newly formed tiles being grouped one level higher than its predecessor. A tile from any level (except first level) is allowed to split further only after splitting all the tiles at lower levels. If more than one tile is present at a level which is eligible for splitting, one of the tiles from that level is selected randomly. create an integer variable and an integer counter during the first pass of the algorithm, to to of to and i all of the available tiles of Table 3.6: Algorithm to implement in feature size based tile coding. 43

54 The algorithm used for implementing the RL mechanism of feature size based tile coding is given in Table 3.2. It includes a modified form of the Sarsa(λ) algorithm (explained in section 2.3) given in Table 3.3. The implementation of () for feature size based tile coding is provided in Table 3.6. During the first pass, it defines two integer variables to store the eligibility level of each available tile and the current eligibility level. The first variable is a dynamic integer array called with size equal to the total number of available tiles. The size of the array increases as new tiles are added to the state space. Each tile in the state space is associated with a cell of. The value of cells in is used to keep track of current level of tiles in a state space. Initially the values of all the cells in a are set to, so that all available tiles at the start of algorithm are placed at level. The second variable is an integer variable called, which represents the level of tiles eligible for a split. The value of is initialized to at the start of a learning. It means that any tile which is at level is eligible for a split. A tile is eligible for a split, only if the value of its corresponding cell in is equal to. The algorithm to implement in feature-based tile coding is provided in Table 3.6. At the start of the algorithm, two variables and are defined to store the point and the tile at which a split would occur; both the variables are initialized to 0. During splitting, the values of all the cells in are compared with, to check if there is at least one tile which is eligible for splitting. If there is only one tile which has the same value as, is set to the tile and is set to the mid-point of the tile. If there is more than one tile which has the same value as, all eligible tiles are added to a new set and a tile is selected randomly from The is set to the selected tile and the target point is set to the mid-point of the selected tile. In case if all the tile at are already partitioned, no tile will have value equal to. In this case, the value of is incremented by one and a tile is selected randomly from all available tiles in state space. The value of 44

55 is set to the selected tile and the value of is set to the mid-point of the selected tile. In above all cases, the values of corresponding indexes in for new tiles are set to one level higher than that of their parent tile level. This avoids the possibility of splitting newly formed tiles with less size before splitting tiles with bigger size. At the end, is split at into two new tiles. The sequence of exploration, exploitation and splitting is repeated for P times. Figure 3.3: Example for feature size based tile coding. The states space contains two dimensions X and Y. Each row represents the state of tiles, splits performed so far and. Each cell indicates a tile with tile number on the top, range at the bottom and at the center. 45
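The level bookkeeping described above can be sketched as follows. This is an illustrative Python version, not the thesis listing; the Tile class and the notions of a per-tile level and a current eligible level are my own labels for the array and counter described in the text.

```python
# Sketch of feature-size-based splitting with per-tile levels (illustrative names).
import random
from dataclasses import dataclass

@dataclass(eq=False)
class Tile:
    low: float
    high: float
    level: int = 1

def feature_size_split(tilings, eligible_level):
    """Split one tile from the current eligible level at its mid-point.
    tilings is a list (one entry per dimension) of lists of Tile objects."""
    eligible = [(d, t) for d, tiles in enumerate(tilings) for t in tiles
                if t.level == eligible_level]
    if not eligible:                           # every tile at this level has already been split:
        eligible_level += 1                    # move to the next level and make all tiles eligible
        eligible = [(d, t) for d, tiles in enumerate(tilings) for t in tiles]
    d, tile = random.choice(eligible)
    mid = (tile.low + tile.high) / 2.0
    new_tile = Tile(mid, tile.high, tile.level + 1)   # children sit one level above the parent
    tile.high, tile.level = mid, tile.level + 1
    tilings[d].insert(tilings[d].index(tile) + 1, new_tile)
    return eligible_level, d, mid

# Example: two dimensions, each starting with two level-1 tiles, five passes.
tilings = [[Tile(-4, 0), Tile(0, 4)], [Tile(0, 1), Tile(1, 2)]]
level = 1
for _ in range(5):
    level, d, point = feature_size_split(tilings, level)
    print("split dimension", d, "at", point)
```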

To understand how feature-size-based tile coding is performed, an example is provided in Figure 3.3. At the start, the features for both dimensions of the state space are partitioned into two tiles of equal size, and all four tiles are set to level 1. According to the algorithm, only the tiles at the eligible level have to be partitioned; hence the eligible tile set contains only those tiles whose level equals the eligible level. At the start, the eligible tile set contains tiles 1, 2, 3 and 4. At the end of the first pass, tile 4 is selected randomly from the eligible tile set and is split into two tiles (tile 4 and tile 5) of equal size; the levels of both tiles are set to 2. At the end of the second pass, tile 2 is selected randomly from the eligible tile set containing tiles 1, 2 and 3. Tile 2 is split into two tiles (tile 2 and tile 6) of equal size; the levels of both tiles are set to 2. At the end of the third pass, tile 3 is selected randomly from an eligible tile set containing tiles 1 and 3. Tile 3 is split into two tiles (tile 3 and tile 7) of equal size; the levels of both tiles are set to 2. At the end of the fourth pass, the eligible tile set contains only tile 1, and it is selected automatically. Tile 1 is partitioned into two tiles (tile 1 and tile 8) of equal size; the levels of both tiles are set to 2. At the end of the fifth pass, the eligible tile set contains no tiles. As all the tiles from level 1 have already been partitioned and updated to level 2, the eligible level is incremented to 2 and all available tiles (which are now all at level 2) are added to the eligible tile set. Tile 1 is selected randomly from the eligible tile set and is partitioned into tile 1 and tile 9, whose levels are set to 3.

3.2.4 Splitting Based on the Observed Frequency of Input Value

In this method, splitting of the tiles is based on the observed frequency of the states in the state space. At the start, each dimension in the state space is represented using a minimal number of tiles. During each pass, the tile containing the maximum number of observed states is partitioned into two tiles of equal size. The algorithm used for implementing value-based tile coding is given in Table 3.2. It includes a modified form of the Sarsa(λ)

57 algorithm (explained in section 2.3) given in Table 3.3. At the start of a testing phase, the above specified Sarsa( ) algorithm is extended to include an integer array called. The size of is equal to the total number of available tiles. The array is used to store the frequency of observed states, tile wise. At each step, during the testing phase, the value of a cell in with index equal to the tile number which contains the observed state is incremented by one. The frequency of a tile is obtained by adding the frequencies of all observed points represented by the tile. At the end of testing phase, contains frequency of all the observed states, arranged tile wise. The algorithm used to implement in value-based tile coding is provided in Table 3.7. At the start of the algorithm, two variables and are defined to store the point and the tile at which a split would occur; both the variables are initialized to 0. The values of all the cells in are checked to find a cell with maximum frequency. The value of is set to the number of tile corresponding to the cell with maximum frequency. The value of is set to the mid-point of. At the end, is split at into two new tiles. splitting_method() Returns mid-point split at Table 3.7: Algorithm to implement in value-based coding. To illustrate how value-based tile coding is performed, an example with assumed data is provided in Figure 3.4. At the start, features and, in a state space, are partitioned into two tiles of equal size, as shown in Figure 3.4(a). At the end of first pass, cells in 47

are filled with the frequencies of the corresponding tiles, as shown in the table in Figure 3.4(b). It is evident from the table that tile 1 has the highest frequency. Hence tile 1 is partitioned into two tiles of equal size by splitting at its midpoint, as shown in Figure 3.4(c).

Figure 3.4: Example for value-based tile coding: (a) Initial representation of the state space with features X and Y; each feature is represented using two tiles of equal size. X ranges from -1 to 1 and Y ranges from 2 to 6. (b) A table showing each tile number and the observed frequency of points in that tile, after completion of the testing phase. (c) The representation of the state space after splitting tile 1 into two tiles of the same size at its midpoint.
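A minimal Python sketch of the value-based splitting step is shown below: visits are counted per tile during the testing phase, and the most visited tile is split at its mid-point. The tile and counter representations and the example counts are illustrative assumptions, not the data from Figure 3.4.

```python
# Sketch of value-based (frequency-based) splitting.
from collections import Counter

def record_visit(counts, tile_id):
    counts[tile_id] += 1                      # called once per step of the testing phase

def value_based_split(tiles, counts):
    """tiles maps tile_id -> (low, high); counts maps tile_id -> observed frequency."""
    busiest = max(tiles, key=lambda tid: counts.get(tid, 0))
    low, high = tiles[busiest]
    mid = (low + high) / 2.0
    new_id = max(tiles) + 1
    tiles[busiest] = (low, mid)               # the parent keeps the lower half
    tiles[new_id] = (mid, high)               # the new tile takes the upper half
    return busiest, mid

# Illustrative usage: two tiles along one feature, with made-up visit counts.
tiles = {1: (-1.0, 0.0), 2: (0.0, 1.0)}
counts = Counter({1: 140, 2: 37})
print(value_based_split(tiles, counts), tiles)
```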

59 3.2.5 Smart Tile Coding In smart tile coding, splitting of tiles is based on the deviation between predicted Q- values and observed Monte Carlo Estimates. Each pass consists of training, testing and splitting phases. During training, the Q-values of tiles representing the observed states are updated using ε-greedy policy. During testing, the predicted Q-values are used by Q- policy to select the actions. For every observed state, the difference between Monte Carlo Estimate and predicted Q-value is used to guess the error in Q-value approximation (Qdeviation) for the points representing the observed state. At the end of testing, all observed points are grouped tile wise. For each tile, a point with maximum difference between Q-deviation of all the points on the left side and the right side of the point (in the tile) is selected. The tile whose selected point contains maximum Q-deviation difference among all the tiles is split at the selected point into two new tiles. The algorithm used for implementing the RL mechanism of smart tile coding is given in Table 3.2. It includes a modified form of the Sarsa(λ) algorithm (explained in section 2.3) given in Table 3.3. In smart tile coding, require Q-deviation values for all the states observed in testing phase. The Q-deviation of an observed state is obtained by finding the difference between Monte Carlo Estimate and predicted Q-value of the state. The steps used to calculate Q-deviation values are provided in the Table 3.8. A data structure ( ) is used to store the Q-deviation values of all the observed states. In, observed states are saved as keys and the estimated Q-deviations of the corresponding observed states are stored as values for the keys. At the end of each pass, is searched to find a matching key for the current state (. If a match is not found, is added as a key, with value equal to the difference between predicted Q- value ( ) and Monte Carlo Estimate ( ) of. If a match is found, the value of the key is updated to the weighted average of the current value of the key ( and the difference between predicted Q-value ( ) and Monte 49

60 Carlo Estimate ( ) of.the above specified steps are executed at the end of each step during a testing phase, to record the Q-deviation values of the observed states. else Table 3.8: Additional steps required in the Sarsa( ) algorithm to approximate the estimated Q-values deviations during a testing phase. The algorithm used to implement in smart tile coding is provided in Table 3.9. In splitting_method(), the predicted Q-value deviations of observed states are used to select a point for splitting. Initially, all the observed Q-value deviations are arranged tile wise. For each tile with an observed state, a point with maximum difference between Q-value deviations of either sides of the point is selected. The tile which contains a point with maximum value among all the tiles is selected for further division at the point. At the start of the algorithm, two variables and are defined to store the point and the tile at which a split would occur; both the variables are initialized to 0. All observed states are arranged tile wise following the order of value. Starting from the least state, and are calculated tile wise for each observed state. To calculate and for an observed state, only the Q-deviations within the tile containing the observed state are considered. The for a state is defined as the difference between the sum of all the Q- deviations on the left side of the state(including the current state) and the product of mean and number of elements on the left side of the state(including the current state). To calculate for a state, a count of number of states on the left side of the state is required. The number of elements on left side of the state, including the current state is stored in. The for a state is defined as the difference between sum of 50

61 all the errors on the right side of the state and the product of and number of elements on the right side of the state. To calculate for a state, a count of number of states on the right side of the state is required. The number of elements on right side of the state is stored in. Arrange all observed states ( ) tile-wise. for each tile : initialize to. set to,. sort observed features of in ascending order. for each in : number of elements to the left of number of elements to the right of Set to sum of all to the left of and. update to difference of and. set to sum of all to the right of. update to difference of and.., where is the feature next to. split into two tiles at. Table 3.9: Algorithm to implement in smart tile coding. A state with maximum absolute difference between and among all observed states in a state space is selected. The range of points between and next observed state in the same tile is expected to contain most conflicting states on either side of it. The value of is set to the tile which contained the state L. 51

62 The above described method only finds the range of points but not a specific point for splitting. In this work, the mid-point of the obtained range is considered as the value of. For example, in an observed state list for tile, if state is followed by state and the difference between and is maximum at then split point lies in between and in tile. In this work, I have chosen mid-point of and as the split point. The main algorithm splits at into two separate tiles. The exploration, exploitation and splitting phases are repeated with new tiles until the number of splits performed is equal to. To illustrate how smart tile coding is performed, an example with assumed data for a tile is provided in Figure 3.5. In Figure 3.5(a) the approximated Q-value deviations for points representing observed states in a tile are represented using a table, and in Figure 3.2 (b) the difference between Q-deviations of either side of portion between observed consecutive feature pairs is represented using a table. To comprehend how Q-deviation difference is measured for an eligible split range, let us consider the split range which lies in between (. The only observed point to the left of is ; hence is set to and is set to the Q-deviation estimated at, which is. Since of Q-deviations for observed points in the tile is, is updated to, after subtracting from initial. The observed points to the right of are ; hence is set to and is set to the sum of Q-deviations of above points, which is. After subtracting from initial is updated to. The difference in Q-deviations between left, right portions of is equal to the absolute difference between and, which is. The difference between Q-deviations of right side and left side portions of remaining eligible split ranges are also calculated in a similar way. It is evident from table in Figure 3.2(b) that split range contains maximum Q-deviation difference between left side and right side; hence split point for the tile lies in between points representing features and. In this algorithm, I have used the midpoint of selected split range as 52

63 the split point. Hence the split point in this case is 65, which is a midpoint for 60 and 70. In Figure 3.2(c), the tile is split into two new tiles at point. Figure 3.5: Example for smart tile coding: (a) Representation of approximated Q- deviations for observed points in a tile, using a table. In each row, left cell represents an observed point and right cell represents a corresponding estimated Q-deviation. (b) Representation of difference in Q-deviations between right and portions of eligible split ranges using a table. In each row, the left cell represents an eligible split range and right cell represents the corresponding Q-deviation difference. (c) Splitting the parent tile at point 65 into new tiles. 53
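The smart splitting rule described above can be sketched as follows. This Python version records a Q-deviation per observed state and, for each tile, scores every split range between consecutive observed points by the difference between the deviation sums on its two sides, taking the mid-point of the best range as the split point. All identifiers and the blending weight used for repeated observations are my own assumptions.

```python
# Sketch of smart split-point selection from Monte Carlo / predicted Q-value deviations.
def update_deviation(deviations, state, predicted_q, mc_return, weight=0.5):
    """Store (or blend) the estimated Q-deviation for an observed state."""
    d = mc_return - predicted_q
    deviations[state] = weight * deviations[state] + (1 - weight) * d if state in deviations else d

def best_split_in_tile(points, deviations):
    """points: observed feature values in one tile. Returns (score, split_point)."""
    points = sorted(points)
    devs = [deviations[p] for p in points]
    mean = sum(devs) / len(devs)
    best = (float("-inf"), None)
    for i in range(len(points) - 1):                 # split ranges lie between consecutive points
        left = sum(devs[:i + 1]) - mean * (i + 1)
        right = sum(devs[i + 1:]) - mean * (len(devs) - i - 1)
        score = abs(left - right)
        if score > best[0]:
            best = (score, (points[i] + points[i + 1]) / 2.0)   # mid-point of the chosen range
    return best

def smart_split(tile_points, deviations):
    """tile_points maps tile_id -> observed feature values. Returns (tile_id, split_point)."""
    candidates = {tid: best_split_in_tile(pts, deviations)
                  for tid, pts in tile_points.items() if len(pts) >= 2}
    tile_id = max(candidates, key=lambda t: candidates[t][0])
    return tile_id, candidates[tile_id][1]

# Toy usage with assumed deviations at observed points 20, 60, 70 and 90 in one tile:
deviations = {20: 0.5, 60: 0.4, 70: 3.0, 90: 2.8}
print(smart_split({7: [20, 60, 70, 90]}, deviations))   # splits between 60 and 70, at 65.0
```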

3.2.6 Hand-Coded Tiling

In this method, the split points are selected manually at the start of the algorithm. The algorithm used for implementing the RL mechanism of hand-coded tiling is given in Table 3.2. It includes a modified form of the Sarsa(λ) algorithm (explained in section 2.3) given in Table 3.3. At the start, each feature in the state space is represented using a minimal number of tiles. In splitting_method(), the manually provided split points are used to split the tiles. The process of exploration, exploitation and splitting is repeated for P times.
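For completeness, a minimal sketch of hand-coded tiling: the split schedule is fixed in advance and one split is applied per pass. The representation of tiles as (low, high) intervals and the split points shown are placeholders, not the ones used in the experiments reported later.

```python
# Sketch of hand-coded tiling: apply a predetermined split schedule, one split per pass.
def hand_coded_splits(tilings, split_schedule):
    """split_schedule: list of (dimension, point) pairs, applied in order."""
    for dim, point in split_schedule:
        tiles = tilings[dim]
        for i, (low, high) in enumerate(tiles):
            if low < point < high:
                tiles[i] = (low, point)
                tiles.insert(i + 1, (point, high))
                break
        yield tilings            # train/test with the updated tiles before the next split

# Example: X in [-1, 1] and Y in [0, 1], with three manually chosen split points.
tilings = [[(-1.0, 1.0)], [(0.0, 1.0)]]
for _ in hand_coded_splits(tilings, [(0, 0.0), (1, 0.5), (0, -0.5)]):
    pass
print(tilings)
```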

65 CHAPTER 4 Experiment Setup In this chapter I will explain the testing environment used for evaluating different adaptive tile coding mechanisms. In the first section, I will give a brief introduction to RL-Glue, a software tool used to test standard RL problems (Tanner and White, 2009). In the second section, a detailed description about multiple environments that are used for evaluating RL agents is provided. 4.1 RL-Glue RL-Glue is a software tool used to simplify and standardize the implementation of RL experiments. The implementation of each module and establishing a communication mechanism between the modules of an RL framework (discussed in section 2.0) is complex. RL-Glue provides a simple solution to this problem by providing a common platform to implement RL agents, RL environments and RL experiments as isolated modules. It takes care of establishing and maintaining the communication mechanism between the modules. Implementing the agent, environment and experiment using the RL-Glue protocol reduces the complexity and the amount of code considerably. It allows the use of an RL agent developed using a particular policy with a variety of RL environments. In a similar way, it allows the use of a particular RL environment with a variety of RL agents. It is a perfect platform for comparing the performance and efficiency of different agents using a common environment. RL-Glue considers the experiment setting, the environment and the agent as separate modules and requires separate programs to implement each module. A function calling 55

mechanism is used to establish a connection between the modules. Figure 4.1 provides the standard framework used in RL-Glue.

Figure 4.1: The RL-Glue framework, indicating separate modules for the Experiment, Environment and Agent programs. The arrows indicate the flow of information.

4.1.1 The Experiment Module

The experiment module is responsible for starting an experiment, controlling the sequence of interactions between the other modules, extracting the results, evaluating the performance of the agent and ending the experiment. It has to follow the RL-Glue protocol to send a message to either the agent module or the environment module. According to the protocol, it cannot access the functions of the other modules directly and can interact with them only by calling the predefined RL-Glue functions. For example, one predefined call sends a message to the environment module and another sends a message to the agent module; a further call starts an episode and restricts the maximum number of steps allowed in that episode, and another extracts the cumulative reward received for the agent's actions in the most recently finished episode. The general sequence of steps followed in the experiment module is provided in Table 4.1.

At the start of an experiment, the initialization call is made to initialize both the environment and agent modules. To start a new episode, the episode call is made with a parameter equal to the maximum number of steps allowed in the episode.

After this call, control returns to the experiment module only when either a terminal state is reached or the number of steps executed in the episode reaches that limit. The value returned from this sequence is the cumulative reward received by the agent over the completed episode, which is used to evaluate the performance of the agent. At the end of the experiment, a cleanup call triggers cleanup functions in the agent and environment modules to free any resources held by them.

Table 4.1: General sequence of steps followed in the experiment module.

The sequence of steps implemented by RL-Glue during the call that runs an episode is provided in Table 4.2. At the start of a new episode, the environment is asked to set the current state to some random state in the state space. This is followed by a call to the agent with the details of the current state. The agent, using its policy, chooses an action to perform at the current state. That action is passed to the environment, which applies it to the current state and updates the current state to the resulting new state. A terminal flag is set to true if the new state is a terminal state, and the reward for reaching it is computed. A step counter is incremented by 1, and the reward is added to a cumulative-reward variable. If the terminal flag is set or the step counter exceeds the step limit, the cumulative reward is returned to the experiment module. Otherwise, the reward and the new state are sent to the agent, which uses its policy to choose the next action. Except for the first two steps, this process is repeated.

Table 4.2: The sequence of steps followed by RL-Glue during a call to the episode function.
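An experiment module in this style might look like the following sketch, written against the RL-Glue Python codec as I recall its interface (the rlglue.RLGlue module with RL_init, RL_episode, RL_return and RL_cleanup). The constants and structure are illustrative, and the call names should be checked against the codec documentation for your installation.

```python
# Hedged sketch of an RL-Glue experiment program (Python codec style).
import rlglue.RLGlue as RLGlue

MAX_STEPS_PER_EPISODE = 1000
NUM_EPISODES = 100

def run_experiment():
    RLGlue.RL_init()                               # initialize agent and environment
    returns = []
    for _ in range(NUM_EPISODES):
        RLGlue.RL_episode(MAX_STEPS_PER_EPISODE)   # runs until a terminal state or the step limit
        returns.append(RLGlue.RL_return())         # cumulative reward of the finished episode
    RLGlue.RL_cleanup()                            # let agent and environment release resources
    return returns

if __name__ == "__main__":
    rewards = run_experiment()
    print("average return:", sum(rewards) / len(rewards))
```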

During an experiment, the experiment module can send messages to both the agent module and the environment module through the corresponding RL-Glue message calls.

4.1.2 The Environment Module

The environment module is responsible for initializing and maintaining all of the parameters describing a state space. It uses different functions to initialize and update the environment. The RL-Glue functions used by the environment module are listed in Table 4.3. At the start of an experiment, the environment initialization function is called to initialize the environment data and to allocate the required resources.

At the start of every episode, the environment's start function is called to initialize the current state to some random state in the state space. The value of the current state is returned to the agent through RL-Glue. At each step of an episode, the environment's step function is called with the currently selected action as a parameter. The environment performs that action in the current state to get a new state, and the current state is updated to the resulting state. If the new state is a terminal state, the terminal flag is set to true. The environment also computes the reward for moving to the new state. It returns the terminal flag and the reward to RL-Glue, and the current state to the agent through RL-Glue. At the end of an experiment, a cleanup function is called to free any resources allocated to the environment.

- Allocate the required resources.
- Initialize the current state arbitrarily.
- Update the current state to the state obtained after performing the selected action in the current state; set the terminal flag to true if the resulting state is a terminal state; find the reward for reaching the new state and return it to the agent.
- Free the resources.

Table 4.3: The common RL-Glue functions used in the environment module.

4.1.3 Agent

The role of the agent program is to select the action to be taken in the current state. The agent can use either a well-defined policy or just a random number generator to select an action for the current observed state. In general, it uses a policy to keep track of the desirability of all the states in the state space and chooses an action based on the desirability of the following state. The previously discussed adaptive tile coding mechanisms and other tile coding methods are implemented in the agent program.
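A skeleton agent in this style is sketched below. The base class, callback names and loader are my recollection of the RL-Glue Python codec's API and should be verified against the codec documentation; the random action selection is only a placeholder for the tile coding agents of Chapter 3.

```python
# Hedged skeleton of an RL-Glue agent program (Python codec style).
import random
from rlglue.agent.Agent import Agent
from rlglue.agent import AgentLoader
from rlglue.types import Action

class RandomAgent(Agent):
    def agent_init(self, task_spec):
        self.num_actions = 4                  # e.g. the four puddle world actions

    def agent_start(self, observation):
        return self._choose_action()

    def agent_step(self, reward, observation):
        # A learning agent would update its value function here using the reward.
        return self._choose_action()

    def agent_end(self, reward):
        pass                                  # final update for the last action would go here

    def agent_cleanup(self):
        pass

    def agent_message(self, message):
        return ""

    def _choose_action(self):
        action = Action()
        action.intArray = [random.randrange(self.num_actions)]
        return action

if __name__ == "__main__":
    AgentLoader.loadAgent(RandomAgent())
```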

The performance of an agent depends on the policy the agent uses to choose actions. The RL-Glue functions used by the agent module are listed in Table 4.4. At the start of an experiment, the agent initialization function is called to initialize the agent data and to allocate the required resources. At the start of every episode, the agent's start function is called to select an action to be performed in the starting state; the action value is returned to the environment through RL-Glue. At each step of an episode, the agent's step function is called with the current state and the previous reward as parameters. The agent uses the reward to estimate the desirability of the previous action and chooses an action to perform in the current state. At the end of an experiment, a cleanup function is called to free any resources allocated to the agent.

- Allocate the required resources.
- Select an action to perform in the starting state.
- Use the reward to update the desirability of the previous action and choose an action to perform in the current state.
- Free the resources.

Table 4.4: The common RL-Glue functions used in the agent program.

4.2 Environments

Different environments are available in the RL-Glue library for evaluating the performance of an RL agent. In this work, I have used three environments from the RL-Glue library to evaluate the performance of the various tile coding mechanisms that I have implemented: the puddle world problem, the mountain car problem, and the cart pole problem. All the

71 parameters used in these environments are set according to the values given in the RL- Glue library specifications The Puddle World Problem Puddle world is the one of the most basic environments used in evaluating the performance of an RL agent. It is a two-dimensional continuous state space with and parameters. The range of values is from to for both parameters. A portion of the state space in puddle world has puddles with variable depths. The puddles are spread around the line segment which starts at (0.1, 0.75) and ends at (0.45, 0.75), and the line segment which starts at (0.45, 0.4) and ends at (0.45, 0.8). A typical puddle world would look like the one shown in Figure 4.2. Figure 4.2: The standard puddle world problem with two puddles. The values along and are continuous from Actions In the puddle world, four different actions are possible in each state. Given a state, it is the responsibility of an agent to choose one of the four possible actions. Table

represents the numbers associated with each action. According to Table 4.5, 0 represents the action to move right, 1 the action to move left, 2 the action to move up, and 3 the action to move down.

action    number
right     0
left      1
up        2
down      3

Table 4.5: The possible actions and the corresponding integers used in the puddle world environment.

Goal State

An agent is said to have reached a goal state if both its x-position and its y-position are greater than or equal to the goal boundary value (in the upper right corner).

Rewards

In the puddle world, negative rewards are used to help the agent in state-action mapping. An action that does not lead into a puddle receives a reward of -1 if it does not lead to an immediate goal state, and a reward of 0 if it does. If the action taken by the agent leads into a puddle, a large negative reward is received. The size of the negative reward depends on the position of the agent within the puddle: a position in the puddle close to its center receives a larger negative reward than a position in the puddle near its boundary. Hence, the magnitude of the negative reward for a position in the puddle grows as the position moves from the puddle's edge toward its center.
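The reward structure just described can be sketched as follows. The puddle geometry follows the two line segments given earlier; the puddle radius, the penalty scale and the goal threshold are assumptions for illustration, since the RL-Glue environment fixes its own constants.

```python
# Illustrative sketch of the puddle world reward: -1 per step, 0 at the goal,
# plus a penalty that grows the deeper the agent is inside a puddle.
import math

PUDDLE_SEGMENTS = [((0.10, 0.75), (0.45, 0.75)),
                   ((0.45, 0.40), (0.45, 0.80))]
PUDDLE_RADIUS = 0.1      # assumed half-width of each puddle
PENALTY_SCALE = 400.0    # assumed scale of the in-puddle penalty
GOAL_THRESHOLD = 0.95    # assumed goal boundary in the upper right corner

def dist_to_segment(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def reward(state):
    x, y = state
    if x >= GOAL_THRESHOLD and y >= GOAL_THRESHOLD:
        return 0.0                                  # reaching the goal costs nothing
    r = -1.0                                        # step cost everywhere outside the goal
    for a, b in PUDDLE_SEGMENTS:
        d = dist_to_segment(state, a, b)
        if d < PUDDLE_RADIUS:                       # deeper into the puddle, larger penalty
            r -= PENALTY_SCALE * (PUDDLE_RADIUS - d)
    return r

print(reward((0.30, 0.75)), reward((0.20, 0.20)), reward((0.97, 0.96)))
```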

Objective

The goal of the agent is to reach the terminal state from an arbitrary starting position in the fewest possible steps while avoiding the puddles. At each pass, the agent starts from an arbitrary position within the state space. The Q-value of an action at a particular value of one feature need not be constant; it may depend on the value of the other feature. For example, the Q-value for moving right (towards the goal state) should be higher than that of the other possible actions at x equal to 0.4. This does not always hold: if the value of y for the state lies between 0.4 and 0.8, the agent has to move either up or down to avoid the puddle located to the right of the current state. Hence, the agent has to consider the effect of the other features when determining the Q-value of an action for a feature. The maximum number of steps allowed during an episode is limited to 1000; if the agent is unable to reach the terminal state within 1000 steps, it is forced to quit the episode and a new episode is started.

The Mountain Car Problem

The mountain car environment is based on the mountain car task used by Sutton and Barto (1998). In this environment, an agent has to drive a car up a hill to reach a goal position located at the top. The two-dimensional state space has features for the car position and the car velocity, each restricted to a bounded range. The slope of the hill and gravity ensure that the car cannot reach the top simply by applying full forward acceleration from the starting point; it has to build momentum by first moving backward. Only after attaining enough momentum can the car accelerate forward to reach the goal state. During an update, if the position falls below its lower bound, it is reset to that bound.

74 Similarly if v crosses any of its boundaries, it is reset to the nearest range limit. Figure 4.3 shows a typical two-dimensional mountain car problem. Figure 4.3: The standard view of mountain car environment. The goal of an agent is to drive the car up to the goal position. The inelastic wall ensures that car stays within the boundaries. action number forward 2 neutral 1 backward 0 Table 4.6: The possible actions and the corresponding numbers in the mountain car environment Actions Three different actions are possible at every given state in a mountain car problem. An agent can drive a car forward, backward and keep neutral. An agent has to choose one of 64

75 the above three actions at each state. A set of integer values are used to represent the actions, Table 4.6 shows the possible actions and their corresponding integers. The forward action pushes car towards the goal state, the backward action pushes car away from a goal state, and neutral action does not exert any external force to the car Goal State An agent successfully completes an episode if it can drive the car to a position where greater than or equal to (at the top of the hill): is Rewards In the mountain car problem, a negative reward is used to help an agent in state-action mapping. An agent is awarded for each step in an episode. An agent receives more negative reward if it takes more steps to complete the episode, and receives less negative reward if it takes fewer steps to complete the episode. It forces the agent to learn how to reach the goal state in less number of steps, in order to get fewer negative rewards Objective An agent has to learn how to drive a car to the top of the hill from any given arbitrary position in least possible steps possible from the starting position. The mountain car is a perfect RL problem to evaluate the performance of an RL agent. Here, the Q-value of a feature changes with the value of another feature. For example, the Q-value for a forward action and backward action at is equal to depending on the value of If the value of is near to the upper limit, the Q-value for the forward action is higher when compared to the backward action and vice versa if the value of is near to lower limit. 65
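For reference, the following sketch implements the classic mountain car dynamics from Sutton and Barto (1998) that this environment is based on; the exact constants, bounds and reward convention in the RL-Glue version may differ slightly. Actions follow Table 4.6: 0 = backward, 1 = neutral, 2 = forward.

```python
# Sketch of the classic mountain car dynamics (constants from the standard formulation).
import math

X_MIN, X_MAX = -1.2, 0.6
V_MIN, V_MAX = -0.07, 0.07
GOAL_POSITION = 0.5

def step(x, v, action):
    v += 0.001 * (action - 1) - 0.0025 * math.cos(3 * x)   # throttle minus gravity along the slope
    v = max(V_MIN, min(V_MAX, v))                           # clip velocity to its range
    x += v
    if x < X_MIN:                                           # inelastic wall at the left boundary
        x, v = X_MIN, 0.0
    done = x >= GOAL_POSITION
    return x, v, -1.0, done                                 # -1 reward for every step taken

# Energy-pumping demo: accelerate in the direction of the current velocity.
x, v, steps, done = -0.5, 0.0, 0, False
while not done and steps < 10000:
    x, v, r, done = step(x, v, 2 if v >= 0 else 0)
    steps += 1
print("reached goal:", done, "in", steps, "steps")
```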

Generalizing the state space for the mountain car problem is not trivial. Depending on the car's position and velocity, the agent may have to learn to drive the car away from the goal state at first to build momentum, and only then drive towards the goal state. The maximum number of steps allowed during an episode is limited; if the agent cannot reach the goal state within that limit, it is forced to quit the episode.

The Cart Pole Problem

The cart pole problem is based on the pole-balancing task used by Sutton and Barto (1998). In this environment, an agent has to control a cart moving along a bounded, level track so that a pole attached perpendicularly at the center of the cart stays within a certain angle range, which keeps the pole balanced. The four-dimensional continuous state space has features for the cart position, the pole angle, the cart velocity, and the pole's angular velocity. The cart position and the pole angle are each restricted to a bounded range, and the balance of the pole is considered lost if the cart position moves outside its range or if the pole angle (in degrees) moves outside its allowed range. Figure 4.4 shows a typical four-dimensional cart pole problem.

Figure 4.4: The standard view of the cart pole balance environment.


More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

WestminsterResearch

WestminsterResearch WestminsterResearch http://www.westminster.ac.uk/research/westminsterresearch Reinforcement learning in continuous state- and action-space Barry D. Nichols Faculty of Science and Technology This is an

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

Markov Decision Processes (MDPs) (cont.)

Markov Decision Processes (MDPs) (cont.) Markov Decision Processes (MDPs) (cont.) Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University November 29 th, 2007 Markov Decision Process (MDP) Representation State space: Joint state x

More information

A thesis submitted in partial fulfilment of the requirements for the Degree. of Master of Engineering in Mechanical Engineering

A thesis submitted in partial fulfilment of the requirements for the Degree. of Master of Engineering in Mechanical Engineering MACHINE LEARNING FOR INTELLIGENT CONTROL: APPLICATION OF REINFORCEMENT LEARNING TECHNIQUES TO THE DEVELOPMENT OF FLIGHT CONTROL SYSTEMS FOR MINIATURE UAV ROTORCRAFT A thesis submitted in partial fulfilment

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation

Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation POLISH MARITIME RESEARCH Special Issue S1 (74) 2012 Vol 19; pp. 31-36 10.2478/v10012-012-0020-8 Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation Andrzej Rak,

More information

CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION

CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION 131 CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION 6.1 INTRODUCTION The Orthogonal arrays are helpful in guiding the heuristic algorithms to obtain a good solution when applied to NP-hard problems. This

More information

Human-level Control Through Deep Reinforcement Learning (Deep Q Network) Peidong Wang 11/13/2015

Human-level Control Through Deep Reinforcement Learning (Deep Q Network) Peidong Wang 11/13/2015 Human-level Control Through Deep Reinforcement Learning (Deep Q Network) Peidong Wang 11/13/2015 Content Demo Framework Remarks Experiment Discussion Content Demo Framework Remarks Experiment Discussion

More information

Gradient Reinforcement Learning of POMDP Policy Graphs

Gradient Reinforcement Learning of POMDP Policy Graphs 1 Gradient Reinforcement Learning of POMDP Policy Graphs Douglas Aberdeen Research School of Information Science and Engineering Australian National University Jonathan Baxter WhizBang! Labs July 23, 2001

More information

An Actor-Critic Algorithm using a Binary Tree Action Selector

An Actor-Critic Algorithm using a Binary Tree Action Selector Trans. of the Society of Instrument and Control Engineers Vol.E-4, No.1, 1/9 (27) An Actor-Critic Algorithm using a Binary Tree Action Selector Reinforcement Learning to Cope with Enormous Actions Hajime

More information

A Symmetric Multiprocessor Architecture for Multi-Agent Temporal Difference Learning

A Symmetric Multiprocessor Architecture for Multi-Agent Temporal Difference Learning A Symmetric Multiprocessor Architecture for Multi-Agent Temporal Difference Learning Scott Fields, Student Member, IEEE, Itamar Elhanany, Senior Member, IEEE Department of Electrical & Computer Engineering

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis. Klaus Speer Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

More information

Value Iteration. Reinforcement Learning: Introduction to Machine Learning. Matt Gormley Lecture 23 Apr. 10, 2019

Value Iteration. Reinforcement Learning: Introduction to Machine Learning. Matt Gormley Lecture 23 Apr. 10, 2019 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Reinforcement Learning: Value Iteration Matt Gormley Lecture 23 Apr. 10, 2019 1

More information

Rough Sets-based Prototype Optimization in Kanerva-based Function Approximation

Rough Sets-based Prototype Optimization in Kanerva-based Function Approximation 215 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Rough Sets-based Prototype Optimization in Kanerva-based Function Approximation Cheng Wu School of Urban Rail

More information

Hierarchical Reinforcement Learning for Robot Navigation

Hierarchical Reinforcement Learning for Robot Navigation Hierarchical Reinforcement Learning for Robot Navigation B. Bischoff 1, D. Nguyen-Tuong 1,I-H.Lee 1, F. Streichert 1 and A. Knoll 2 1- Robert Bosch GmbH - Corporate Research Robert-Bosch-Str. 2, 71701

More information

Machine Learning Reliability Techniques for Composite Materials in Structural Applications.

Machine Learning Reliability Techniques for Composite Materials in Structural Applications. Machine Learning Reliability Techniques for Composite Materials in Structural Applications. Roberto d Ippolito, Keiichi Ito, Silvia Poles, Arnaud Froidmont Noesis Solutions Optimus by Noesis Solutions

More information

Brainstormers Team Description

Brainstormers Team Description Brainstormers 2003 - Team Description M. Riedmiller, A. Merke, M. Nickschas, W. Nowak, and D. Withopf Lehrstuhl Informatik I, Universität Dortmund, 44221 Dortmund, Germany Abstract. The main interest behind

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

Hierarchical Assignment of Behaviours by Self-Organizing

Hierarchical Assignment of Behaviours by Self-Organizing Hierarchical Assignment of Behaviours by Self-Organizing W. Moerman 1 B. Bakker 2 M. Wiering 3 1 M.Sc. Cognitive Artificial Intelligence Utrecht University 2 Intelligent Autonomous Systems Group University

More information

Using Machine Learning Techniques for Autonomous Planning and Navigation with Groups of Unmanned Vehicles

Using Machine Learning Techniques for Autonomous Planning and Navigation with Groups of Unmanned Vehicles Using Machine Learning Techniques for Autonomous Planning and Navigation with Groups of Unmanned Vehicles Gerben Bergwerff July 19, 2016 Master s Thesis Department of Artificial Intelligence, University

More information

Picture Maze Generation by Repeated Contour Connection and Graph Structure of Maze

Picture Maze Generation by Repeated Contour Connection and Graph Structure of Maze Computer Science and Engineering 2013, 3(3): 76-83 DOI: 10.5923/j.computer.20130303.04 Picture Maze Generation by Repeated Contour Connection and Graph Structure of Maze Tomio Kurokawa Department of Information

More information

When Network Embedding meets Reinforcement Learning?

When Network Embedding meets Reinforcement Learning? When Network Embedding meets Reinforcement Learning? ---Learning Combinatorial Optimization Problems over Graphs Changjun Fan 1 1. An Introduction to (Deep) Reinforcement Learning 2. How to combine NE

More information

The Fly & Anti-Fly Missile

The Fly & Anti-Fly Missile The Fly & Anti-Fly Missile Rick Tilley Florida State University (USA) rt05c@my.fsu.edu Abstract Linear Regression with Gradient Descent are used in many machine learning applications. The algorithms are

More information

Reinforcement Learning and Shape Grammars

Reinforcement Learning and Shape Grammars Reinforcement Learning and Shape Grammars Technical report Author Manuela Ruiz Montiel Date April 15, 2011 Version 1.0 1 Contents 0. Introduction... 3 1. Tabular approach... 4 1.1 Tabular Q-learning...

More information

Approximate Q-Learning 3/23/18

Approximate Q-Learning 3/23/18 Approximate Q-Learning 3/23/18 On-Policy Learning (SARSA) Instead of updating based on the best action from the next state, update based on the action your current policy actually takes from the next state.

More information

arxiv: v1 [cs.cv] 2 Sep 2018

arxiv: v1 [cs.cv] 2 Sep 2018 Natural Language Person Search Using Deep Reinforcement Learning Ankit Shah Language Technologies Institute Carnegie Mellon University aps1@andrew.cmu.edu Tyler Vuong Electrical and Computer Engineering

More information

Line Segment Based Watershed Segmentation

Line Segment Based Watershed Segmentation Line Segment Based Watershed Segmentation Johan De Bock 1 and Wilfried Philips Dep. TELIN/TW07, Ghent University Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium jdebock@telin.ugent.be Abstract. In this

More information

Convexization in Markov Chain Monte Carlo

Convexization in Markov Chain Monte Carlo in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non

More information

Self-Organization of Place Cells and Reward-Based Navigation for a Mobile Robot

Self-Organization of Place Cells and Reward-Based Navigation for a Mobile Robot Self-Organization of Place Cells and Reward-Based Navigation for a Mobile Robot Takashi TAKAHASHI Toshio TANAKA Kenji NISHIDA Takio KURITA Postdoctoral Research Fellow of the Japan Society for the Promotion

More information

Advanced Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras

Advanced Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Advanced Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Lecture 16 Cutting Plane Algorithm We shall continue the discussion on integer programming,

More information

CS4758: Rovio Augmented Vision Mapping Project

CS4758: Rovio Augmented Vision Mapping Project CS4758: Rovio Augmented Vision Mapping Project Sam Fladung, James Mwaura Abstract The goal of this project is to use the Rovio to create a 2D map of its environment using a camera and a fixed laser pointer

More information

Neuro-Dynamic Programming An Overview

Neuro-Dynamic Programming An Overview 1 Neuro-Dynamic Programming An Overview Dimitri Bertsekas Dept. of Electrical Engineering and Computer Science M.I.T. May 2006 2 BELLMAN AND THE DUAL CURSES Dynamic Programming (DP) is very broadly applicable,

More information

Data Mining and Data Warehousing Classification-Lazy Learners

Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Introduction to Deep Q-network

Introduction to Deep Q-network Introduction to Deep Q-network Presenter: Yunshu Du CptS 580 Deep Learning 10/10/2016 Deep Q-network (DQN) Deep Q-network (DQN) An artificial agent for general Atari game playing Learn to master 49 different

More information

Instance-Based Action Models for Fast Action Planning

Instance-Based Action Models for Fast Action Planning Instance-Based Action Models for Fast Action Planning Mazda Ahmadi and Peter Stone Department of Computer Sciences The University of Texas at Austin 1 University Station C0500, Austin, TX 78712-0233 Email:{mazda,pstone}@cs.utexas.edu

More information

CONCENTRATIONS: HIGH-PERFORMANCE COMPUTING & BIOINFORMATICS CYBER-SECURITY & NETWORKING

CONCENTRATIONS: HIGH-PERFORMANCE COMPUTING & BIOINFORMATICS CYBER-SECURITY & NETWORKING MAJOR: DEGREE: COMPUTER SCIENCE MASTER OF SCIENCE (M.S.) CONCENTRATIONS: HIGH-PERFORMANCE COMPUTING & BIOINFORMATICS CYBER-SECURITY & NETWORKING The Department of Computer Science offers a Master of Science

More information

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go ConvolutionalRestrictedBoltzmannMachineFeatures fortdlearningingo ByYanLargmanandPeterPham AdvisedbyHonglakLee 1.Background&Motivation AlthoughrecentadvancesinAIhaveallowed Go playing programs to become

More information

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho Partially Observable Markov Decision Processes Silvia Cruciani João Carvalho MDP A reminder: is a set of states is a set of actions is the state transition function. is the probability of ending in state

More information

Towards Traffic Anomaly Detection via Reinforcement Learning and Data Flow

Towards Traffic Anomaly Detection via Reinforcement Learning and Data Flow Towards Traffic Anomaly Detection via Reinforcement Learning and Data Flow Arturo Servin Computer Science, University of York aservin@cs.york.ac.uk Abstract. Protection of computer networks against security

More information

Programming Reinforcement Learning in Jason

Programming Reinforcement Learning in Jason Programming Reinforcement Learning in Jason Amelia Bădică 1, Costin Bădică 1, Mirjana Ivanović 2 1 University of Craiova, Romania 2 University of Novi Sad, Serbia Talk Outline Introduction, Motivation

More information

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES K. P. M. L. P. Weerasinghe 149235H Faculty of Information Technology University of Moratuwa June 2017 AUTOMATED STUDENT S

More information

Clustering with Reinforcement Learning

Clustering with Reinforcement Learning Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of

More information

Online Graph Exploration

Online Graph Exploration Distributed Computing Online Graph Exploration Semester thesis Simon Hungerbühler simonhu@ethz.ch Distributed Computing Group Computer Engineering and Networks Laboratory ETH Zürich Supervisors: Sebastian

More information

Lab 9. Julia Janicki. Introduction

Lab 9. Julia Janicki. Introduction Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support

More information

Reinforcement Learning in Multi-dimensional State-Action Space using Random Rectangular Coarse Coding and Gibbs Sampling

Reinforcement Learning in Multi-dimensional State-Action Space using Random Rectangular Coarse Coding and Gibbs Sampling SICE Annual Conference 27 Sept. 17-2, 27, Kagawa University, Japan Reinforcement Learning in Multi-dimensional State-Action Space using Random Rectangular Coarse Coding and Gibbs Sampling Hajime Kimura

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

STABILIZED FINITE ELEMENT METHODS FOR INCOMPRESSIBLE FLOWS WITH EMPHASIS ON MOVING BOUNDARIES AND INTERFACES

STABILIZED FINITE ELEMENT METHODS FOR INCOMPRESSIBLE FLOWS WITH EMPHASIS ON MOVING BOUNDARIES AND INTERFACES STABILIZED FINITE ELEMENT METHODS FOR INCOMPRESSIBLE FLOWS WITH EMPHASIS ON MOVING BOUNDARIES AND INTERFACES A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Marek

More information

Q-learning with linear function approximation

Q-learning with linear function approximation Q-learning with linear function approximation Francisco S. Melo and M. Isabel Ribeiro Institute for Systems and Robotics [fmelo,mir]@isr.ist.utl.pt Conference on Learning Theory, COLT 2007 June 14th, 2007

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

Path Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning

Path Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning 674 International Journal Jung-Jun of Control, Park, Automation, Ji-Hun Kim, and and Systems, Jae-Bok vol. Song 5, no. 6, pp. 674-680, December 2007 Path Planning for a Robot Manipulator based on Probabilistic

More information

Operations Research and Optimization: A Primer

Operations Research and Optimization: A Primer Operations Research and Optimization: A Primer Ron Rardin, PhD NSF Program Director, Operations Research and Service Enterprise Engineering also Professor of Industrial Engineering, Purdue University Introduction

More information

CSE446: Linear Regression. Spring 2017

CSE446: Linear Regression. Spring 2017 CSE446: Linear Regression Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Prediction of continuous variables Billionaire says: Wait, that s not what I meant! You say: Chill

More information

10703 Deep Reinforcement Learning and Control

10703 Deep Reinforcement Learning and Control 10703 Deep Reinforcement Learning and Control Russ Salakhutdinov Machine Learning Department rsalakhu@cs.cmu.edu Policy Gradient II Used Materials Disclaimer: Much of the material and slides for this lecture

More information

Classification of Optimization Problems and the Place of Calculus of Variations in it

Classification of Optimization Problems and the Place of Calculus of Variations in it Lecture 1 Classification of Optimization Problems and the Place of Calculus of Variations in it ME256 Indian Institute of Science G. K. Ananthasuresh Professor, Mechanical Engineering, Indian Institute

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Shortest-path calculation of first arrival traveltimes by expanding wavefronts

Shortest-path calculation of first arrival traveltimes by expanding wavefronts Stanford Exploration Project, Report 82, May 11, 2001, pages 1 144 Shortest-path calculation of first arrival traveltimes by expanding wavefronts Hector Urdaneta and Biondo Biondi 1 ABSTRACT A new approach

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

Approximate Linear Successor Representation

Approximate Linear Successor Representation Approximate Linear Successor Representation Clement A. Gehring Computer Science and Artificial Intelligence Laboratory Massachusetts Institutes of Technology Cambridge, MA 2139 gehring@csail.mit.edu Abstract

More information