
Reinforcement Learning for Robot Control. Marco Wiering, Intelligent Systems Group, Utrecht University. marco@cs.uu.nl, 22-11-2004

Introduction. Robots move in the physical environment to perform tasks. The environment is continuous and uncertain, and programming a robot controller is often difficult. Reinforcement learning methods are useful for learning controllers from trial and error. We will examine how reinforcement learning can be used for optimizing robot controllers.

Contents of this presentation: Robot Control Problems, Dynamic Programming, Reinforcement Learning, Reinforcement Learning with Function Approximators, Experimental Results, Discussion.

Robotics. A robot is an active, moving artificial agent that lives in the physical environment. We concentrate on autonomous robots: robots that make their own decisions using feedback from their sensors. The problem in designing robot controllers is that the real world is highly demanding: (1) it cannot be perfectly and completely perceived, (2) it is non-deterministic, (3) it is dynamic, and (4) it is continuous.

Robot Applications. Robots are used for many different tasks: manufacturing (repetitive tasks on a production belt in the car and micro-electronics industries), the construction industry (e.g. shearing sheep wool), security and post delivery in buildings, unmanned cars, underwater vehicles and unmanned aeroplanes, and rescue operations (e.g. after earthquakes).

Robot Control Problems. Robot control problems can be divided into: manipulation, changing the position of objects using e.g. a robot arm; locomotion, changing the physical position of the robot by driving (or walking) around; and manipulation combined with locomotion, doing both things at the same time (e.g. in robot soccer).

Locomotion with legs. Most animals use legs for locomotion. Controlling legs is harder than controlling wheels, however. The advantage of using legs is that locomotion on rough surfaces (e.g. with many stones or stairs) becomes possible. The Ambler robot (1992) has 6 legs, is almost 10 meters in size, and can climb obstacles of up to 2 meters. The Ambler robot is a statically stable walker, like an AIBO robot. Such robots can stand still, but are usually quite slow and consume a lot of energy.

Locomotion with wheels. Locomotion with wheels is most practical for most environments. Wheels are easier to build and more efficient (stable, and faster). Robots with wheels also have problems: a car's configuration is described by its position (x, y) and its orientation Φ. A car can drive to every position and take every orientation, so there are three degrees of freedom. But we can only steer and drive, so there are only two actuators.

A non-holonomic robot has fewer degrees of freedom for control than total degrees of freedom. A holonomic robot has as many degrees of freedom for control as positional degrees of freedom.

Manipulation with robots. Manipulators are effectors which can move objects in the environment. Kinematics is the study of how movements of the actuators correspond to movements of parts of the robot in physical space. Most manipulators allow rotating or linear (translating) movements.

Control for Locomotion. The goal of locomotion is usually to find a minimal-cost path from one state in the environment to another. The difficulty is that the robot is often uncertain about its own position. Path-planning problems with uncertainty are often cast in the framework of partially observable Markov decision processes (POMDPs). Since solving even simplified instances of POMDPs is NP-hard, heuristic algorithms are often used.

Control for Manipulation. The goal of manipulation is often to grasp an object and to put (or fix) it in another place. The good thing about manipulation is that usually the complete environment can be observed. The difficulty is that there are many degrees of freedom to be controlled. Inverse kinematics can be used for simple (e.g. empty) environments, but learning methods can work better when many obstacles are present.

Introduction to Reinforcement Learning. Supervised learning: learning from data which are all labelled with the desired outcome/action. Reinforcement learning: learning by trial and error; the agent interacts with the environment and may receive reward or punishment. Example: navigating a robot, with a reward if the desired position is reached and a punishment if the robot makes a collision. Example: playing chess, checkers, backgammon, etc., with a reward if the game is won and a punishment if the game is lost.

Reinforcement Learning principles. Learn to control an agent by trying out actions and using the obtained feedback (rewards) to strengthen (reinforce) the agent's behaviour. The agent interacts with the environment by using its (virtual) sensors and effectors. The reward function determines which behaviour of the agent is most desired. [Diagram: the agent sends actions to the environment and receives inputs and rewards in return.]

Some applications of Reinforcement Learning: game playing (checkers, backgammon, chess), elevator control, robot control, combinatorial optimisation, simulated robot soccer, network routing, and traffic control.

Convergence issues in Reinforcement Learning. Reinforcement learning algorithms with lookup tables (tabular RL) have been proved to converge to an optimal policy after an infinite number of experiences. Reinforcement learning with function approximators has sometimes obtained excellent results (e.g. Tesauro's TD-Gammon). However, several studies have shown that RL with function approximators may diverge (to infinite parameter values).

Markov Decision Problems. A Markov decision problem (MDP) consists of: a finite set of states $S = \{s_1, s_2, \ldots, s_n\}$; a finite set of actions $A$; transition probabilities $P(i,a,j)$, the probability of making a step to state $j$ if action $a$ is selected in state $i$; a reward function $R(i,a,j)$, the reward for making a transition from state $i$ to state $j$ by executing action $a$; and a discount parameter $\gamma$ for future rewards ($0 \leq \gamma \leq 1$).
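To make these components concrete, here is a minimal sketch (not from the talk) of how such an MDP could be stored in Python. The two-state example, the field names, and the numbers are purely illustrative assumptions.

```python
# A minimal MDP container: states, actions, transition probabilities
# P[(i, a, j)], rewards R[(i, a, j)] and a discount factor gamma.
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    P: dict        # (i, a, j) -> transition probability
    R: dict        # (i, a, j) -> reward (unspecified entries are 0)
    gamma: float   # discount factor, 0 <= gamma <= 1

# Hypothetical two-state example, for illustration only.
example = MDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    P={("s1", "stay", "s1"): 1.0,
       ("s1", "go", "s2"): 0.9, ("s1", "go", "s1"): 0.1,
       ("s2", "stay", "s2"): 1.0,
       ("s2", "go", "s1"): 1.0},
    R={("s1", "go", "s2"): 1.0},
    gamma=0.9,
)
```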

Dynamic Programming and Value Functions (1). The value of a state, $V^\pi(s)$, is defined as the expected cumulative discounted future reward when starting in state $s$ and following policy $\pi$: $V^\pi(s) = E(\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, \pi)$. The optimal policy is the one which has the largest state-value in all states: $\pi^* = \arg\max_\pi V^\pi$.

Dynamic Programming and Value Functions (2). We also define the Q-value of a state-action pair as: $Q^\pi(s,a) = E(\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a, \pi)$. The Bellman optimality equation relates a state-action value of an optimal Q-function to other optimal state-values: $Q^*(s,a) = \sum_{s'} P(s,a,s')\,(R(s,a,s') + \gamma V^*(s'))$.

Dynamic Programming and Value Functions (3). The Bellman equation has led to very efficient dynamic programming (DP) algorithms. Value iteration computes an optimal policy for a known Markov decision problem using: $Q^{k+1}(s,a) := \sum_{s'} P(s,a,s')\,(R(s,a,s') + \gamma V^k(s'))$, where $V^k(s) = \max_a Q^k(s,a)$. It can easily be shown that $\lim_{k \to \infty} Q^k = Q^*$.
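A compact sketch of tabular value iteration following these equations. The dictionary-based representation (as in the container sketched above) and the stopping tolerance are my own choices for illustration, not part of the talk.

```python
# Tabular value iteration for a known MDP given as plain dictionaries
# P[(s, a, s2)] and R[(s, a, s2)]; missing entries are treated as 0.
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
        delta = 0.0
        for s in states:
            for a in actions:
                # Q_{k+1}(s,a) = sum_{s'} P(s,a,s') (R(s,a,s') + gamma * V_k(s'))
                q_new = sum(P.get((s, a, s2), 0.0) *
                            (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2 in states)
                delta = max(delta, abs(q_new - Q[(s, a)]))
                Q[(s, a)] = q_new
        if delta < tol:          # stop once the sweep no longer changes Q
            return Q
```

The greedy policy is then obtained by taking, in each state, the action with the highest Q-value.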

Reinforcement Learning. Dynamic programming is not applicable when: the Markov decision problem is unknown; there are too many states (often caused by many state variables, leading to Bellman's curse of dimensionality); or there are continuous states or actions. For such cases we can use reinforcement learning algorithms. RL algorithms learn from interacting with an environment and can be combined with function approximators.

RL Algorithms: Q-learning. A well-known RL algorithm is Q-learning (Watkins, 1989), which updates the Q-function after an experience $(s_t, a_t, r_t, s_{t+1})$ as follows: $Q(s_t,a_t) := Q(s_t,a_t) + \alpha\,(r_t + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t))$, where $0 < \alpha \leq 1$ is the learning rate. Q-learning is an off-policy RL algorithm, meaning that it learns about the optimal value function while following another, behavioural policy. Tabular Q-learning converges to the optimal policy after an infinite number of experiences.
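A minimal sketch of this tabular update in Python; the defaultdict storage and the ε-greedy helper used as a behavioural policy are illustrative assumptions, not part of the talk.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # tabular Q-values: Q[(state, action)], default 0.0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    # Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(s, actions, epsilon=0.1):
    # Behavioural policy: explore with probability epsilon, act greedily otherwise.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```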

RL Algorithms: SARSA. SARSA (State-Action-Reward-State-Action) is an example of an on-policy RL algorithm: $Q(s_t,a_t) := Q(s_t,a_t) + \alpha\,(r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))$. Since SARSA is an on-policy algorithm, it learns the Q-function from the trajectories generated by the behavioural policy. To make SARSA converge, the exploration policy should be GLIE (Greedy in the Limit of Infinite Exploration); see (Singh et al., 2000).
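For contrast with the off-policy update above, a sketch of the on-policy SARSA update inside an episode loop. The `env_step` and `policy` callables are hypothetical placeholders standing in for any environment model and GLIE exploration policy.

```python
def sarsa_episode(Q, s0, env_step, policy, alpha=0.1, gamma=0.95):
    # Q: dict-like mapping (state, action) -> value, e.g. defaultdict(float).
    # env_step(s, a) -> (reward, next_state, done); policy(Q, s) -> action.
    s, a = s0, policy(Q, s0)
    done = False
    while not done:
        r, s_next, done = env_step(s, a)
        a_next = policy(Q, s_next)               # the action actually taken next
        target = r if done else r + gamma * Q[(s_next, a_next)]
        # Q(s,a) := Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
    return Q
```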

Reinforcement Learning with Function Approximators. To learn value functions for problems with many (or continuous) state variables, we have to combine reinforcement learning with function approximators. We concentrate on linear function approximators using a parameter vector $\theta$. The value of a state-action pair is: $Q(s,a) = \sum_i \theta_{i,a}\,\phi_i(s)$, where the state vector $s_t$ received by the agent at time $t$ is mapped onto a feature vector $\phi(s_t)$.

Linear Function Approximators. The linear function approximator looks as follows: [Diagram: the input (state) is mapped by a fixed method onto a feature vector, which is combined with learnable weights to produce the action values.]

Standard Q-learning with Linear FAs. When standard Q-learning is used, we can update the parameter vector as follows: $\theta_{i,a_t} := \theta_{i,a_t} + \alpha\,(r_t + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t))\,\phi_i(s_t)$. Q-learning using this update rule may diverge to infinite values.
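A sketch of this parameter update with NumPy, assuming one weight column per action (shape: features × actions) and a feature vector already computed for each state; the names and shapes are my own, not the talk's.

```python
import numpy as np

def q_value(theta, phi_s, a):
    # Q(s, a) = sum_i theta[i, a] * phi_i(s)
    return float(theta[:, a] @ phi_s)

def q_learning_linear_update(theta, phi_s, a, r, phi_s_next,
                             alpha=0.01, gamma=0.95):
    # theta[i, a] += alpha * (r + gamma * max_b Q(s', b) - Q(s, a)) * phi_i(s)
    q_next = float(np.max(theta.T @ phi_s_next))
    td_error = r + gamma * q_next - q_value(theta, phi_s, a)
    theta[:, a] += alpha * td_error * phi_s
    return theta
```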

Standard SARSA with Linear FAs. When standard SARSA is used, we use the following update: $\theta_{i,a_t} := \theta_{i,a_t} + \alpha\,(r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))\,\phi_i(s_t)$. Gordon (2002) proved that with this update the parameters converge for a fixed policy, and stay within a bounded region for a changing policy. Perkins and Precup (2002) proved that the parameters converge if the policy improvement operator produces $\epsilon$-soft policies and is Lipschitz continuous in the action values with a constant that is not too large.

Averaging RL with Linear FAs. In averaging RL, updates do not use the difference between two successor state-values, but the difference between the successor state-value and the parameter's own value. Reynolds (2002) has shown that averaging RL will not diverge if the feature vector for each state is normalised; the value functions will remain within some bounded region. We assume normalised feature vectors, so we write $\bar\phi_i(s)$ for the normalised feature vector computed as: $\bar\phi_i(s) = \phi_i(s) / \sum_j \phi_j(s)$.

Averaging Q-learning with Linear FAs. The averaging Q-learning rule looks as follows, for all $i$: $\theta_{i,a_t} := \theta_{i,a_t} + \alpha\,(r_t + \gamma \max_a Q(s_{t+1},a) - \theta_{i,a_t})\,\bar\phi_i(s_t)$. The $\theta$ parameters each learn towards the desired value and do not cooperate in the learning updates. Under a fixed stationary distribution with interpolative function approximators, averaging Q-learning has been proved to converge (Szepesvari and Smart, 2004).
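A sketch of the averaging update next to the standard one, assuming the same per-action weight layout as in the earlier sketch and the normalisation defined on the previous slide; again, names and shapes are mine.

```python
import numpy as np

def normalise(phi_s):
    # phi_bar_i(s) = phi_i(s) / sum_j phi_j(s)
    return phi_s / phi_s.sum()

def averaging_q_update(theta, phi_s, a, r, phi_s_next, alpha=0.1, gamma=0.95):
    phi_bar = normalise(phi_s)
    q_next = float(np.max(theta.T @ normalise(phi_s_next)))
    # Each parameter moves towards the target independently ("non-cooperative"):
    # theta[i, a] += alpha * (r + gamma * max_b Q(s', b) - theta[i, a]) * phi_bar_i(s)
    theta[:, a] += alpha * (r + gamma * q_next - theta[:, a]) * phi_bar
    return theta
```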

Sample-based Dynamic Programming. If we have a model of the environment, we can also use sample-based dynamic programming, as proposed by Boyan and Moore (1995) and Gordon (1995). Sample-based value iteration goes as follows. First we select a subset of states $S_0 \subseteq S$. Often $S_0$ is chosen arbitrarily, but in general $S_0$ should be large enough and spread over the state space. Then we use the states $s \in S_0$ for updating the value function using a model of the environment.

Averaging Q-value iteration with linear FAs. Averaging sample-based DP uses the following update, where $T(\cdot)$ is the backup operator: $T(\theta_{i,a}) = \frac{1}{\sum_{s \in S_0} \bar\phi_i(s)} \sum_{s \in S_0} \bar\phi_i(s) \sum_{s'} P(s,a,s')\,(R(s,a,s') + \gamma \max_b \sum_j \bar\phi_j(s')\,\theta_{j,b})$. This DP update rule is obtained from the averaging Q-learning rule with linear function approximators (where we set the learning rate $\alpha$ to 1).
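A sketch of one synchronous sweep of this backup operator over a sample set S0, assuming a model callable `model(s, a)` that returns (next_state, probability, reward) triples and a `phi_bar(s)` callable returning the normalised feature vector; all names are illustrative.

```python
import numpy as np

def averaging_backup_sweep(theta, S0, actions, model, phi_bar, gamma):
    # theta has shape (n_features, n_actions); model(s, a) -> [(s2, prob, reward), ...]
    new_theta = np.array(theta, dtype=float)
    n_features = theta.shape[0]
    for a_idx, a in enumerate(actions):
        for i in range(n_features):
            num, den = 0.0, 0.0
            for s in S0:
                f = phi_bar(s)[i]
                if f == 0.0:
                    continue
                # expected one-step target: sum_{s'} P(s,a,s') (R + gamma * max_b Q(s',b))
                target = sum(p * (r + gamma * float(np.max(theta.T @ phi_bar(s2))))
                             for (s2, p, r) in model(s, a))
                num += f * target
                den += f
            if den > 0.0:
                # T(theta[i,a]) = weighted average of the targets, weighted by phi_bar_i(s)
                new_theta[i, a_idx] = num / den
    return new_theta
```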

Deriving the Averaging Q-value iteration update. Remember that we had: $\theta_{i,a_t} := \theta_{i,a_t} + (r_t + \gamma \max_a Q(s_{t+1},a) - \theta_{i,a_t})\,\bar\phi_i(s_t)$. We rewrite this as: $\theta_{i,a_t} := \theta_{i,a_t} + (r_t + \gamma \max_a Q(s_{t+1},a))\,\bar\phi_i(s_t) - \theta_{i,a_t}\,\bar\phi_i(s_t)$, which for a single experience has the fixed point $\theta_{i,a_t} = \frac{(r_t + \gamma \max_a Q(s_{t+1},a))\,\bar\phi_i(s_t)}{\bar\phi_i(s_t)}$. The Q-value iteration backup operator is obtained by averaging over all states in the sweep.

Analysing the Averaging Q-value iteration update. This is averaging DP because we take the weighted average of all targets irrespective of the other parameter values for the current state, so parameter estimation is done in a non-cooperative way. Note that averaging RL has a problem. Consider two training examples, written as feature value → target: $0.5 \to 1$ and $1 \to 2$. The best value of the parameter would be 2, but averaging Q-value iteration will compute the value $5/3$. Note that this would also be the fixed point if averaging Q-learning were used: $(1 - 5/3) \cdot 0.5 + (2 - 5/3) \cdot 1 = 0$.
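A quick numeric check of this example (my own verification, mirroring the slide's arithmetic): the averaging fixed point is the φ-weighted mean of the targets.

```python
# Two training pairs (feature value, target): 0.5 -> 1 and 1 -> 2.
pairs = [(0.5, 1.0), (1.0, 2.0)]

# Averaging fixed point: sum(phi * target) / sum(phi) = 2.5 / 1.5 = 5/3.
theta_avg = sum(f * t for f, t in pairs) / sum(f for f, _ in pairs)
print(theta_avg)                                    # 1.666...

# Fixed-point condition from the slide: the weighted errors cancel out.
print(sum((t - theta_avg) * f for f, t in pairs))   # ~0.0 (up to rounding)
```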

Standard Q-value iteration with linear FAs (1). For standard value iteration with linear function approximators, the resulting algorithm looks a bit unconventional. First we introduce the value of a state-action pair when a particular parameter $i$ is left out: $Q_{-i}(s,a) = \sum_{j \neq i} \theta_{j,a}\,\bar\phi_j(s)$, so that $Q_{-i}(s,a) = 0$ when only feature $i$ is active.

Standard Q-value iteration with linear FAs (2). The standard Q-value iteration algorithm, which is co-operative but may diverge, is: $T(\theta_{i,a}) = \frac{1}{\sum_{s \in S_0} \bar\phi_i(s)^2} \sum_{s \in S_0} \bar\phi_i(s) \sum_{s'} P(s,a,s')\,(R(s,a,s') + \gamma \max_b \sum_j \bar\phi_j(s')\,\theta_{j,b} - Q_{-i}(s,a))$.

Deriving the Standard Q-value iteration update. This rule is obtained by examining the standard Q-learning rule with linear function approximators (with $\alpha = 1$). Remember that we had: $\theta_{i,a} := \theta_{i,a} + (r_t + \gamma \max_b Q(s_{t+1},b) - Q(s_t,a))\,\bar\phi_i(s_t)$. We can rewrite this as: $\theta_{i,a} := \theta_{i,a} + (r_t + \gamma \max_b Q(s_{t+1},b) - Q_{-i}(s_t,a))\,\bar\phi_i(s_t) - \theta_{i,a}\,\bar\phi_i(s_t)^2$, which for a single experience has the fixed point $\theta_{i,a} = \frac{(r_t + \gamma \max_b Q(s_{t+1},b) - Q_{-i}(s_t,a))\,\bar\phi_i(s_t)}{\bar\phi_i(s_t)^2}$.

Analysing the Standard Q-value iteration update. In the case of a tabular representation with only one state active, the resulting algorithm is again the same as conventional Q-value iteration. If we examine the same examples, $0.5 \to 1$ and $1 \to 2$, we can see that standard DP would compute the value 2, which is correct. The problem with this standard value-iteration algorithm with function approximators is that it may diverge, just as Q-learning may diverge.

Divergence of Q-learning with FAs. Off-policy standard RL methods such as online value iteration or standard Q-learning can diverge for a large number of function approximators, such as: linear neural networks, locally weighted learning, and radial basis networks. We will show an example in which infinite parameters are obtained when online value iteration is used. Online value iteration uses a model of the environment but only updates on visited states.

Example demonstrating divergence. The example is very simple. There is one state variable $s$ with value 0.5 or 1, and $\phi(s) := s$. The agent can select actions $a_t$ from $\{0.1, 0.2, 0.3, \ldots, 0.9, 1.0\}$. An absorbing state has been reached if $s = 1$; otherwise we set $s_{t+1} = 0.5$ if $a_t < 1$ and $s_{t+1} = 1$ if $a_t = 1$. The reward on all transitions is 0. The initial state is $s_0 = 0.5$.

Proof that divergence will be the result. The algorithm computes the following update if $s = 0.5$: $\theta := \theta + \alpha\,(\gamma\theta - 0.5\theta) \cdot 0.5$. And the following update if $s = 1$: $\theta := \theta + \alpha\,(0 - \theta)$. If the agent often selects random actions, in many cases the agent will be in state $s = 0.5$. Suppose it stays on average $h$ times in state $s = 0.5$ before making a step to $s = 1$. Then it will make the following average update: $\theta := \theta + \alpha\,((\gamma\theta - 0.5\theta) \cdot 0.5h - \theta)$.

This will lead to ever-increasing values for initial positive values of $\theta$ if: $h > \frac{1}{0.5\gamma - 0.25}$, where $\gamma$ is larger than 0.5.
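A minimal simulation sketch of this one-parameter example, assuming a single shared parameter θ, random action selection, and the per-step updates given on the previous slide; episode counts, α, and the print interval are my own choices.

```python
import random

# With gamma = 0.9 the agent stays in s = 0.5 about h = 10 steps on average,
# and h > 1 / (0.5 * gamma - 0.25) = 5, so theta should keep growing.
def run(gamma=0.9, alpha=0.1, episodes=200, seed=0):
    rng = random.Random(seed)
    theta = 1.0                                   # initial positive parameter value
    actions = [i / 10 for i in range(1, 11)]      # {0.1, 0.2, ..., 1.0}
    for ep in range(1, episodes + 1):
        s = 0.5
        while s != 1.0:
            # In s = 0.5 the bootstrapped target is gamma * theta (value of s = 1),
            # the current estimate is 0.5 * theta, and phi(s) = 0.5.
            theta += alpha * (gamma * theta - 0.5 * theta) * 0.5
            a = rng.choice(actions)               # random exploration
            s = 1.0 if a == 1.0 else 0.5
        theta += alpha * (0.0 - theta)            # absorbing state: target 0, phi = 1
        if ep % 50 == 0:
            print(f"episode {ep:4d}  theta = {theta:.3e}")

run()   # theta grows by several orders of magnitude, illustrating divergence
```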

Explaining divergence. In this example, for large enough $\gamma$ and enough exploration, the parameter will always increase: if the estimated value of the current state grows, then the estimated value of the best next state grows at the same time. This may lead to divergence. If instead we use a Monte Carlo return, and do not bootstrap at all, we do not have this problem. Divergence is also prevented if we use the averaging RL update or on-policy RL.

Experiments. We executed a number of experiments to compare standard and averaging DP with CMACs as a linear function approximator. The environment is a maze of size 51 × 51 with one goal state. Each state can be a starting state, except for the goal state. There are deterministic single-step actions North, East, South, and West. The reward for every action is -1.

Experimental results. As function approximator we use CMACs in which one tiling encodes the x-position and the other tiling the y-position, so there are two tilings with 51 states each. We first examine the error of the computed value function for standard and averaging Q-value iteration in an empty maze.
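A sketch of this two-tiling CMAC feature vector: one tiling over the x-position and one over the y-position of the 51 × 51 maze, giving 102 binary features of which exactly two are active per state. The indexing details are my own, not taken from the talk.

```python
import numpy as np

SIZE = 51  # the maze is 51 x 51

def cmac_features(x, y):
    # Two tilings: features 0..50 encode the x-position, 51..101 the y-position.
    phi = np.zeros(2 * SIZE)
    phi[x] = 1.0
    phi[SIZE + y] = 1.0
    return phi

def cmac_features_normalised(x, y):
    # Normalised version used by the averaging updates: entries sum to 1.
    phi = cmac_features(x, y)
    return phi / phi.sum()

# Example: state (3, 10) activates feature 3 (x-tiling) and feature 61 (y-tiling).
print(np.nonzero(cmac_features(3, 10))[0])   # [ 3 61]
```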

Experimental results. [Figure: two plots, "Results in the Empty Maze" and "Convergence in the Empty Maze", comparing Averaging DP and Standard DP. Left: value function error as a function of the discount factor; right: value function error as a function of the number of iterations.]

Discussion empty maze (1). In the empty maze we observe the following. For γ = 0, both algorithms compute the optimal value function. For γ = 1, standard DP is able to compute the perfect value function, but averaging DP makes a large error. For γ values between 0.05 and 0.9, the value function obtained by averaging DP is slightly better than the one obtained with standard DP.

Discussion empty maze (2). We also studied the influence of the initial parameter values. As expected, averaging DP always converged to the same value function. For standard DP, random initialisations between 0 and 100 and between 0 and 1000 led to convergence. For standard DP, if we initialised the parameters to values between 0 and 10000, the parameters always diverged.

Experiments on a maze containing a wall. We also examine the total number of steps from all initial states to the goal state for a maze with a wall. [Figure: the wall maze with goal state G, and a plot "Results in the Wall Maze" showing the total number of steps for Averaging DP and Standard DP as a function of the discount factor.]

Other projects. We want to use averaging RL for solving path-planning problems in outer space involving gravity forces from planets. Such a problem features continuous inputs (position, velocity, direction), and crashes into planets should be avoided. We plan to solve it with averaging RL combined with multi-layer perceptrons.

Other projects. A project of Sander Maas for Philips: learning to control a robot arm. For each joint two muscles are used. These are so-called McKibben actuators, which work on air pressure. This resembles human muscles, but for several reasons the behaviour of these muscles is highly non-linear.

Discussion. Of the two robot control problems, manipulation and locomotion, manipulation seems the most suitable for reinforcement learning. For locomotion, noise in the actions and sensors of the robot forces the controller to deal with a probabilistic state description. RL approaches using recurrent neural networks or hidden Markov models can be useful for locomotion. We have extended averaging RL to feed-forward neural networks and plan to use this algorithm for robot control.