Performance Comparison of Sarsa(λ) and Watkins's Q(λ) Algorithms

Karan M. Gupta
Department of Computer Science, Texas Tech University
Lubbock, TX 79409-3104
gupta@cs.ttu.edu

Abstract

This paper presents a performance comparison between two popular reinforcement learning algorithms: the Sarsa(λ) algorithm and Watkins's Q(λ) algorithm. The algorithms were implemented on two classic reinforcement learning problems, the pole balancing problem and the mountain car problem. Further work and improvements are suggested in the conclusion.

1 Introduction

It is necessary to know which algorithm to use for a specific reinforcement learning problem. This study aims to indicate which of the two algorithms performs better, and under which parameters, for two types of problems. The mountain car problem is an example of an episodic control task where things have to get worse (moving away from the goal) before they can get better. The pole balancing problem is an example of an episodic control task where the agent tries to stay at the goal at all times. This study is also a part of my course curriculum at the Department of Computer Science, Texas Tech University.

2 Theoretical Background

2.1 Reinforcement Learning

Reinforcement learning is the branch of artificial intelligence in which the agent learns from experience, and from experience only. No plans are given, and there are no explicitly defined constraints or facts. It is a computational approach to learning from interaction. The key feature of reinforcement learning is that an active decision-making agent works towards achieving some reward, which becomes available only upon reaching the goal. The agent is not told which actions it should take to reach the goal; instead it discovers the best actions to take in different states by learning from its mistakes. The agent monitors its environment at all times, because actions taken by the agent may change the environment and hence affect the actions subsequently available to it. The agent learns by assigning values to states and to the actions associated with every state. When the agent reaches a state that it has already learnt about, it can exploit its knowledge of the state space to take the best action. At times the agent takes random actions; this is called exploration. While exploring, the agent learns about regions of the state space that it would otherwise ignore if it only followed the best known actions. By keeping a good balance between exploitation and exploration, the agent is able to learn the optimal policy for reaching the goal. In all reinforcement learning problems, the agent uses its experience to improve its performance over time. For a detailed study of reinforcement learning please refer to [1].

The Sarsa algorithm (short for state, action, reward, state, action) is an on-policy temporal difference control algorithm. It is on-policy in the sense that it learns the action-value function for the very policy it follows when making the transition from one state-action pair to the next. It learns an action-value function for a policy π: the algorithm considers transitions from state-action pair to state-action pair and learns the value of the visited state-action pairs. It continuously estimates Q^π for the behaviour policy π, and at the same time changes π toward greediness with respect to the learned action-value function Q^π.
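To make this concrete, a one-step tabular Sarsa backup can be written in a few lines of C++. The sketch below is illustrative only and is not the paper's code: the table sizes and the epsilonGreedy helper are hypothetical, and the two-action assumption simply matches the tasks used later in the paper.

    #include <array>
    #include <cstdlib>

    // Hypothetical problem sizes; the paper's tasks use their own discretisations.
    constexpr int NUM_STATES  = 900;
    constexpr int NUM_ACTIONS = 2;

    using QTable = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

    // Pick a random action with probability epsilon, otherwise the greedy one
    // (written for two actions to keep the sketch short).
    int epsilonGreedy(const QTable& Q, int s, double epsilon) {
        if (static_cast<double>(std::rand()) / RAND_MAX < epsilon)
            return std::rand() % NUM_ACTIONS;
        return (Q[s][0] >= Q[s][1]) ? 0 : 1;
    }

    // One on-policy Sarsa backup for the transition (s, a, r, s', a').
    void sarsaUpdate(QTable& Q, int s, int a, double r, int sNext, int aNext,
                     double alpha, double gamma) {
        const double target = r + gamma * Q[sNext][aNext]; // bootstrapped return
        Q[s][a] += alpha * (target - Q[s][a]);              // move Q toward target
    }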

The Q-learning algorithm is an off-policy temporal difference control algorithm. It learns the optimal policy while following some other (non-greedy, e.g. ε-greedy) policy. The learned action-value function Q directly approximates Q*. The policy being followed changes as the action-value function is learnt and more accurate values for every state-action pair are generated.

When the Sarsa and Q-learning algorithms are augmented with eligibility traces and TD(λ) methods, they are known as the Sarsa(λ) and Q(λ) algorithms respectively. The Q-learning algorithm was developed by Watkins in 1989, hence the name Watkins's Q(λ). The λ in both algorithms refers to the use of n-step backups for Q^π. These are still temporal difference learning methods, because they change an earlier estimate based on how it differs from a later estimate. The TD(λ) methods differ from the basic TD method in that the backup is made over n steps and not after every single step; λ determines how the n-step backups are weighted. The practical way of implementing this kind of method is with eligibility traces.

An eligibility trace is associated with every state-action pair in the form of an additional memory variable. The eligibility trace for every state-action pair decays at every step by γλ, where γ is the discount rate. The eligibility trace for the state-action pair visited on that step is changed depending on the kind of eligibility traces being implemented. There are two ways to implement eligibility traces: accumulating traces and replacing traces. With accumulating traces, the eligibility trace for the visited state-action pair is incremented by 1; with replacing traces, it is set to 1. In essence, eligibility traces keep a record of the state-action pairs that have recently been visited, and the degree to which each state-action pair is eligible for undergoing learning changes. For the pair visited at time t, the two variants give

    e_t(s,a) = 1                          for s = s_t, a = a_t (replacing traces)
    e_t(s,a) = γλ e_{t−1}(s,a) + 1        for s = s_t, a = a_t (accumulating traces)

while every other pair simply decays: e_t(s,a) = γλ e_{t−1}(s,a).

Figure 1: Sarsa(λ) backup diagram.

2.2 Sarsa(λ)

When eligibility traces are added to the Sarsa algorithm it becomes the Sarsa(λ) algorithm. The basic algorithm is similar to the Sarsa algorithm, except that backups are carried out over n steps and not just over one step. An eligibility trace is kept for every state-action pair. The update rule for this algorithm is

    Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)   for all s, a,

where

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)

and

    e_t(s,a) = 1                    if s = s_t and a = a_t,
    e_t(s,a) = γλ e_{t−1}(s,a)      otherwise.

The backup diagram for the Sarsa(λ) algorithm is shown in Figure 1, and the algorithm itself is shown in Figure 2. This implementation uses replacing traces. The algorithm runs for a large number of episodes, until the action-value function converges to a reasonably accurate value. At every step it chooses an action a_{t+1} for the next state s_{t+1}; this decision is made according to the policy being followed by the algorithm. It then calculates δ_t from the equation above. Depending on the kind of traces being used, the eligibility trace for the current state-action pair, e_t(s_t, a_t), is updated. In the next step the action value for every state-action pair is updated using the update rule above, and the eligibility trace for every state-action pair is decayed by γλ. This process continues iteratively for every episode until a terminal state is encountered.
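A minimal C++ sketch of this update, assuming tabular Q and e arrays of hypothetical size and replacing traces, might look as follows; it mirrors the tabular algorithm of Figure 2 below rather than the paper's own implementation.

    #include <array>

    constexpr int NUM_STATES  = 900;   // hypothetical sizes for illustration
    constexpr int NUM_ACTIONS = 2;

    using Table = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

    // One Sarsa(λ) backup with replacing traces for the transition
    // (s, a, r, s', a'). Q and E are updated in place for every state-action pair.
    void sarsaLambdaUpdate(Table& Q, Table& E,
                           int s, int a, double r, int sNext, int aNext,
                           double alpha, double gamma, double lambda) {
        const double delta = r + gamma * Q[sNext][aNext] - Q[s][a]; // TD error
        E[s][a] = 1.0;                       // replacing trace for the visited pair
        for (int i = 0; i < NUM_STATES; ++i) {
            for (int j = 0; j < NUM_ACTIONS; ++j) {
                Q[i][j] += alpha * delta * E[i][j]; // credit recently visited pairs
                E[i][j] *= gamma * lambda;          // decay all traces
            }
        }
    }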

    Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
    Repeat (for each episode):
        Initialize s, a
        Repeat (for each step of episode):
            Take action a, observe r, s'
            Choose a' from s' using the behaviour policy (e.g. ε-greedy)
            δ ← r + γQ(s',a') − Q(s,a)
            e(s,a) ← 1
            For all s, a:
                Q(s,a) ← Q(s,a) + αδe(s,a)
                e(s,a) ← γλe(s,a)
            s ← s'; a ← a'
        until s is terminal

Figure 2: Tabular Sarsa(λ) algorithm (replacing traces).

2.3 Watkins's Q(λ)

The Q(λ) algorithm is similar to the Q-learning algorithm, except that it uses eligibility traces, and learning within an episode stops at the first non-greedy action taken. Watkins's Q(λ) is an off-policy method. As long as the policy being followed selects greedy actions, the algorithm keeps learning the action-value function for the greedy policy. But when an exploratory action is selected by the behaviour policy, the eligibility traces for all state-action pairs are set to zero, so learning stops for that episode. If a_{t+n} is the first exploratory action, then the longest backup is toward

    r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n max_a Q_t(s_{t+n}, a).

The backup diagram for this algorithm is shown in Figure 3.

Figure 3: Watkins's Q(λ) backup diagram; the backup is truncated at the first non-greedy action.

The eligibility traces are updated in two steps. First, if an exploratory action was taken, the traces for all state-action pairs are set to zero; otherwise the traces for all state-action pairs are decayed by γλ. In the second step, the trace for the current state-action pair is either incremented by 1 (accumulating traces) or set to 1 (replacing traces). With accumulating traces this gives

    e_t(s,a) = I_{ss_t} · I_{aa_t} + γλ e_{t−1}(s,a)   if Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a),
    e_t(s,a) = I_{ss_t} · I_{aa_t}                     otherwise,

where I_{xy} is 1 if x = y and 0 otherwise. The action-value function is updated as

    Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a),

where

    δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t).

The complete algorithm is shown in Figure 4. It is conceptually the same as Sarsa(λ), but it updates the action-value function using the value of the greedy action at the next state.

    Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
    Repeat (for each episode):
        Initialize s, a
        Repeat (for each step of episode):
            Take action a, observe r, s'
            Choose a' from s' using the behaviour policy (ε-greedy on Q)
            a* ← arg max_b Q(s',b)   (if a' ties for the max, then a* ← a')
            δ ← r + γQ(s',a*) − Q(s,a)
            e(s,a) ← 1
            For all s, a:
                Q(s,a) ← Q(s,a) + αδe(s,a)
                if a' = a*, then e(s,a) ← γλe(s,a)
                else e(s,a) ← 0
            s ← s'; a ← a'
        until s is terminal

Figure 4: Tabular Watkins's Q(λ) algorithm (replacing traces).

3 Implementation

The algorithms were implemented on the pole balancing and mountain car problems. The code is written in C++ and was compiled with version 3.2 of the g++ compiler; a makefile is included with the code. The code has been tried on RedHat Linux 8 and on Sun machines running Solaris 9. It works on both, but takes longer to converge on the Sun machines.
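Although the paper's own C++ source is not reproduced here, the core Watkins's Q(λ) backup of Section 2.3 can be sketched as follows; the table sizes and helper names are hypothetical, not taken from the paper's code.

    #include <array>

    constexpr int NUM_STATES  = 900;   // hypothetical sizes for illustration
    constexpr int NUM_ACTIONS = 2;

    using Table = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

    int greedyAction(const Table& Q, int s) {
        int best = 0;
        for (int j = 1; j < NUM_ACTIONS; ++j)
            if (Q[s][j] > Q[s][best]) best = j;
        return best;
    }

    // One Watkins's Q(λ) backup for (s, a, r, s'), where aNext is the action
    // actually chosen by the behaviour policy in s'.
    void watkinsQLambdaUpdate(Table& Q, Table& E,
                              int s, int a, double r, int sNext, int aNext,
                              double alpha, double gamma, double lambda) {
        const int aStar = greedyAction(Q, sNext);
        const double delta = r + gamma * Q[sNext][aStar] - Q[s][a]; // off-policy TD error
        E[s][a] = 1.0;                                              // replacing trace
        // As in Figure 4, an action that ties for the maximum still counts as greedy.
        const bool exploratory = (Q[sNext][aNext] < Q[sNext][aStar]);
        for (int i = 0; i < NUM_STATES; ++i) {
            for (int j = 0; j < NUM_ACTIONS; ++j) {
                Q[i][j] += alpha * delta * E[i][j];
                // Cut all traces as soon as a non-greedy action is taken.
                E[i][j] = exploratory ? 0.0 : gamma * lambda * E[i][j];
            }
        }
    }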

3.1 Pole Balancing Problem

The pole balancing problem is a classic reinforcement learning problem. The basic idea is to balance a pole attached by a hinge to a movable cart. Forces are applied to the cart, which can only move along a track of finite length, to prevent the pole from falling over. A failure occurs if the pole falls past a given angle from vertical; the cart reaching the end of the track also counts as a failure. The pole is reset to vertical after each failure. The implementation treats the problem as an episodic task without discounting, where each episode terminates when a failure occurs. The return is maximized by keeping the pole balanced for as long as possible. A graphical view is shown in Figure 5.

Figure 5: The pole balancing problem.

The entire state space is represented as a combination of four variables: x, ẋ, θ and θ̇ (cart position, cart velocity, pole angle and pole angular velocity). There are two actions: apply a force pushing the cart to the left, or apply a force pushing it to the right. An eligibility trace is associated with every state-action pair. The action-value function Q is maintained as a two-dimensional array indexed by state and action, so the Q function and the eligibility traces are defined in the program as Q[state][action] and ET[state][action].

The program starts out with an ε value of 0.99999 for the first one hundred episodes. After that, ε is decayed every hundred episodes by multiplying it with itself. The pseudo-code for this is:

    if (ε > 0)
        if (episode mod 100 == 0)
            ε ← ε × ε

The state space is divided into regions, since it would be computationally infeasible to generate a single state for every possible combination of x, ẋ, θ and θ̇. To facilitate computation, the range of every variable is divided into 15 equal regions. Every value of every variable is assigned an index based on the region it falls in, and the four indexes are then combined to map to a single state. By this method there are 15 × 15 × 15 × 15 = 50625 unique states, which is a computationally manageable number. This mapping is carried out by the getstate(x, ẋ, θ, θ̇) function, which takes the values of the four variables as arguments and returns a unique number which is the state.
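A sketch of this kind of discretisation is shown below. The variable bounds used here are hypothetical placeholders (the paper's actual limits live in its simulation code), and the polynomial weighting in the comment is just one way of combining the four indexes.

    #include <cmath>

    constexpr int BINS = 15;   // 15 regions per variable, as in the paper

    // Map a value to a bin index in [0, BINS-1], given assumed bounds.
    int binIndex(double value, double minVal, double maxVal) {
        const double div = (maxVal - minVal) / BINS;               // width of one region
        int idx = static_cast<int>(std::ceil((value - minVal) / div)) - 1;
        if (idx < 0) idx = 0;
        if (idx >= BINS) idx = BINS - 1;                           // clamp outliers
        return idx;
    }

    // Combine the four per-variable indexes into one of 15^4 = 50625 states,
    // using one possible polynomial combination (weights 15^3, 15^2, 15, 1).
    // The bounds below are illustrative, not the paper's actual limits.
    int getState(double x, double xDot, double theta, double thetaDot) {
        const int xIdx     = binIndex(x,        -2.4,  2.4);
        const int xDotIdx  = binIndex(xDot,     -3.0,  3.0);
        const int thIdx    = binIndex(theta,    -0.21, 0.21);
        const int thDotIdx = binIndex(thetaDot, -2.0,  2.0);
        return ((xIdx * BINS + thDotIdx) * BINS + xDotIdx) * BINS + thIdx;
    }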

The getstate mapping is computed, for the value of x, as

    xindex = int(ceil((x − minx) / xdiv)),   where xrange = maxx − minx and xdiv = xrange / 15,

and the values of ẋindex, θindex and θ̇index are calculated similarly. Finally the state index is calculated by a simple polynomial function:

    stateindex = int(15·ẋindex + θindex + 15³·xindex + 15²·θ̇index).

The ε-greedy action is decided using the function egreedy(Val1, Val2, ε), which returns the number of the action to be taken; Val1 and Val2 are the action values of the state at which the decision is to be made. To take a greedy action, a value of 0.0 is passed for ε.

The code was written for both the Sarsa(λ) and Q(λ) algorithms. Both implementations use replacing traces to update the eligibility trace. For both algorithms the iteration only ends when the action-value function has converged to fixed values; this was deduced by running the algorithms until the number of steps taken per episode stopped changing across consecutive episodes. This is the convergence condition for the algorithms.

3.2 Mountain Car Problem

Figure 6: The mountain car problem.

The mountain car problem is another classic problem in reinforcement learning. It involves driving an underpowered car up a steep mountain road. The car's engine is not strong enough to accelerate the car up the steep slope, so the car must first back up onto an opposite slope on the left and build up enough inertia to carry itself up to the goal. By repeatedly moving forward and backward between the two slopes, the car can eventually build up enough inertia to reach the top of the steep hill. The idea behind this experiment is to make the car learn how to reach its goal while minimizing the number of oscillations between the two slopes. The basic idea is shown in Figure 6.

The action-value function for the state-action pairs is again a two-dimensional array, represented as Q[state][action], and there is an eligibility trace associated with every state-action pair, represented as ET[state][action]. The state is a function of two parameters: the position of the car and the velocity of the car. As in the pole balancing problem, the state space is divided into equal regions for reasons of computational feasibility. However, the state space here is divided into 30 regions per parameter rather than 15, which gives a finer-grained mapping from parameter values to states. There are 30 × 30 = 900 unique states. The mapping from parameter values to a single unique state is carried out by the function getstate(pos, vel), which returns a unique value for the state depending on the values of pos and vel. It works similarly to the getstate function used in the pole balancing problem: posindex and velindex are obtained by dividing the range of each parameter into 30 equal regions (posdiv = posrange / 30) and finding the region that pos and vel fall into. Finally the state index is calculated as

    stateindex = int(30·posindex + velindex).
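A corresponding sketch for the mountain car mapping, assuming the conventional position and velocity bounds for this task (the paper's own range constants are not given in the text and are treated here as assumptions):

    #include <cmath>

    constexpr int MC_BINS = 30;   // 30 regions per parameter, 30 * 30 = 900 states

    // Assumed bounds (the usual mountain-car limits); the paper's simulation
    // code defines its own range constants.
    constexpr double MIN_POS = -1.2,  MAX_POS = 0.5;
    constexpr double MIN_VEL = -0.07, MAX_VEL = 0.07;

    int mcBin(double value, double minVal, double maxVal) {
        const double div = (maxVal - minVal) / MC_BINS;
        int idx = static_cast<int>(std::ceil((value - minVal) / div)) - 1;
        if (idx < 0) idx = 0;
        if (idx >= MC_BINS) idx = MC_BINS - 1;
        return idx;
    }

    // stateindex = 30 * posindex + velindex, as in the text above.
    int getState(double pos, double vel) {
        return MC_BINS * mcBin(pos, MIN_POS, MAX_POS)
             + mcBin(vel, MIN_VEL, MAX_VEL);
    }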

The ε-greedy action is decided using the function egreedy(Q, state, ε), which returns the number of the action to be taken; Q is the action-value function and state is the state for which the decision is to be made. To take a greedy action, a value of 0.0 is passed for ε.

The code was written for both the Sarsa(λ) and Q(λ) algorithms. Both implementations use replacing traces to update the eligibility trace. For both algorithms the iteration only ends when the action-value function has converged to fixed values; this was deduced by running the algorithms until the number of steps taken per episode stopped changing across consecutive episodes. This is the convergence condition for the algorithms.

4 Results

The results have been averaged over 2 episodes to obtain the graphs. Starting from a value of 0.99999, ε was decayed every 100 episodes. The other parameter values were γ = 1 and α = 1; α and γ were kept constant throughout the experiments. Watkins's Q(λ) was observed to be much more stable for the mountain car problem. The values of λ that were used are 0.0, 0.2, 0.5, 0.7, 0.9 and 1.0. All graphs are based on these values and plot the number of steps taken on the y-axis against the number of episodes on the x-axis.

4.1 Pole Balancing Problem

The results for the pole balancing problem are skewed because the program took too long to converge; it had not yet converged for some values of λ at the time of writing of this document. Modifications to the code could be a possible solution to this problem.

Figure 7: Pole balancing, λ = 0.0.
Figure 8: Pole balancing, λ = 0.2.
Figure 9: Pole balancing, λ = 0.5.
Figure 10: Pole balancing, λ = 0.7.
Figure 11: Pole balancing, λ = 0.9.
Figure 12: Pole balancing, λ = 1.0.

4.2 Mountain Car Problem

The results for the mountain car problem are very precise and informative. We must remember that the mountain car problem is a unique type of problem, because the agent has to move away from the goal in order to get to it.

Figure 13: Mountain car, λ = 0.0.
Figure 14: Mountain car, λ = 0.2.
Figure 15: Mountain car, λ = 0.5.
Figure 16: Mountain car, λ = 0.7.
Figure 17: Mountain car, λ = 0.9.

All results have been averaged over 2 episodes, and the number of steps per episode was capped at a maximum value. Looking at Figure 13, Sarsa(λ) converges earlier than Q(λ): it converges at episode 225, whereas Q(λ) converges at episode 285. However, Q(λ) converges to a lower value, and hence has learnt better than Sarsa(λ). Comparing Figures 13, 14 and 15, we observe a trend in the way the results are obtained.

4 4 "MC-L1..sarsa" "MC-L1..q" I would like to thank Dr. Larry Pyeatt for providing the simulation code for both the problems. The figures for the backup diagrams and the drawings for the pole-balancing and mountain car problems have been borrowed from Dr. Sutton and Dr. Barto s website, http://www-anw.cs.umass.edu/ rich/book/thebook.html 3 3 2 References 2 [1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, England, 22. Figure 18: Mountain Car, λ 1 Q(λ) algorithm always converges to a value lower than Sarsa(λ) for the same value of λ. However, Sarsa(λ) always converges earlier than Q(λ). As λ increases both the algorithms converge to a lower value than what they had with a lower λ. However, both the algorithms now take longer to converge. Unfortunately, the programs for Sarsa(λ) for λ =.7 and λ =.9 had not converged at the time of writing of this paper. But they had been observed to have converged at an earlier time during the experiment. Interesting results were observed for Sarsa(λ) with λ = 1.. The relevant graph is shown in Figure 18. This is a Monte Carlo implementation. The value function converges to the worst possible value. It does not seem to be affected by any changes in ε. 5 Conclusion Improvements to the algorithms are desired. Further experiments could include decaying the value of α during the course of a run. Furthermore, I believe that the value of α used in these experiments was too high. Also, the decay of ε could be modified to get better results. 6 Acknowledgements 8