Reinforcement Learning on the Lego Mindstorms NXT Robot. Analysis and Implementation.


ESCUELA TÉCNICA SUPERIOR DE INGENIEROS INDUSTRIALES
Departamento de Ingeniería de Sistemas y Automática

Master Thesis

Reinforcement Learning on the Lego Mindstorms NXT Robot. Analysis and Implementation.

Author: Ángel Martínez-Tenor
Supervisors: Dr. Juan Antonio Fernández-Madrigal, Dr. Ana Cruz-Martín

Master Degree in Mechatronics Engineering
Málaga, May 13, 2013


Extended Abstract

This Master Thesis deals with the use of reinforcement learning (RL) methods in a small mobile robot. In particular, we are interested in those methods that do not require any model of the system and are based on a set of rewards defining the task to learn. RL allows the robot to find quasi-optimal actuation strategies based on the robot-environment interaction. This work proposes the implementation of a learning method for an obstacle-avoidance wandering task, based on the Q-learning algorithm, for the Lego Mindstorms NXT robot. A state-of-the-art review and a research proposal are included.

Keywords. Q-learning, reinforcement learning, artificial intelligence, Mindstorms, mobile robot, TD method (Temporal Difference), MDP (Markov Decision Process).

State of the art

The present work pursues that a mobile robot learns by itself the effects of its actions. Reinforcement learning (RL) methods, based on stochastic actuation models, will be employed as the mechanism that allows the robot to feed back the effects of its actions, thus leading its own development. There is a growing interest in mobile robot applications in which the environment is not previously prepared for the robot, as occurs in home service robotics, where human presence is common [1] [2] [3] [4]. Most service mobile robots are explicitly preprogrammed with high-level control architectures, requiring advanced software development techniques [5]. It would be desirable to have methods that allow the robot itself to evolve, starting from the most basic control level up to higher control architectures; this would spare engineers from spending too much time developing and using these advanced techniques.

The autonomous learning concept dates back to Alan Turing's idea of robots learning in a way similar to a child [6]. This topic evolved into the current concept of developmental robotics [1], an emergent paradigm analogous to the concept of development in human psychology. In autonomous learning, mechanisms for decision-making under uncertainty are generally employed [7] [2] [1]. These mechanisms represent cognitive processes capable of selecting a sequence of actions that we expect to lead to a specific outcome. Examples of these processes can be found in economic and military decisions [8]. In particular, Markov Decision Processes represent the most exploited subfield. They are Dynamic Bayesian Networks (DBN) in which the nodes represent system states and the arcs represent actions along with their expected reward. A selected action is executed at each step, reaching a new state and obtaining a reward, both stochastically [8] [7]. The objective in a decision-making problem is calculating a policy, a stationary function defining which action a is executed when the agent is in state s. Classic algorithms such as Value iteration and Policy iteration are used to converge to the optimal policy given a maximum distance error. However, a detailed stochastic model of the system is required for obtaining reliable results.

The alternative, model-free decision-making process has no information about the transition T(s,a,s') and reward R(s,a,s') functions of the DBN. The solution is: knowing that we are in state s, execute action a and read which state s' we reach and which associated reward R(s'|s,a) we obtain. In other words, the lack of a model is overcome by making observations. This is the basis of the RL concept. Hence, an RL problem is a continuous process which evolves from an initial policy to a near-optimal policy by executing actions, making observations, and updating its policy values [8] [7]. In the last decades, RL has helped to solve many problems in different fields, such as game theory, control engineering and statistics. That makes RL an interesting candidate as a basic learning mechanism for mobile robots in the context of developmental robotics.

Many RL publications have demonstrated the efficacy of Watkins' Q-learning algorithm [9], within the Temporal Difference learning group of methods, among others [8] [2] [10]. In parallel to Temporal Difference learning, Monte Carlo methods, Dynamic Programming and Exhaustive Search can also be found in RL approaches [8]. All these methods have a solid theoretical basis and have been addressed by many researchers in the last two decades. Moreover, there is a tendency towards extending and combining these techniques, resulting in hierarchical Q-learning, neural Q-learning, partially observable Markov decision processes, policy gradient RL, quantum-inspired RL, and actor-critic methods [2] [3]. In robotics, however, RL has mostly been used for solving isolated problems [11] [4] [12], such as navigation. In particular, the use of the Q-learning algorithm as a learning method in a real robot interacting with its environment, although it represents an under-exploited topic, can be found in some Lego Mindstorms NXT applications, such as obstacle avoidance [13], line following [14], walking [15], phototaxis behavior [16], and pattern-based navigation [17]. There are also more advanced RL works in real robots, including the Lego Mindstorms NXT, which are out of the scope of this work.

A review of the above publications reveals that the difficulty of convergence to quasi-optimal solutions in RL, including Q-learning, is a well-known problem. The learning process often remains incomplete or falls into local maxima when the Q-learning parameters are not properly tuned [10]. In a real scenario, these problems are aggravated by real-time constraints. In that case, the Q-learning parameters define whether the learning process can produce appropriate results in a few minutes or in several months instead. Thus, real-robot Q-learning parameter tuning can be quite different from simulation parameter tuning. Besides, real-robot-based works do not address, from a global perspective, the step of evolving from Q-learning off-line simulations to robot implementations. As a result, the Q-learning parameters need to be analyzed and tuned for each task/system combination. The lack of a flexible generic mechanism in this topic is the main motivation of this proposal. The Q-learning parameters are summarized here [10] [8]:

- Reward, the instant recompense obtained in each learning step. The task an agent has to learn is indirectly defined by a set of rewards.
- Exploitation/exploration strategy, the rule that selects the action to perform at each learning step. This strategy determines whether the agent exploits its learned policy or explores new unknown actions.
- Learning rate, or how much the new knowledge influences the learning process at the expense of previously acquired knowledge [18].
- Discount rate, the forecast degree of the learning process. It sets how the current reward will influence the Q-matrix values and policy values, both representing the internal state of the Q-learning algorithm, in the next steps with respect to future steps. This parameter defines whether the learning process will generate strategies with small or large look-ahead sequences of steps, and the amount of steps needed for learning an adequate policy.

Research proposal

The proposed work addresses the design and implementation of a Q-learning based method in a small mobile robot so as to autonomously develop an obstacle-avoidance task. The educational robot Lego Mindstorms NXT meets both the hardware and the software requirements of this work. Our setup proposal employs a differential-vehicle configuration. This framework allows us to perform parallel studies based on both robot and environment models. In addition, this work involves solving computing problems caused by the lack of real-number processing in the robot: a proper quantization of sensor values and available actions must be designed, and a fixed-point or floating-point notation system must be employed.

The study will begin with a simple wandering task employing a reduced number of states and actions. This configuration will allow us to implement an off-line learning process model in which the Q-learning parameters can be analyzed and tuned for the learning process to converge within a small amount of time. In the next step, the resulting parameters will be used in the real robot. An analysis of the divergences from the off-line learning process will be carried out to detect the implications of both the real-time nature of experimentation in a physical environment and the numerical problems mentioned above. Afterwards, the number of states and actions of the system will be increased in order to evaluate its learning capacity in a complete obstacle-avoidance navigation task. As a result, this work leads to the design of a modified Q-learning based method for a real small robot, focused on simplicity and applicability, and with enough flexibility for adapting the resulting techniques to other tasks and different systems.


Contents

1 Introduction
  1.1 Description of the work
  1.2 Objectives
  1.3 Methodology
  1.4 Scheduling
  1.5 Materials
  1.6 Content of the thesis

2 Preliminary study. Reinforcement learning and the Q-learning algorithm
  2.1 A brief introduction
    2.1.1 Model-based decision-making processes
    2.1.2 Model-free decision-making processes. The Q-learning algorithm
  2.2 Q-learning: The algorithm
  2.3 Q-learning practical implementations on Lego Mindstorms NXT robots
  2.4 Other reinforcement learning techniques

3 Simple wandering task (I). Design and Simulation
  3.1 Mobile robot configuration
  3.2 Task definition: 4 states + 4 actions
  3.3 Models
    3.3.1 Reward function
    3.3.2 Transition function and system model
  3.4 Simulation. Q-learning parameters tuning
    3.4.1 Reward tuning
    3.4.2 Exploitation-exploration strategy
    3.4.3 Learning rate and discount rate tuning

4 Simple wandering task (II): Robot implementation
  4.1 Translation of simulation code into NXC
    4.1.1 NXT_io library
    4.1.2 Main program
  4.2 Q-learning algorithm with CPU limitations
    4.2.1 Overflow analysis
    4.2.2 Precision analysis
    4.2.3 Memory limits
  4.3 Implementation results

5 Complex wandering task
  5.1 Wander-2 task
  5.2 Obstacle avoidance task learning: wander-3 implementation and results
  5.3 Summary

6 Conclusions and future work

A Implementation details
  A.1 Debugging, compilation, and NXT communication
  A.2 Datalogs
  A.3 Brick keys events
  A.4 Online information through the display and the loudspeaker

B Octave/Matlab scripts, functions, and files generated by the robot
  B.1 modeldata.m (wander-1)
  B.2 simulatesimpletask_ConceptualModel.m (wander-1)
  B.3 simulatesimpletask_FrequentistModel.m (wander-1)
  B.4 Q_learning_simpletask_Conceptual.m (wander-1)
  B.5 Q_learning_simpletask_Frequentist.m (wander-1)

C Source codes developed for robot implementations (NXC)
  C.1 NXT_io.h
  C.2 simple_task_learning.nxc (wander-1)
  C.3 5s_learning.nxc (wander-2)
  C.4 final_learning.nxc (wander-3)
  C.5 simpletaskdatacollector.nxc (wander-1)

Bibliography

List of Figures

1.1 Robot Lego Mindstorms NXT
2.1 MDP robotic example
2.2 Q-learning algorithm
3.1 NXT basic mobile configuration
3.2 NXT two-contact bumper
3.3 NXT final setup
3.4 Reward function
3.5 Wander-1 transition function T(s,a,s') represented as P(s'|s,a) in the conceptual model
3.6 Wander-1 transition function T(s,a,s') represented as P(s'|s,a). Frequentist model
3.7 T(s,a,s') differences between conceptual and frequentist model
3.8 Q-learning algorithm used in all simulations
3.9 Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptual model
3.10 Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptual model
3.11 Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model
3.12 Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model
4.1 Q-learning algorithm in NXC
4.2 Function observestate() written in NXC
4.3 Function obtainreward() written in NXC
4.4 Q-learning algorithm implemented for the analysis of precision
4.5 Q-matrix differences between offline and NXT simulation
4.6 Step in which the optimal policy was learned for different FP, γ and α in the NXT robot
4.7 Q-matrix values with FP=1000 and FP=10000 (Q-matrix values are multiplied by FP)
4.8 Differences in Q-matrix values between FP=1000 and FP=10000
4.9 NXT memory restrictions for states/actions: example of memory needed (left), number of states and actions available (center and right)
4.10 Scenario A for the wander-1 task implementation
4.11 Wander-1 learning implementation. Summary of experiments (* divided by FP)
4.12 Wander-1 learning implementation. Examples of the resulting Q-matrix
4.13 Wander-1 task implementation. Robot during the learning process
4.14 Wander-1 task implementation. Robot exploiting the learned policy

5.1 Angle restriction of the ultrasonic sensor for detecting frontal obstacles
5.2 ObserveState() routine for the wander-2 task
5.3 Scenario B for the wander-2 implementation
5.4 Wander-2 learning implementation: summary of experiments for scenario A (* divided by FP)
5.5 Wander-2 learning implementation: summary of experiments for scenario B (* divided by FP)
5.6 Wander-2 task implementation. Robot during the learning process
5.7 Wander-2 task implementation. Robot exploiting the learned policy
5.8 Implementation of the exploration/exploitation strategy for the wander-3 task
5.9 Scenario C for the wander-3 task implementation
5.10 Results of the wander-3 task implementation test in scenario A
5.11 Results of the wander-3 task implementation tests in scenario C
5.12 Wander-3 task implementation. Robot exploiting the learned policy (top to bottom, left to right)
A.1 Brick Command Center v3.3
A.2 Datafile generated by the wander-2 (five-states) task program on the robot
A.3 NXT display with the learning information of the wander-1 task

List of Tables

3.1 Wander-1 task description
3.2 Adjustment of rewards
3.3 Wander-1 optimal policy
3.4 Best Q-learning parameters selected after simulation tests
3.5 Simulation results with the selected parameters. Frequentist model
4.1 Additional implemented techniques for experimenting with the Lego robot
4.2 Content of the NXT_io.h library
4.3 Rewards definition in the robot
4.4 Limits in the Q-matrix values to avoid overflow when using a fixed-point notation system
4.5 Q-matrix reference values obtained from Octave simulation. The optimal policy (values in bold) was learned in step
4.6 Q-matrix values from the NXT simulation. The optimal policy (values in bold) was learned in step 298 (note: Q-matrix values have been multiplied by FP)
4.7 Parameters of the wander-1 task implementation
5.1 Wander-2 task description
5.2 Wander-2 task optimal policy
5.3 Wander-2 task implementation parameters
5.4 Wander-3 task. Description of states and actions
5.5 Wander-3 task distance ranges considered in the ultrasonic sensor
5.6 Wander-3 task implementation parameters


Chapter 1 Introduction

This chapter contains a description of a thesis developed for the MSc degree in Mechatronics Engineering of the University of Málaga. The objectives, methodology, project phases and materials involved in this work are summarized here. The content of the other chapters is briefly described in the last section.

1.1 Description of the work

This master thesis aims at the design and implementation, on the educational robot Lego Mindstorms NXT, of reinforcement learning methods capable of automatically learning the best way of performing specific tasks. The selected learning method has the advantage of not requiring a model of the environment in which the robot will operate; instead, it is based on a set of rewards which define the objective of the task we want to learn. In general, reinforcement learning (RL) methods allow finding quasi-optimal actuation strategies based on the environment-robot interaction.

Among reinforcement learning methods, this work focuses on Watkins' Q-learning algorithm [9]. Publications on this topic show the simplicity and efficacy of Q-learning algorithm implementations compared to other methods. Most publications, however, address the theoretical and mathematical sides of the methodology, with studies that are generally based on simulations. Works based on real implementations usually combine reinforcement learning with other advanced techniques, resulting in very complex methods such as hierarchical Q-learning, neural Q-learning, policy gradient reinforcement learning, quantum-inspired RL, actor-critic methods, and combinations of Monte Carlo and TD methods. In the present work we are interested in the use of the Q-learning algorithm as a learning technique on a real robot in continuous interaction with its environment, in particular when the robot has important software and hardware limitations, as the Lego Mindstorms NXT does. A review of the state of the art on the use of RL with Lego robots has shown that the learning of isolated tasks, such as floor-marked path following, obstacle avoidance or phototaxis behavior, is generally pursued. Here, we will focus on a minimalist implementation of a Q-learning method that does not go into advanced or combined techniques, leaving them outside the scope of our study.

In order to address this problem correctly, we will take into account the following considerations:

- According to the context of this work, belonging to the Master Degree in Mechatronics Engineering, it will be interesting to employ the same differential vehicle configuration as the one used in classes. This framework will allow us, firstly, to develop on a well-known configuration, and secondly, to design Q-learning practice exercises for future Master students.

- This work also involves the resolution of computing problems caused by the inability of the robot CPU to process real numbers. Hence, a study leading to a proper quantization of sensor values and available actions is essential, along with an analysis of the loss of precision due to the fixed-point or floating-point notation systems needed by the Q-learning algorithm.

We will begin this work by learning a simple wandering task with a reduced number of states and actions. This set-up will allow us to implement an off-line model of the learning process in which the Q-learning parameters can be studied. These parameters can be adjusted in order to favor an optimal or suboptimal learning convergence within a relatively small amount of time. After that, we implement and execute the obtained solution on the robot, so as to analyze the divergences from the off-line process. This analysis will focus, on the one hand, on the real-time nature of the experimentation in a physical environment, and, on the other hand, on the above-mentioned numerical problems. The study of this last topic is particularly interesting considering that it would allow us to evaluate the use of these techniques in other microcontroller-based systems. Once the above problems are solved, we will proceed to gradually enlarge the number of states and actions of our study, adding ultrasonic, wheel-encoder and contact sensors to our system. In this way, we evaluate the capacity of learning a relatively complex action sequence for an obstacle-avoidance task.

1.2 Objectives

The main objective of this thesis is the implementation of a Q-learning based method for the Lego Mindstorms NXT robot to learn an obstacle-avoidance wandering task. The difficulty of the convergence to optimal or quasi-optimal solutions in reinforcement learning, including the Q-learning algorithm, is a well-known problem, since the learning process itself may remain incomplete or fall into local maxima. This occurs easily when the Q-learning parameters, or even the rewards, are not properly tuned. In a real robot these problems are aggravated by real-time constraints: the above parameters, along with the selected exploration/exploitation strategy, will determine whether the process of learning a reasonable policy can be completed in a few minutes or delayed up to several months.

In order to achieve the goal of our work efficiently, avoiding unnecessary backtracking, we have established the following milestones:

1. Simple wandering task learning (simulation). Using two frontal contact sensors and four movement actions (stop, forward, left-turn and right-turn), the robot will learn how to wander dodging obstacles after touching them. As an example, it is expected that in a state corresponding to a collision with the left front contact, the robot learns that turning right is the action most likely to lead to its optimal policy. A numerical computing environment (Octave/Matlab) will be used to achieve this milestone, analyzing and tuning the Q-learning parameters. The aim of this first stage is to obtain an algorithm that learns the optimal policy in a relatively low number of steps in most experiments. The findings will be used as a reference in subsequent studies.

2. Simple wandering task learning implementation in the Lego Mindstorms NXT. This stage requires an adaptation of the Q-learning algorithm so that it can be processed by a microcontroller. It implies the usage of fixed-point or floating-point numbers.
Thus, it will be necessary to perform an analysis of the effects of this on our learning process because of the loss of precision in the Q-matrix values. The milestone here is that the robot learns the optimal policy in a relatively small period of time, so that we can perform a comparative study between simulated and experimental data. The implementation of some auxiliary tools is required for proper experimentation: debugging routines, datalogs, etc.

3. Shaping from the learning of the basic wandering task to the final wandering task by adding an ultrasonic sensor and incrementing the number of states. This will be gradually implemented following the previous milestones in each shaping step.

1.3 Methodology

The solution developed in this work focuses on simplicity and applicability. Besides, we will emphasize having enough flexibility to adapt the resulting techniques to learn other tasks on different systems that share the limitations found in this study. Off-line simulations will be implemented as scripts and functions in Octave/Matlab, while real experiments will be developed in the NXC (Not eXactly C) programming language for the robot. The employment of the basic differential vehicle configuration and the NXC language will allow us to maintain the framework generally used in the rest of the courses of the Master in Mechatronics Engineering.

1.4 Scheduling

This thesis has been developed following a number of stages:

1. Q-learning theoretical and related work study.
2. Simple wandering task learning (simulation).
3. Simple wandering task learning implementation on the Lego Mindstorms NXT robot.
4. Complex wandering task learning.

1.5 Materials

The hardware and software used in this thesis are enumerated here:

- Robot Lego Mindstorms NXT (Kit NXT Education Base Set 9797): 32-bit ARM7 CPU at 48 MHz, firmware v1.31, 256 KB flash memory (program), 64 KB RAM (variables) (see figure 1.1).
- Personal computer: Intel Core 2 Duo CPU at 2.26 GHz. OS: Ubuntu / Windows XP.
- Octave / Matlab 7.10 (R2010a).
- NBC r4 and NeXTTool / Brick Command Center v3.3 (build ).

Figure 1.1: Robot Lego Mindstorms NXT.

1.6 Content of the thesis

The content of this work is summarized as follows. Chapter 2 provides an introduction to reinforcement learning and the Q-learning algorithm, along with a review of the state of the art related to the present proposal. The developed work begins in chapter 3 with the study of the parameters involved in the process of learning a simple, simulated obstacle-avoidance wandering task. Chapter 4 translates the results of the previous chapter to the robot and performs a comparative analysis between the Q-learning simulation and the real robot implementation. Knowing the implications and limitations of the Q-learning algorithm in a simple wandering task, chapter 5 shows how the previously tuned learning method behaves when learning more complex tasks; this will be done by adding new states and actions. Chapter 6 contains the conclusions of this thesis and some proposed further work. Appendix A shows some relevant implementation details which improve the performance of the experiments with the robot. Finally, appendices B and C collect all the code written for this thesis: the Octave/Matlab simulation scripts and the robot applications written in NXC.

Chapter 2 Preliminary study. Reinforcement learning and the Q-learning algorithm

This chapter is an introduction to reinforcement learning and the Q-learning algorithm, which are the basic approach of this work. A description of the Q-learning parameters and a review of the state of the art in these techniques are included.

2.1 A brief introduction

An introduction and practical approach to the reinforcement learning problem and the Q-learning algorithm make up this section. Basic statistics and probability theory concepts are required for a correct understanding of this topic. For a more detailed introduction to reinforcement learning, we suggest [8] and [7].

We begin by introducing the concept of decision-making under uncertainty as a cognitive process capable of selecting a set of actions that we expect will lead to specific outcomes. Examples of the application of these processes can be found in economic and military decisions, for instance. The decision-making concept distinguishes between stochastic and deterministic processes, according to whether there is randomness at some point or not. Another distinction of special interest for our work concerns the availability of a model of the system, resulting in model-based or model-free decision-making processes.

2.1.1 Model-based decision-making processes

To begin with, it is necessary to introduce the concept of Markov Decision Process (MDP), defined as a Dynamic Bayesian Network (DBN) [19] whose nodes define the system states, and whose arcs define the actions along with their associated rewards (see fig. 2.1). Formally, an MDP is a tuple (S, A, T, R), where S is a finite set of states, A a finite set of actions, T the transition function, and R the reward function, defined as R : S × A × S → ℝ. Some relevant characteristics of MDPs are:

- The Markov property is applicable to these networks, meaning that every state has no history dependence except for its prior state:

P(s_k | a, s_{0:k-1}) = P(s_k | a, s_{k-1})    (2.1)

- When executing a sequence of actions, a special DBN called a Markov chain is obtained, where actions and rewards are no longer variable.

Figure 2.1: MDP robotic example.

- Each time an action is executed, stochasticity will be present in both the new state reached and the obtained reward.
- We assume total observability; consequently, the current state will always be perfectly known.

An important concept in machine learning with MDPs is that of a policy. We call a (stationary) policy a function π : S → A defining the action a to execute when the agent is in state s. Thus, a policy is the core of a decision-making process. The term policy value, or V^π, indicates the goodness of a policy measured through the expected rewards obtained when it is executed. Among the different implementations of policy values, we will use the total expected discounted reward for our reinforcement learning problem, in which the policy value is the expected reward gathered after infinite steps of the decision process, decreasing the importance of future rewards as we move forward in time. This can be expressed as:

V^π(s_0) = E[R(s_0, π(s_0), s_1)] + γ·E[R(s_1, π(s_1), s_2)] + γ²·E[R(s_2, π(s_2), s_3)] + ...    (2.2)

with γ ∈ (0, 1), resulting in an exponentially weighted average. A more succinct expression can be obtained by separating the first term in (2.2) in order to define Q^π:

Q^π(s,a) = Σ_{s'∈succ(s)} T(s,a,s')·E[R(s,a,s')] + γ·Σ_{s'∈succ(s)} T(s,a,s')·V^π(s')    (2.3)

It can be demonstrated that at least one optimal policy π* exists, that is, one with the greatest value V^π*. In the case that two or more optimal policies appear, their values will be identical. The following expressions will serve to calculate both the optimal policy value and the optimal policy:

V^π*(s_0) = max_a Q^π*(s_0, a)    (2.4)

π*(s_0) = argmax_a Q^π*(s_0, a)    (2.5)

These expressions, called the Bellman equations, can be used recursively for improving an arbitrary initial policy. Practical implementations involve an intermediate definition: the expected reward of executing action a when the agent is in state s:

R(s,a) = E[R(s,a,s')] = Σ_{s'∈succ(s)} T(s,a,s')·E[R(s,a,s')]    (2.6)
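To make the policy-value computation of equations (2.2)-(2.6) concrete, here is a minimal Octave/Matlab sketch added for illustration only; the two-state MDP and its numbers are invented, not taken from the thesis. It evaluates a fixed policy by repeatedly applying the expectation over T:

% Toy policy evaluation on an invented 2-state, 2-action MDP (illustration only)
T = zeros(2,2,2);                 % T(s,a,s'): transition probabilities
T(:,:,1) = [0.9 0.2; 0.5 0.1];    % probability of reaching s'=1 for each (s,a)
T(:,:,2) = [0.1 0.8; 0.5 0.9];    % probability of reaching s'=2 for each (s,a)
R = zeros(2,2,2);
R(:,:,1) = 1;                     % reaching s'=1 yields reward 1, s'=2 yields 0
gamma = 0.9;
pol = [1 2];                      % fixed policy: action 1 in s1, action 2 in s2
V = zeros(2,1);
for k = 1:200                     % V(s) = sum_s' T(s,pi(s),s')*(E[R] + gamma*V(s'))
  for s = 1:2
    a = pol(s);
    Vnew(s) = squeeze(T(s,a,:))' * (squeeze(R(s,a,:)) + gamma*V);
  end
  V = Vnew(:);
end
disp(V')                          % V^pi(s) for s = 1,2

Each iteration applies (2.3) for the action chosen by the policy, so the loop converges to the discounted policy value of (2.2).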

Simplifying (2.3) through (2.6):

Q^π(s,a) = R(s,a) + γ·Σ_{s'∈succ(s)} T(s,a,s')·V^π(s')    (2.7)

Therefore,

V^π*(s_0) = max_a [ R(s_0,a) + γ·Σ_{s'∈succ(s_0)} T(s_0,a,s')·V^π*(s') ]    (2.8)

Classic algorithms for finding optimal policies using the Bellman equations are Value iteration and Policy iteration, which converge to the optimal policy up to a certain error. Unfortunately, there is no guarantee that either of these algorithms converges in a short time. This is one of the reasons for the growing interest in improving these algorithms for their use in robots. The value iteration algorithm is:

1. ∀a, s:  Q_k(s,a) = R(s,a) + γ·Σ_{s'} T(s,a,s')·V_{k-1}(s')
2. ∀s:  V_k(s) = max_a Q_k(s,a),  π_k(s) = argmax_a Q_k(s,a)
3. Repeat the loop until |V_k(s) − V_{k-1}(s)| < ε, ∀s

2.1.2 Model-free decision-making processes. The Q-learning algorithm

Equation (2.6) can also be expressed as:

R(s,a) = E_{s'∈succ(s)}[R(s,a,s')] ≡ E_{s'}[R(s'|s,a)]    (2.9)

Similarly:

Σ_{s'∈succ(s)} γ·T(s,a,s')·V^π(s') ≡ γ·E_{s'}[V^π(s')]    (2.10)

Hence, Q_k(s,a) has the same behavior as an expectation:

Q_k(s,a) = E_{s'}[R(s'|s,a)] + γ·E_{s'}[V^π(s')]    (2.11)

We will use this expression for deciding what to do when the model is not available. In that case, there is no information about the transition function T(s,a,s'), but we can estimate Q_k(s,a) by modifying the value iteration algorithm like this:

V̂_k(s) = max_a Q̂_k(s,a)    (2.12)

π̂_k(s) = argmax_a Q̂_k(s,a)    (2.13)

Q̂_k(s,a) = E_{s'}[ R(s'|s,a) + γ·V̂_{k-1}(s') ]    (2.14)

Q̂_k(s,a) can be obtained as an average of some observations gathered from the real system:

Q̂_k(s,a) = (Σ_i O_i) / n,   with  O_i = R(s'|s,a) + γ·V̂_{k-1}(s')    (2.15)

Along with the transition function T(s,a,s'), the complete reward function R(s,a) is not available in a model-free decision-making process. In other words, the lack of model and reward knowledge is solved by making observations. This is the basis of the reinforcement learning concept. Expression (2.15) is similar to a batch sample average. However, a reinforcement learning problem is a continuous process evolving from an initial policy to a pseudo-optimal policy by sequentially executing actions, making observations, and updating its policy values.

Therefore, a recursive or sequential estimation of the average must be employed instead, based on:

ν_k = (1/k)·Σ_{i=1}^{k} O_i    (2.16)

ν_{k-1} = (1/(k−1))·Σ_{i=1}^{k−1} O_i = (1/(k−1))·(Σ_{i=1}^{k} O_i − O_k) = (1/(k−1))·(k·ν_k − O_k)  ⇒  ν_k = ((k−1)/k)·ν_{k-1} + (1/k)·O_k    (2.17)

Defining α_k = 1/k:

ν_k = (1 − α_k)·ν_{k-1} + α_k·O_k    (2.18)

Finally, applying this expression for ν_k to Q̂_k(s,a):

Q̂_k(s,a) = (1 − α_k)·Q̂_{k-1}(s,a) + α_k·[ R(s,a,s') + γ·V_{k-1}(s') ]    (2.19)

with V_{k-1}(s') = max_{a'} Q̂_{k-1}(s',a') according to the Bellman equations. Equation (2.19) represents the general form of the Q-learning algorithm, and it will be used repeatedly throughout this work.

2.2 Q-learning: The algorithm

A practical Q-learning algorithm structure employed in this thesis is shown in figure 2.2. As stated before, the success of a reinforcement learning process is subject to the accurate choice of its parameters. Here we summarize these parameters; a more detailed explanation of this topic can be found in [10].

- Reward. The task we want an agent to learn in a reinforcement learning problem is defined by assigning proper rewards as a function of the current state, the action executed, and the new reached state.
- Exploitation/exploration strategy. It decides whether the agent should exploit its current learned policy, or experiment with unknown actions, at each learning step.
- Learning rate (α). It establishes how new knowledge influences the global learning process. When α = 1, only brand new knowledge is taken into account (that of the current step). On the contrary, when α = 0, new actions do not affect the learning process at all (no learning).
- Discount rate (γ). It regulates the look-ahead of the learning process by determining how much the current reward will influence the Q-matrix values in the next steps. When γ = 0, only immediate rewards are taken into account, hence our agent will only be able to learn strategies with a sequence of one step. On the other hand, when γ is close to 1, the learning process will allow strategies with larger sequences of steps, although that involves a longer learning process for obtaining a reasonable policy.
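As a quick numerical check (a sketch added here, not one of the thesis scripts), the following Octave/Matlab fragment verifies that the sequential update (2.18) with α_k = 1/k reproduces the batch average (2.16) of the observations:

% Verify that the recursive update (2.18) with alpha_k = 1/k
% equals the batch sample average (2.16)
O = rand(1, 1000);            % arbitrary sequence of observations O_i
nu = 0;                       % recursive estimate
for k = 1:length(O)
  alpha_k = 1 / k;
  nu = (1 - alpha_k) * nu + alpha_k * O(k);   % eq. (2.18)
end
batch = mean(O);              % eq. (2.16)
fprintf('recursive = %.6f, batch = %.6f\n', nu, batch);  % both coincide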

% Q-learning algorithm parameters
N_STATES, N_ACTIONS, INITIAL_POLICY, GAMMA
% Experiment parameters
N_STEPS, STEP_TIME
% System parameters
% Variables
s, a, sp     % (s, a, s')
R            % Obtained reward
alpha        % Learning rate parameter
Policy       % Current policy
V            % Value function
Q            % Q-matrix
step         % Current step

% Initial state
V = 0, Q = 0, Policy = INITIAL_POLICY
s = Observe_state()                  % sensor-based observation

for step = 1:N_STEPS                 % Main loop
    a = Exploitation_exploration_strategy()
    robot_execute_action(a), wait(STEP_TIME)
    sp = observe_state(), R = obtain_reward()
    Q(s,a) = (1-alpha)*Q(s,a) + alpha*(R + GAMMA*V(sp))
    V(s)_update()
    Policy(s)_update()
    alpha_update()
    s = sp                           % update current state
end % Main loop

Figure 2.2: Q-learning algorithm.

2.3 Q-learning practical implementations on Lego Mindstorms NXT robots

A brief review of relevant scientific publications about implementations of Q-learning on Lego Mindstorms NXT robots follows:

- Obstacle avoidance task learning [13]: this paper presents a design with 36 encoding states and 3 actions, employing 1 ultrasonic and 2 contact sensors. Programming language: LeJOS. A comparative study reports better results for traditional Q-learning than for Q-learning with a Self-Organizing Map.

- Line follower task learning [14]: Q-learning algorithm implemented in Matlab using USB communication with the robot. Definition of 4 states encoding 2 light sensors, and 3 available actions. This design includes an ultrasonic sensor for reactive behavior only.

- Walking and line follower tasks learning [15]: SARSA algorithm on the NXT robot through the LegOS programming language. A four-legged mounting moved by 2 servomotors and 1 light sensor aiming at a grey-gradient floor for the walking task; a four-wheel vehicle setup and 2 light sensors for the line follower task.

- Phototaxis behavior [16]: implementation of three different layers on the NXT robot: low-level navigation, exploration heuristic, and Q-learning layer. Almost all functions of the robot are executed on a computer communicating with the robot via Bluetooth. This setup includes 11 encoding states based on 2 light and 2 contact sensors, and 8 actions representing wheel power levels.

Other related works on Q-learning with the NXT robot can be found in several project proposals for robotics courses, including wandering, wall following, pursuit, and maintaining a straight path. All these works are focused on learning very specific tasks, usually taking advantage of heuristics or other techniques to improve the learning process.

2.4 Other reinforcement learning techniques

Although advanced reinforcement learning techniques have been left out of the scope of this work due to the limitations of the Lego robot, it is important to mention that the majority of publications on this topic cover techniques also belonging to the temporal difference (TD) learning methods [8]. These methods are based on estimations of the final reward for each state, which are updated in every iteration of the learning process. Along with temporal difference learning techniques, works including Monte Carlo methods, dynamic programming and exhaustive search [8] can easily be found in reinforcement learning studies. All these techniques have a solid theoretical basis and have been addressed by many researchers. Also, there is a growing interest in extending and combining these techniques, resulting in new methods such as hierarchical Q-learning, neural Q-learning, RL with partially observable Markov decision processes (POMDP), Peng and Naïve (McGovern) Q-learning, policy gradient RL, emotionally motivated RL, quantum-inspired RL, quantum-parallelized hierarchical RL, combinations of Monte Carlo and TD methods, curiosity and RL, and actor-critic methods [7] [22] [2].

Chapter 3 Simple wandering task (I). Design and Simulation

This chapter addresses the study of the Q-learning algorithm in a simulated scenario for a simple obstacle-avoidance wandering task, from now on called wander-1. It includes the particular robot configuration, the model of the system, and a sensitivity analysis of the parameters of the Q-learning algorithm. The work described here will be used as a reference for the real robot implementation in the next chapter.

3.1 Mobile robot configuration

As discussed in the introduction chapter, we have employed the standard differential-drive mobile configuration suggested in the Lego Mindstorms NXT 9797 kit as the base assembly for our robot. This setup includes two driving wheels separated by 11 cm, a caster wheel, and an ultrasonic sensor or sonar (see fig. 3.1). Since the 9797 kit only provides one sonar for obstacle detection, our proposal includes adding two contact sensors located on the front side of the robot (see fig. 3.2). These sensors will allow the learning process to find out the most convenient turning direction to execute in order to dodge an object after colliding with it. Besides, the sonar will only be able to detect objects located just in front of the robot; the contact sensors can mitigate this problem. Using the 9797 kit pieces, we have built a solid two-contact bumper accessory which can be easily coupled at the front of the basic setup, in place of the light sensor shown in the 9797 kit guide (see fig. 3.3). This configuration has been employed in all the tasks developed in this work, although the ultrasonic sensor has not been necessary for the wander-1 learning described in chapters 3 and 4.

3.2 Task definition: 4 states + 4 actions

To achieve the main objective of this thesis efficiently, we begin by studying the reinforcement learning of a simple wandering task, wander-1, using a reduced number of states and actions. This configuration will allow us to implement an off-line learning process model so that the Q-learning algorithm parameters can be easily separated and analyzed. With 2 frontal contact sensors and 4 movement actions (stop, left-turn, right-turn and move forward), we intend the robot to learn how to wander dodging obstacles after touching them (see table 3.1). As stated in the introduction chapter, we expect that in a state with a left front contact activated, the robot learns that turning to the right is the action that will lead to its optimal policy. The ultrasonic sensor is not needed in the wander-1 design.

Figure 3.1: NXT basic mobile configuration.
Figure 3.2: NXT two-contact bumper.
Figure 3.3: NXT final setup.

Task objective: wandering, evading obstacles after touching them.

States:
  s1 - no contact pressed
  s2 - left bumper contact pressed
  s3 - right bumper contact pressed
  s4 - both contacts pressed

Actions:
  a1 - stop
  a2 - left-turn
  a3 - right-turn
  a4 - move forward

Table 3.1: Wander-1 task description.

3.3 Models

An estimation of the rewards and the transition function is necessary so as to simulate a realistic reinforcement learning process. Both the reward and the transition functions described below are based on preliminary tests performed on the real robot.

3.3.1 Reward function

We will take advantage of the servomotor encoders of the wheels of the Lego robot to calculate the rewards: in each step, we will measure the relative increase of the encoders, assigning a larger reward when both wheels have rotated forward by a greater angle (above a fixed threshold).

We intended this simple reward to be enough for the robot to learn the obstacle-avoidance wandering task: the lack of forward motion, which happens when the robot collides with a wall or an object, is interpreted as an absence of positive reward, leading the learning process to select another action. However, preliminary implementation tests showed that, once the robot collided with obstacles fixed to the ground, the chance of keeping both wheels rotating forward, sliding on the floor while receiving positive reward, was so high that we were forced to abandon this single criterion. We decided then to add a soft penalization when one frontal contact is activated, and a large penalization in case both contacts are on. In those situations, any positive reward due to the servomotor encoder measurements is rejected. The final reward function of the wander-1 learning simulation has been implemented as shown in the Octave/Matlab code of fig. 3.4.

function R = obtainReward_simpleTask(s,a,sp)
% Input:  s: state (not used here), a: action, sp: new reached state
% Output: gathered reward R
R = 0;
switch sp
    case 1   % No contact: R depends directly on the executed action
        switch a
            case 1
                R = 0;      % Robot stopped: no reward
            case 2
                R = 0.1;    % Turn: small reward
            case 3
                R = 0.1;    % Turn: small reward
            case 4
                R = 1;      % Forward motion: large reward
        end
    case 2   % Left bumper contact: small penalization
        R = -0.2;
    case 3   % Right bumper contact: small penalization
        R = -0.2;
    case 4   % Both bumper contacts: large penalization
        R = -0.7;
end

Figure 3.4: Reward function.

The reason for the particular values assigned to R will be explained in section 3.4.1.

3.3.2 Transition function and system model

In order to perform any off-line simulation, we need some information about the system and the environment. Regarding the environment, a 70x70 cm enclosure will be used for the wander-1 task. The off-line model has been implemented as a function that simulates the execution of the selected action a from the state s, resulting in a new reached state s'. This is the transition function T(s,a,s') of the system. The simulations used a conceptual model, which was later replaced by a more detailed frequentist model based on the data collected by the robot. Both models were used and studied separately in the wander-1 simulation.
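To make the role of T(s,a,s') in the off-line model concrete, a minimal Octave/Matlab sketch of such a simulator could draw the new state from the probabilities P(s'|s,a). This is only an illustrative stand-in for the simulator functions of appendices B.2 and B.3; the function name is made up here:

% Illustrative stand-in for the off-line simulator: given the current state s,
% the executed action a, and the transition model T (N_STATES x N_ACTIONS x
% N_STATES array, with T(s,a,sp) = P(s'|s,a)), draw the new state sp at random.
function sp = simulate_from_T(s, a, T)
  p = squeeze(T(s, a, :));      % P(s'|s,a) for every possible new state
  c = cumsum(p(:));
  c(end) = 1;                   % guard against rounding in the last bin
  sp = find(rand() <= c, 1);    % inverse-transform sampling over s'
end

For the wander-1 task, T is either the conceptual chart described next (figure 3.5) or the frequentist estimate built from the robot's data.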

A: Conceptual model

In this model the transition function is the result of a preliminary and simplified analysis performed on the robot within its environment, by observing its general behavior and extracting intuitive deductions about it. Figure 3.5 shows the transition function as a conditional probability chart in which each cell represents P(s'|s,a).

Figure 3.5: Wander-1 transition function T(s,a,s') represented as P(s'|s,a) in the conceptual model.

As an example, when the simulated robot is in state 3 (right bumper contact) and it executes action 3 (right-turn), the probability of reaching s' = 4 (both bumper contacts) has been considered to be 0.9. The simulator code of appendix B.2 implements the above transition function in an intuitive way so as to allow minor adjustments in case we want to explore some states more deeply. This code was later replaced by a generic algorithm that generates the transition function from frequentist data obtained by the robot in the real environment.

B: Frequentist model

The NXC code of appendix C.5 was written for collecting the data necessary to design a more realistic frequentist model of the robot in its environment. At each step, an action is selected according to a discrete uniform distribution. After compiling and running C.5 in the 70x70 cm environment, we obtained the file listed in appendix B.1, containing the variable data(s,a,s') with the statistical information of the system. As an example, data(1,2,3) = 12 means that from state s1 the action a2 led to s3 up to 12 times. A simple algorithm, listed in appendix B.3, was then used to generate the transition function T(s'|s,a) from this statistical data, shown in figure 3.6.

The differences between the conceptual and frequentist models are shown in figure 3.7. Logically, the frequentist model offers a closer approximation to the real situation than the conceptual one. Moreover, 30% of the values of the transition function differ by more than 0.2 (up to 0.85) when comparing both models. Even though in this first study we are dealing with a basic task in a simple environment, we have proven that an intuitive conceptual model can be completely inaccurate. Section 3.4 will demonstrate that both models result in different Q-learning parameters for learning the same task.
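The frequentist model itself is essentially a row-wise normalization of these counts. A small Octave/Matlab sketch of that step follows; it is an illustrative version of the algorithm listed in appendix B.3 (it assumes the data(s,a,s') array of appendix B.1 has already been loaded, and the uniform fallback for unvisited state-action pairs is an assumption of this sketch, not taken from the thesis):

% Build the frequentist transition model from the counts gathered by the robot:
% data(s,a,sp) holds how many times action a executed in state s led to state sp.
% (data is assumed to be loaded from the file of appendix B.1)
N_STATES = 4;  N_ACTIONS = 4;
T = zeros(N_STATES, N_ACTIONS, N_STATES);
for s = 1:N_STATES
  for a = 1:N_ACTIONS
    counts = squeeze(data(s, a, :));
    total  = sum(counts);
    if total > 0
      T(s, a, :) = reshape(counts / total, 1, 1, N_STATES);  % relative frequencies
    else
      T(s, a, :) = 1 / N_STATES;    % unvisited (s,a): assumed uniform (sketch only)
    end
  end
end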

Figure 3.6: Wander-1 transition function T(s,a,s') represented as P(s'|s,a). Frequentist model.
Figure 3.7: T(s,a,s') differences between the conceptual and frequentist models.

3.4 Simulation. Q-learning parameters tuning

For implementing the simulation, we need to translate the Q-learning pseudo-code shown in section 2.2 into a script in our simulation language. The numerical computation platforms Octave and Matlab were employed in this work. Minor modifications have been made to the scripts and functions so as to maintain compatibility with both programming languages. Thus, every experiment can be reproduced by executing the Octave/Matlab scripts shown in this report without any change. The structure of the code is similar to the C programming language, without taking advantage of the matrix calculation capabilities of Octave/Matlab. This structure encourages comparative studies between Octave/Matlab and NXC, the language used in the robot.

The main loop containing the Q-learning algorithm implemented and employed in all simulations is shown in fig. 3.8. That code has been extracted from Q_learning_simpletask_Frequentist.m, listed in appendix B.5. Analogously, Q_learning_simpletask_Conceptual.m is shown in appendix B.4. We call the simulate_simpletask(s,a) and obtainReward_simpleTask(s,a,sp) functions, both described in the previous section.

The next step concerns the adjustment of the parameters of the learning process. Since any Q-learning parameter can modify the behavior of the rest, we have settled the reward values and the exploitation-exploration strategy before tuning the learning rate α and discount rate γ. This procedure allows us to obtain the optimal values of α and γ by performing experiments under the same conditions, once a valid strategy that explores all the Q-matrix cells and a suitable set of rewards that satisfy the discussed requirements have been settled.

for step = 1:N_STEPS

    % selectaction (Exploitation/exploration strategy)
    if ceil(unifrnd(0,100)) > 70        % simple e-greedy
        a = ceil(unifrnd(0,4));
    else
        a = Policy(s);
    end

    % executeaction(), wait() and observestate() simulated:
    sp = simulatesimpletask_FrequentistModel(N_STATES,s,a,T);
    R = obtainReward_simpleTask(s,a,sp);

    % update Q-matrix
    Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (R + GAMMA * V(sp));

    % V_update(), Policy_update(). Refer to equations (2.4) and (2.5)
    V(s) = Q(s,1); Policy(s) = 1;
    for i = 2:N_ACTIONS
        if (Q(s,i) > V(s))
            V(s) = Q(s,i);
            Policy(s) = i;
        end
    end

    s = sp;   % update state
end

Figure 3.8: Q-learning algorithm used in all simulations.

3.4.1 Reward tuning

The rewards assigned to the wander-1 learning process were normalized in [0,1] for a better analysis of the variations of the Q-matrix values. Besides, this will help to overcome the limitations of the robot related to the fixed-point numeric system employed, as discussed later on. Since the objective of the task is defined by the reward function, we must be careful in the choice of the set of instant rewards so as to avoid falling into local maxima [10]. As a result, the highest reward, obtained when the robot moves forward without colliding, is selected to be ten times higher than the reward assigned when the robot turns. Preliminary tests show that as we increase this secondary reward, the chance of learning a turning policy increases, even when no obstacles are found in the path of the robot. After a set of simulations, the reward values of table 3.2 were chosen. This set of normalized rewards was tuned to obtain the optimal policy in all the tests. As an example, although minor variations of these values can also be valid, simulations with R=0.2 for the event "turned without colliding" sometimes resulted in non-optimal turning policies.

3.4.2 Exploitation-exploration strategy

We have selected a fixed ε-greedy strategy, meaning that the agent will usually exploit the best policy learned at each step, but with a probability of ε of exploring any action randomly.

R      Event
 1     moved forward (no collision)
 0.1   turned (no collision)
-0.2   one bumper collided
-0.7   both bumpers collided

Table 3.2: Adjustment of rewards.

The selected value ε = 30% fits the wander-1 learning process, allowing the robot to explore all the possible combinations of states and actions in all the tests. Improvements of this strategy will be discussed later.

3.4.3 Learning rate and discount rate tuning

In chapter 2 it was shown that the theoretical learning rate of the Q-learning algorithm should be α_k = 1/k, with k the current step, so as to approximate an average estimation. However, preliminary simulations, and especially real robot tests, showed that the number of steps needed to learn an accurate policy had a strong dependence on the first states explored, sometimes resulting in the inability to learn a simple policy even after hours. Therefore we have chosen a different decreasing profile for α_k. On the other hand, the CPU of the robot will limit the number of decimal places we can employ for our variables. We will analyze and justify this topic in the next chapter. The final values used for α_k in this study are:

α = 0.01, 0.02, 0.05, 0.1, 0.2, 0.4    (3.1)

Analogously, the discount rate values used to study different depths in the look-ahead strategy are:

γ = 0, 0.2, 0.4, 0.6, 0.8, 0.9    (3.2)

We have found that values beyond γ = 0.9 can cause numerical overflow problems in the robot. This topic will also be discussed in the next chapter. Other relevant features of the simulations described in this chapter are summarized in the following:

- The task to learn is so simple that it allows us to predict the optimal policy before running any experiment. It will be the one shown in table 3.3.

State                 Action
s1 (no obstacle)      a4 (move forward)
s2 (left contact)     a3 (right-turn)
s3 (right contact)    a2 (left-turn)
s4 (both contacts)    a2 or a3

Table 3.3: Wander-1 optimal policy.

- We will use, as the main indicator of the goodness of the learning process, the latest iteration in which the agent passed from a non-optimal policy to an optimal one. An additional custom Q-estimator has been defined as the ratio between the optimal value and the second highest value of the row of the Q-matrix corresponding to state s1 (no contact). We have selected this indicator after checking that almost all failed tests learned a policy different from a4 (moving forward) in s1, due to the proximity of their Q-matrix values.

- Each simulation has been performed with a specific selection of learning rate α and discount rate γ, making a total of 36 simulations. Each simulation consisted of 40 experiments, corresponding to 40 complete learning processes. Each learning process was conducted for 2000 steps. The resulting indicators shown here correspond to the average of the indicators obtained over the 40 experiments.

The Octave/Matlab script of appendix B.4 was used for simulating the conceptual model. The results are shown in figures 3.9 and 3.10. Likewise, the results of running the simulation with the frequentist model, listed in appendix B.5, are shown in figures 3.11 and 3.12.

Figure 3.9: Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptual model.
Figure 3.10: Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptual model.
Figure 3.11: Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model.

Figure 3.12: Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model.

According to the number of steps needed for learning the optimal policy, simulations on the conceptual model showed that there is a large window for choosing the discount rate (γ ≥ 0.2) and the learning rate (α ≤ 0.1) so as to obtain the optimal policy efficiently. The additional Q-estimator indicator showed that γ ≥ 0.8 combined with α ≤ 0.05 resulted in Q-matrix values so similar to each other that minor losses of precision could lead to non-optimal policies. Hence, we have to take this indicator into account when working on the robot.

As for the more realistic frequentist model, simulations narrowed the above windows, requiring γ ≥ 0.6, and revealed some patterns in this numeric sensitivity study that had not appeared before: according to the number of steps needed for obtaining an optimal learning, we highlight the presence of two well-distinguished minima at [α = 0.02, γ = 0.8] and [α = 0.02, γ = 0.9]. Since the frequentist model is based on real robot and environment data, collected with a step time of 250 ms (appendix C.5), the simulation gave a reference of 65 seconds (260 steps) as the lowest average time needed for learning the optimal policy (remember that each value is an average of 40 experiments). This value, much lower than we expected, and hard to improve, will be a demanding reference in later experimentation with the robot.

The above simulations were repeated many more times, obtaining similar evidence. The results let us choose between γ = 0.8 or γ = 0.9 interchangeably. However, as we are interested in learning more complex tasks, we prefer working with higher discount rates, which will allow the robot to have a longer look-ahead horizon in its strategies. To sum up, we will select the parameters of table 3.4 as the starting point in the implementation of the Q-learning algorithm for the wander-1 task in the real robot.

Parameter                   Value
Rewards                     1, 0.1, -0.2, -0.7
Exploitation/exploration    ε-greedy, 30% exploration
Learning rate α             0.02
Discount rate γ             0.9

Table 3.4: Best Q-learning parameters selected after simulation tests.
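A minimal Octave/Matlab sketch of this tuning protocol is given below. It is not one of the thesis scripts (those are listed in appendices B.4 and B.5); run_wander1_learning() is a hypothetical stand-in for one complete learning process, and the indicator handling is simplified:

% Hypothetical sketch of the alpha-gamma sweep (simplified)
alphas = [0.01 0.02 0.05 0.1 0.2 0.4];
gammas = [0 0.2 0.4 0.6 0.8 0.9];
N_EXPERIMENTS = 40;  N_STEPS = 2000;
optimal_policy = [4 3 2 2];          % table 3.3 (s4 admits a2 or a3; a2 used here)
last_switch = zeros(numel(alphas), numel(gammas));
for ia = 1:numel(alphas)
  for ig = 1:numel(gammas)
    steps = zeros(1, N_EXPERIMENTS);
    for e = 1:N_EXPERIMENTS
      % run_wander1_learning() stands for one complete learning process
      % (e.g. the loop of fig. 3.8); it is assumed to return the policy
      % learned at every step as an N_STEPS x 4 matrix
      policies = run_wander1_learning(alphas(ia), gammas(ig), N_STEPS);
      idx = find(any(policies ~= repmat(optimal_policy, N_STEPS, 1), 2), 1, 'last');
      if isempty(idx), idx = 0; end  % policy was optimal from the first step
      steps(e) = idx + 1;            % last switch to the optimal policy
    end
    last_switch(ia, ig) = mean(steps);
  end
end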

Indicator                                 Result
Success in learning the optimal policy    40 of 40 experiments (100%)
Average number of steps needed            260 (69 seconds)

Table 3.5: Simulation results with the selected parameters. Frequentist model.

Chapter 4 Simple wandering task (II): Robot implementation

The present chapter describes the transition between the Q-learning simulation and the real robot implementation. After describing a library and the main program for the robot, this chapter proceeds with an analysis of the limitations of the platform when using this learning method. We restrict the parameter values so as to avoid overflows and relevant losses of precision. In the last section we show the results of the wander-1 learning process in the robot.

4.1 Translation of simulation code into NXC

First of all, the most relevant difference between a real Q-learning implementation and the simulation, to keep in mind when working on the robot, is time. Preliminary tests were performed with a step time of one second, meaning that the robot executed the selected action during that time at each step; then, it observes the reached state and gets the associated reward. The CPU of the robot (48 MHz) allows us to neglect the step execution time, making it unnecessary to stop the movement after each action and thus favoring a smoother movement. After a few tests, we decided to lower the step time to 250 ms to speed up experimentation.

The following software was employed when working with the robot:

- NBC r4 and NeXTTool under Ubuntu.
- Brick Command Center v3.3 (build ) under Windows XP (just for verifying that the tests presented in this work can be reproduced on other platforms without making any change in the source code).

Additional techniques developed to improve the experimentation with the robot are described in appendix A and summarized in table 4.1.

Concept                                                 Section
Debugging, compilation and NXT communications           A.1
Datalogs (Q-matrix, learned policy...)                  A.2
Brick keys events                                       A.3
Online information through the display & loudspeaker   A.4

Table 4.1: Additional implemented techniques for experimenting with the Lego robot.

The programs for the robot have been written in Not eXactly C (NXC), a programming language created by John Hansen and strongly based on C [20]. Two files were developed with further implementations on other systems in mind: <MainProgram>.nxc and the library NXT_io.h, briefly described in the following subsections.

4.1.1 NXT_io library

NXT_io.h has been written in this work to isolate the hardware from the Q-learning algorithm; it is a library specific to the NXT robot and hardly portable to other systems. This library also includes some constants and functions needed in the learning process implemented in this work, so it has been employed extensively. The only parameter of the library to be modified in our experiments is SPEED, the power of the driving wheels, as will be discussed later. The complete code can be consulted in appendix C.1, and its content is enumerated in table 4.2.

Table 4.2: Content of the NXT_io.h library.
    NXT actuator parameters: SPEED, L_WHEEL, LEFT_BUMPER...
    NXT loudspeaker parameters: VOL, notes (C5, A5...), note lengths (HALF, QUARTER...)
    void NXT_mapsensors(void)   /* sonar and contact sensors mapping */
    byte getsonardistance(void)
    void executenxt(string command)   /* stop, turnright... */
    void showmessage(const string &m, byte row, bool wait)
    NXTcheckMemory(long memoryneeded, string FILE)
    bool pauseresumebutton(void), exploitepolicybutton(void), saveandstopbutton(void)
    void pausenxt(void)

4.1.2 Main program

The main program for the robot contains, among other functions and characteristics, the Q-learning algorithm, its parameters, and the definition of the states and actions. This program can be easily translated into other programming languages, and it will be modified in each learning task. The whole code, simple_task_learning.nxc, the core of this section, is shown in appendix C.2. Here, the most relevant issues involved in the Q-learning algorithm in NXC are described. The main loop, where the learning process takes place, is shown in fig. 4.1. Zero-valued indexes have been omitted in the states/actions codification for a better compatibility with the Octave/Matlab code.

The NXT is a microcontroller-based robot without an FPU (Floating-Point Unit). Therefore, the NXC programming language does not support the real numbers needed in the learning algorithm. This is why the expression that updates Q in figure 4.1 has been translated from the theoretical equation (2.19) used in the Octave/Matlab simulations, repeated here for convenience:

    Q(s, a) = (1 − α)·Q(s, a) + α·[R + γ·V(sp)]        (Octave/Matlab)    (4.1)

to:

    Q_FP[s][a] = ((FP − α_FP)·Q_FP[s][a])/FP + (α_FP·(R_FP + (γ_FP·V_FP[sp])/FP))/FP    (4.2)
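Before looking at the NXC main loop in fig. 4.1 below, the following small, self-contained C program (our illustration, not the thesis code, with made-up example values) runs a single update with this fixed-point scheme and compares it against the ordinary floating-point result:

    #include <stdio.h>

    /* Minimal sketch of one fixed-point Q-learning update with scaling factor FP,
     * mirroring eq. (4.2). All quantities are plain integers already multiplied
     * by FP; every product is divided by FP afterwards to stay in the same scale.
     * The numeric values are illustrative only. */
    #define FP 10000L

    int main(void)
    {
        long alpha = 200;      /* 0.02 * FP */
        long gamma = 9000;     /* 0.9  * FP */
        long R     = FP;       /* reward 1.0 in fixed point */
        long Q_sa  = 3 * FP;   /* example current Q(s,a) = 3.0 */
        long V_sp  = 5 * FP;   /* example current V(sp)  = 5.0 */

        /* Q(s,a) <- (1 - alpha)*Q(s,a) + alpha*(R + gamma*V(sp)) */
        long Q_new = ((FP - alpha) * Q_sa) / FP
                   + (alpha * (R + (gamma * V_sp) / FP)) / FP;

        printf("Q_new (fixed point) = %ld -> %.4f\n", Q_new, (double)Q_new / FP);

        /* Floating-point reference for comparison */
        double q_ref = (1.0 - 0.02) * 3.0 + 0.02 * (1.0 + 0.9 * 5.0);
        printf("Q_new (float ref)   = %.4f\n", q_ref);
        return 0;
    }

Both results equal 3.0500 here; with less favorable values the fixed-point result differs slightly because every division by FP truncates, which is precisely the precision issue analyzed in section 4.2.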

    for(step=1; step<N_STEPS+1; step++)
    {
        a = selectaction(s);      // Exploitation/exploration strategy
        executeaction(a);
        Wait(STEP_TIME);
        sp = observestate();
        R = obtainreward(s,a,sp);

        // Update Q-matrix
        Q[s][a] = ((FP-alpha)*Q[s][a])/FP + (alpha*(R+(gamma*V[sp])/FP))/FP;

        // Update V and Policy
        V[s] = Q[s][1];
        Policy[s] = 1;
        for(i=2; i<N_ACTIONS+1; i++)
        {
            if(Q[s][i] > V[s])
            {
                V[s] = Q[s][i];
                Policy[s] = i;
            }
        }

        s = sp;                   // Update state
    }

Figure 4.1: Q-learning algorithm in NXC.

where FP is a constant used to emulate real numbers with integers when using a fixed-point notation. In the next section we will show how this notation system, with scaling factor 1/10000 (FP = 10000), is valid for performing the experiments on the robot.

Since the reward R, the learning rate α, and the discount rate γ are real numbers, all of them must be expressed in our notation system (multiplied by FP). Q[s][a] and V[s] are represented in this notation too. Thus, writing x_FP = x·FP for every quantity, equation (4.2) can be expressed as:

    Q_FP[s][a] = ((FP − α_FP)·Q_FP[s][a])/FP + (α_FP·(R_FP + (γ_FP·V_FP[sp])/FP))/FP    (4.3)

The above expression (4.3) solves the computation of Q[s][a] in our fixed-point notation, in which each product must subsequently be divided by FP. The operations have been associated so as to minimize the impact of overflows, as will be discussed in the next section.

A short description of the functions called from the main loop of our program is given here:

- selectaction(s): the same ε-greedy function used in simulation.
- executeaction(a): calls the NXT_io servomotor functions: executenxt("forward")...
- observestate(): returns the codification of states based on sensor values, as shown in figure 4.2.
- obtainreward(s,a,sp): unlike the reward function used in simulation, and after several tests, we decided to add a practical function which does not need to be redefined after a change of states and/or actions. Based on the wheel encoders and contact sensors, the reward function was defined as shown in table 4.3 and implemented as in figure 4.3.

    byte observestate(void)
    // Returns the state of the robot by encoding the information measured from
    // the contact sensors. In case the number of states or their definitions
    // change, this function must be updated.
    // States discretization:
    //   s1: left_contact=0, right_contact=0
    //   s2: left_contact=1, right_contact=0
    //   s3: left_contact=0, right_contact=1
    //   s4: left_contact=1, right_contact=1
    {
        byte state;
        state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER);   // defined in NXT_io.h
        return(state);
    }

Figure 4.2: Function observestate() written in NXC.

    long obtainreward(byte s, byte a, byte sp)
    // Input:  s,a,sp are not used directly here, since we look at the motion of
    //         the wheels and the sensors (that would be a and s'). Encoders and
    //         contact sensors result in a better function R(s,a,s')
    // Output: obtained reward
    {
        long R;            // Fixed-point number
        long threshold;

        // Reference: 1 second at SPEED 100 gives above 700 degrees (full battery)
        threshold = 30;    // Valid for [40 < SPEED < 80]
        R = 0;

        if(MotorRotationCount(L_WHEEL) > threshold && MotorRotationCount(R_WHEEL) > threshold)
            R = FP;             // R = 1
        else if(MotorRotationCount(R_WHEEL) > threshold)
            R = FP / 10;        // R = 0.1
        else if(MotorRotationCount(L_WHEEL) > threshold)
            R = FP / 10;        // R = 0.1

        if(LEFT_BUMPER || RIGHT_BUMPER)
        {
            if(LEFT_BUMPER && RIGHT_BUMPER)
                R = -FP/2 - FP/5;   // R = -0.7
            else
                R = -FP/5;          // R = -0.2
        }

        ResetRotationCount(L_WHEEL);
        ResetRotationCount(R_WHEEL);
        return(R);
    }

Figure 4.3: Function obtainreward() written in NXC.

The code of fig. 4.3 generates the reward function without coding all the (s, a, s') combinations that take place in the learning process. Nevertheless, it depends on the variables (s, a, s'), resulting in a theoretically valid reward function. All the rewards are returned in fixed-point notation, as stated above. It is important to mention that the threshold for detecting whether or not a wheel moved forward was tuned and tested for wheel powers from 40 to 80. Outside this range, the above threshold should be readjusted.
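As a rough sanity check of that threshold, the following C sketch (ours, not thesis code) estimates the rotation expected in one 250 ms step from the reference of 700 degrees per second at SPEED 100; it assumes the wheel speed scales roughly linearly with the SPEED setting, which the thesis does not state explicitly:

    #include <stdio.h>

    /* Rough sanity check of the 30-degree encoder threshold (illustrative sketch).
     * Assumption: rotation per second scales ~linearly with the SPEED setting;
     * the only reference given in the text is "above 700 degrees in 1 s at
     * SPEED 100 (full battery)". */
    int main(void)
    {
        const double deg_per_s_at_100 = 700.0;  /* reference from the text       */
        const double step_s = 0.25;             /* 250 ms step time              */
        const int threshold = 30;               /* encoder threshold in degrees  */

        for (int speed = 40; speed <= 80; speed += 20) {
            double expected = deg_per_s_at_100 * (speed / 100.0) * step_s;
            printf("SPEED %2d: ~%5.1f degrees per step (threshold %d)\n",
                   speed, expected, threshold);
        }
        return 0;
    }

Under that assumption a forward-moving wheel turns roughly 70-140 degrees per step in the [40, 80] power range, comfortably above the 30-degree threshold, which is consistent with the validity range stated in the comment of fig. 4.3.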

Table 4.3: Rewards definition in the robot.
    R =  1   : both wheels moved forward above a threshold, no contact activated
    R =  0.1 : only one wheel moved forward above a threshold, no contact activated
    R = -0.2 : one contact activated
    R = -0.7 : both contacts activated

4.2 Q-learning algorithm with CPU limitations

This section analyzes the limitations of our learning process when working with the real robot. These limitations include overflows and precision losses in computation, and also storage capacity, which leads to a maximum number of states and actions that can be implemented without exceeding the memory of the robot.

4.2.1 Overflow analysis

In order to detect overflows, we must analyze the three factors involving fixed-point numbers in expression (4.3):

    (a) (FP − α_FP)·Q_FP[s][a]
    (b) γ_FP·V_FP[sp]
    (c) α_FP·(R_FP + (γ_FP·V_FP[sp])/FP)

In this work R has been normalized to 1 (R_FP ≤ FP) and the best learning processes have been obtained when α ≪ 1. Thus the greatest factor is approximately FP·Q_FP[s][a], from (a). Note that the same restriction is found in (b) in the case γ ≈ 1. On the other hand, since the NXT robot has a 32-bit CPU, NXC signed long variables can store values up to 2^31 − 1 = 2,147,483,647 = MaxLongNXC. This integer capacity will be limited in our notation system so that the stored value of any cell in Q must be ≤ MaxLongNXC/FP (fixed-point variables are already multiplied by FP). As an example, if we select FP = 10000, the greatest stored value involved in the learning should not exceed 214,748 to avoid overflows.

The largest Q-matrix values were obtained by simulating the previous frequentist model from chapter 3. The worst-case scenario occurs in the unlikely event that the system always gets the greatest reward at every step, since the higher the reward, the higher the Q-matrix values (refer to eq. (4.1)). We have run long simulations for the following values of α and γ:

    α = 0.01, 0.02, 0.05, 0.1, 0.2, 0.4        (4.4)
    γ = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99    (4.5)

The results of this analysis are summarized in table 4.4, where Q_max/(FP·R) represents the value to which the Q-matrix converges for each γ. Different values of the parameter α for a fixed γ resulted in the same value of convergence, so α has been left out of table 4.4.

Table 4.4: Limits in the Q-matrix values to avoid overflow when using a fixed-point notation system (Q_max/(FP·R) as a function of γ).
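In the worst case just described (the greatest reward received at every step), the Q-values converge to R/(1 − γ), which is consistent with the γ = 0.9 entry of table 4.4 used in the check below (Q_max/(FP·R) = 10). The following C sketch (ours, not thesis code) applies that bound, together with the 32-bit limit on the intermediate product FP·Q_FP, to flag overflowing (γ, R, FP) combinations:

    #include <stdio.h>

    /* Hedged sketch: overflow check for a (gamma, R, FP) choice on a 32-bit CPU.
     * Worst-case Q value: R/(1-gamma) (maximum reward at every step).
     * Largest intermediate product in the fixed-point update: ~ FP * Q_FP,
     * which must stay below 2^31 - 1. */
    #define MAX_LONG_NXC 2147483647.0

    static int overflows(double gamma, double R, double FP)
    {
        double q_max = R / (1.0 - gamma);   /* worst-case real Q value        */
        double q_fp  = q_max * FP;          /* value actually stored          */
        return q_fp > MAX_LONG_NXC / FP;    /* i.e. FP * q_fp > MaxLongNXC    */
    }

    int main(void)
    {
        printf("gamma=0.90, R=1, FP=10000  -> %s\n",
               overflows(0.90, 1.0, 10000.0)  ? "overflow" : "ok");   /* ok       */
        printf("gamma=0.96, R=1, FP=10000  -> %s\n",
               overflows(0.96, 1.0, 10000.0)  ? "overflow" : "ok");   /* overflow */
        printf("gamma=0.90, R=1, FP=100000 -> %s\n",
               overflows(0.90, 1.0, 100000.0) ? "overflow" : "ok");   /* overflow */
        return 0;
    }

The three example checks match the conclusions drawn in the following paragraphs: FP = 10000 with γ = 0.9 is safe, γ = 0.96 with the same FP is not, and FP = 100000 overflows even for γ = 0.9.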

The data obtained in table 4.4, along with the restriction Q_FP ≤ MaxLongNXC/FP, allow us to determine whether our system will cause overflows for any set of γ, R, and FP. As an example we check our optimal values from chapter 3: γ = 0.9, R = 1, and FP = 10000. From table 4.4, Q_max = 10·R·FP = 100,000. These parameters satisfy the above requirement, since Q_max is lower than MaxLongNXC/FP (100,000 < 214,748). The importance of normalizing the rewards to 1 can be seen with a simple example: if the rewards rise up to R = 10, the Q-matrix values will be ten times the above values, and the requirement Q_max ≤ MaxLongNXC/FP will no longer be fulfilled. Analogously, with other values of γ, R, and FP, we can deduce the following:

- FP = 100000 (5 decimal places) will result in overflow, since the condition Q_max = X·R·100,000 ≤ MaxLongNXC/FP ≈ 21,474, with X any value from table 4.4, involves reducing R to 0.1 and γ to 0.5, leading to a poor look-ahead learning process.
- FP = 10000 (4 decimal places) allows for γ values up to 0.95 (γ = 0.96 causes overflows in the worst-case scenario).
- FP = 10000 and γ = 0.9 allow for reward values up to 2 without causing overflow.
- FP = 1000 (3 decimal places) allows for γ values up to 0.999, although this is associated with serious precision problems, as will be described in the next subsection.
- FP = 100 has practically no γ restrictions, but it also results in precision problems.

Table 4.4 also helps to establish the limits for 16-bit systems (here MaxIntNXC = 2^15 − 1 = 32,767). Using the same procedure as above, we found that only FP = 100 and γ < 0.7 can be used to avoid overflow. Finally, no 8-bit system (MaxIntNXC = 255) can hold a correct Q-learning algorithm using the notation system employed here.

4.2.2 Precision analysis

The study of numerical precision begins with a simulation of the frequentist model explained in the previous chapter and listed in appendix B.3, collecting the following data: the step in which the optimal policy was learned and kept, the resulting Q-matrix, and two arrays with the history of the executed actions and the states reached at each step. We will use these results, shown in table 4.5, as a reference for comparison with the data obtained by the robot later on.

Table 4.5: Q-matrix reference values obtained from the Octave simulation. The optimal policy (values in bold) was learned in step 298.

We have modified the previous simple_task_learning.nxc code to use the arrays history_a and history_sp from the simulation. In this case, the action executed by the robot is selected by accessing the proper history_a value; the same applies to the reached state. Meanwhile, the robot remains stopped. In summary, we are reproducing the same learning process made in simulation, now inside the NXT robot, with all its limitations. This allows us to compare the resulting Q-matrix of both the robot and the offline simulation.

    // Modified Q-learning algorithm, main loop (precision study)
    for(step=1; step<N_STEPS+1; step++)
    {
        //a = selectaction(s);    // Exploitation/exploration strategy
        a = history_a[step-1];
        //executeaction(a);
        //Wait(STEP_TIME);
        //sp = observestate();
        sp = history_sp[step-1];
        R = simulatereward(s,a,sp);

        // Update Q-matrix
        Q[s][a] = ((FP-alpha)*Q[s][a])/FP + (alpha*(R+(gamma*V[sp])/FP))/FP;

        // Update V and Policy
        V[s] = Q[s][1];
        Policy[s] = 1;
        for(i=2; i<N_ACTIONS+1; i++)
        {
            if(Q[s][i] > V[s])
            {
                V[s] = Q[s][i];
                Policy[s] = i;
            }
        }

        s = sp;                   // Update state
    }

Figure 4.4: Q-learning algorithm implemented for the analysis of precision.

The main learning loop of the robot implemented for this purpose is the one shown in figure 4.4. Running the code on the robot, and keeping the same parameters selected so far, we obtain the values shown in table 4.6.

Table 4.6: Q-matrix values from the NXT simulation. The optimal policy (values in bold) was learned in step 298 (note: Q-matrix values have been multiplied by FP).

There is no need to repeat the learning process on the robot several times, since the fixed sequence of states and actions will always lead to the same Q-matrix after 2000 steps. Now both simulations, the one in Octave with 64-bit floating point and the one in the NXT with 32-bit fixed point, have comparable Q-matrices. Their differences are shown in figure 4.5.

Our results show that both tests entered the optimal policy at the same step (298). On the other hand, the greatest difference between Q-matrix values after 2000 steps represents a deviation of 2.6% of the greatest possible reward in a single step. In terms of relative deviations, the greatest difference was detected in Q(3, 1), with a 3.5% error. These differences were so small that they did not affect the learning process at all, reaching the optimal policy in the exact same iteration. Hence, we will keep using the previous set of parameters (γ = 0.9, α = 0.02, FP = 10000).

Figure 4.5: Q-matrix differences between the offline and the NXT simulation.

Nevertheless, we wanted to know the precision error committed by the robot when working with other parameters. A set of experiments involving different γ, α, and FP has been performed on the robot using the same procedure. In this way we can study how sensitive other combinations of γ and α are to the fixed-point notation by changing FP alone. Figure 4.6 compares the results of these tests according to the step in which the optimal policy was reached and maintained. For convenience, we have employed powers of 10 as scaling factors for the fixed-point notation system. From the obtained data, we can deduce the following:

Figure 4.6: Step in which the optimal policy was learned for different FP, γ and α in the NXT robot.

- In 32-bit systems without an FPU, as in our case, fixed-point notation systems with scaling factors different from 1/1000 and 1/10000 (3 or 4 decimal places) should not be used, either because of lack of precision or because of overflow.
- The parameter α should be smaller than 0.1 in order to reach the optimal policy in fewer iterations; the best results of this analysis were obtained for a single value of α (see fig. 4.6).
- When α = 0.01, better results were obtained with γ = 0.9 than with γ = 0.75. Although our standard case, α = 0.02, was not affected by this, it is expected that more complex tasks will be more sensitive to this parameter, which should lead to the choice of a large γ.

- FP = 1000 gave better results than FP = 10000 because of precision errors (bold text in fig. 4.6). In fact, the similarities in the Q-matrix values fortuitously gave the optimal policy at an earlier step than we expected.
- 16-bit systems, restricted to FP = 100 as discussed in section 4.2.1, require a fixed α = 0.1. Smaller values cause precision problems of such magnitude that they can hardly change the Q-matrix values, most of which remain unchanged. Greater values, on the other hand, result in serious convergence problems.

At this point we can perform a precision analysis regarding the values obtained in the Q-matrix. In order to get a better understanding of this topic, the number of steps of the learning process has been reduced to 100. Thus, we can analyze the deviations produced in the first part of learning. Figures 4.7 and 4.8 show the results of these tests.

Figure 4.7: Q-matrix values with FP = 1000 and FP = 10000. Q-matrix values are multiplied by FP.

The main finding of this comparative study is that the smallest valid learning rate that avoids severe losses of precision is limited by the number of decimal places of our fixed-point system. Figure 4.8 shows that as we decrease the number of decimal places, the error due to the loss of precision rises, this error being greater when α is small. In our example, when α = 0.02, the average error obtained with 3 decimal places with respect to 4 decimal places after 100 steps was 2.5%, with a maximum of 4.4%. This is five times greater than the error obtained when working with α = 0.1. Thus, although the previous tests gave better results when working with α = 0.02 and FP = 1000, we can state that the error in their Q-matrix values was relatively high. This fact is important when designing more complex tasks with FP = 1000, where α = 0.1 should be preferred to improve the accuracy of the learning process. On the other hand, the choice of γ = 0.75 or γ = 0.9 had no effect on this comparative study.
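To see concretely how few decimal places can freeze small-α updates, the following hedged C sketch (ours, not thesis code) performs one update step with Q(s,a) = 0, V(sp) = 0 and R = 0.1 for two scaling factors: with FP = 100 and α = 0.02 the whole increment truncates to zero, whereas α = 0.1 or FP = 10000 still registers the update.

    #include <stdio.h>

    /* One fixed-point Q update (eq. 4.3), all arguments already scaled by FP.
     * Illustrates how integer division truncates small increments away. */
    static long update(long FP, long alpha_fp, long gamma_fp, long R_fp,
                       long Q_fp, long V_fp)
    {
        return ((FP - alpha_fp) * Q_fp) / FP
             + (alpha_fp * (R_fp + (gamma_fp * V_fp) / FP)) / FP;
    }

    int main(void)
    {
        /* FP = 100, alpha = 0.02: the increment 0.02*0.1 truncates to zero. */
        printf("FP=100,   alpha=0.02: Q = %ld\n", update(100, 2, 90, 10, 0, 0));
        /* FP = 100, alpha = 0.10: the update survives (Q = 1, i.e. 0.01).   */
        printf("FP=100,   alpha=0.10: Q = %ld\n", update(100, 10, 90, 10, 0, 0));
        /* FP = 10000, alpha = 0.02: Q = 20, i.e. 0.002, the exact result.   */
        printf("FP=10000, alpha=0.02: Q = %ld\n", update(10000, 200, 9000, 1000, 0, 0));
        return 0;
    }

This is consistent with the observation above that, with only two decimal places (FP = 100), α must be fixed at 0.1 for the Q-matrix to change at all.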

Figure 4.8: Differences in Q-matrix values between FP = 1000 and FP = 10000.

4.2.3 Memory limits

The greatest number of states and actions allowed for our robot has been obtained after addressing the following considerations:

- The size of the Q-matrix is limited by the memory of the robot, 64 KB, which must also store the CPU stack and other global variables.
- The cell size of the Q-matrix will be 32 bits, since 16-bit cells would result in a less efficient and more limited learning process, as explained in the two previous sections.
- Later on it will be necessary to allocate memory for auxiliary variables, including the exploration matrix introduced in the next chapter.

Figure 4.9 shows the size occupied by the different variables in a particular case, along with some pairs of values of the number of states and actions that can be employed without exceeding the memory of the robot. The conclusion is that the number of states and actions will not represent any problem in this work: the most complex task implemented here has 16 states and 9 actions, which requires less than 2% of the total memory available.

4.3 Implementation results

At this point we have obtained all the information about the parameters that best suit the robot, as well as their restrictions when they are combined in the same learning structure. Table 4.7 summarizes these data, which represent the input parameters for the wander-1 implementation.

Figure 4.9: NXT memory restrictions for states/actions: example of memory needed (left), and number of states and actions available (center and right).

Table 4.7: Parameters of the wander-1 task implementation.
    Robot speed: 80 (in a range [0, 100])
    Step time: 250 ms
    Number of steps: 2000
    Exploitation/exploration: ε-greedy, 30% exploration
    Discount rate γ: 0.9
    Learning rate α: 0.02
    FP: 10000 (4 decimals in fixed-point notation)
    Q-matrix cell size: 4 bytes (long)

The full NXC source code was explained in section 4.1 and can be consulted in appendix C.2. The real tests were performed in the 70x70 cm square enclosure shown in figure 4.10.

Figure 4.10: Scenario A for the wander-1 task implementation.

As a reference, the frequentist model simulations gave an average of 260 steps needed to learn the task (recall section 3.5). However, after running six real tests on the robot, the results were unexpectedly better than the simulated ones. Figure 4.11 collects these results.

Figure 4.11: Wander-1 learning implementation. Summary of experiments (* divided by FP).

Figure 4.12: Wander-1 learning implementation. Examples of the resulting Q-matrix.

Figures 4.13 and 4.14 show the robot during the learning process and exploiting the learned policy for the wander-1 task. It seems that the real environment offered a better model of learning for the robot than the one designed for the simulation of the wander-1 task. Since all the tests with the real robot learned the optimal policy efficiently, we can consider that we are ready to move on to more complex task implementations.

Figure 4.13: Wander-1 task implementation. Robot during the learning process.

Figure 4.14: Wander-1 task implementation. Robot exploiting the learned policy.


Chapter 5

Complex wandering task

In chapter 4 we discussed a successful and efficient learning process for a simple wandering task. One interesting conclusion was that the real-world scenario resulted in a slightly better model for the robot than those created for simulation. After the work described in sections 4.1 and 4.2, the time employed in designing and simulating the wander-1 task model, including data collection, was much longer than that needed for implementing and testing the learning algorithm on the robot. This chapter addresses two new learning processes designed and studied directly on the robot: one with 5 states and 4 actions called wander-2 (section 5.1), and a more complex implementation with 16 states and 9 actions called wander-3 (section 5.2). The last section summarizes some improvements related to this work that can be made to the basic Q-learning algorithm.

5.1 Wander-2 task

A wandering task with five states, wander-2, has been proposed and designed to take advantage of the ultrasonic sensor mounted on the robot but not employed in the previous task. Thus, with 1 ultrasonic sensor, 2 frontal contact sensors and 4 movement actions (stop, left-turn, right-turn and move forward), we intend the robot to learn how to wander while avoiding obstacles. As shown in the wander-1 task learning, we expect that in a state with the right-front contact activated, the robot learns that turning to the left will lead to its optimal policy; also, if a wall or obstacle is detected close to the robot, a turning action will most likely be the best decision to achieve the optimal policy.

Neither rewards nor actions change in this new setting; a fifth state has simply been added that is reached when the sonar measures a distance above a given threshold and no collision occurs. If the measured distance is below this threshold without colliding, the agent remains in state s1. This new discretization of states and actions is shown in table 5.1.

Table 5.1: Wander-2 task description.
    Task objective: wandering avoiding obstacles (5 states)
    States:
        s1: no contact & obstacle near
        s2: left bumper contact
        s3: right bumper contact
        s4: both contacts
        s5: no contact & obstacle far
    Actions:
        a1: stop
        a2: left-turn
        a3: right-turn
        a4: move forward

Another characteristic of this task is related to the measured distance. Considering the separation between the ultrasonic sensor and the front contacts (13 cm), the threshold distance has been tuned to 25 cm. The power of the wheels has been reduced from 80 to 50. This empirical combination of threshold and speed allows the robot to get close enough to an obstacle to maximize the area of the wandering path. However, after a few tests we detected some accuracy problems in the ultrasonic sensor that may affect the learning process. Addressing these issues will help in the understanding of the remaining tests described in this chapter; thus we enumerate them here:

- When the robot is facing a straight wall, it is unable to detect it if the angle formed by the wall and the line representing the sonar propagation beam (same orientation as

the robot) is below approximately 42 degrees. In other words, starting from a setup in which the robot faces a wall with an orientation perpendicular to it, the obstacle will be detected by the sonar provided that the robot does not change its orientation by more than 90 − 42 = 48 degrees. This situation is depicted in fig. 5.1.

Figure 5.1: Angle restriction of the ultrasonic sensor for detecting frontal obstacles.

- The highest value detected by the ultrasonic sensor is about 147 cm. Distances above that were not detected (the sensor returns 255 cm). We also found that, frequently, distances greater than 142 cm were not detected either. There were no problems measuring the lowest distances.

The wander-2 task is still simple enough to guess its optimal policy intuitively; that policy is shown in table 5.2. Again, it has been used to construct an indicator of the goodness of the learning process.

Table 5.2: Wander-2 task optimal policy.
    s1 (obstacle near): a2 or a3 (left or right-turn)
    s2 (left contact): a3 (right-turn)
    s3 (right contact): a2 (left-turn)
    s4 (both contacts): a2 or a3 (left or right-turn)
    s5 (obstacle far): a4 (move forward)

Regarding the robot implementation of the learning process, a few changes have been made with respect to the wander-1 code, including the new observestate() routine shown in fig. 5.2.

    byte observestate(void)
    // Returns the state of the robot by encoding the information measured from
    // the ultrasonic and the contact sensors. In case the number of states
    // or their definitions change, this function must be updated.
    // States discretization:
    //   s1: left_contact=0, right_contact=0, obstacle near
    //   s2: left_contact=1, right_contact=0
    //   s3: left_contact=0, right_contact=1
    //   s4: left_contact=1, right_contact=1
    //   s5: left_contact=0, right_contact=0, no obstacle near
    {
        byte state;
        byte sonardistance;
        byte sonarstate;

        sonardistance = getsonardistance();
        if(sonardistance <= DISTANCE_0)
            // DISTANCE_0 is the threshold distance defined in NXT_io.h used to
            // distinguish between states s1 and s5
            sonarstate = 0;
        else
            sonarstate = 1;

        state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER);   // defined in NXT_io.h
        if(state==1 && sonarstate==1)
            state = 5;
        return(state);
    }

Figure 5.2: observestate() routine for the wander-2 task.

The rest of the code can be consulted in appendix C.3. Table 5.3 summarizes the parameters employed in all the wander-2 task experiments.

Table 5.3: Wander-2 task implementation parameters.
    Robot speed: 50 (in a range [0, 100])
    Step time: 250 ms
    Number of steps: 2000
    Exploitation/exploration: ε-greedy, 30% exploration
    Discount rate γ: 0.9
    Learning rate α: 0.02
    FP: 10000 (4 decimals in fixed-point notation)
    Q-matrix cell size: 4 bytes (long)

The present task was tested in two scenarios:

- Scenario A: the same 70x70 cm square enclosure used for wander-1, already shown in figure 4.10.
- Scenario B: a 105x70 cm enclosure with a 35x8 cm obstacle, as shown in figure 5.3.

Figures 5.4 and 5.5 display the results of the tests in scenarios A and B. The learning process of the wander-2 task in scenario A reached the optimal policy in an average number

of steps of 531 (133 seconds), ranging from 128 steps (32 s) to 1154 steps (289 s). These results imply that the learning process, although successful in the four experiments performed, needed almost three times more iterations than the learning process of the wander-1 task, which has just one state fewer. Besides, the dispersion of the results found in the wander-2 learning was larger.

Figure 5.3: Scenario B for the wander-2 implementation.

However, the number of states and actions used so far is so small that the maneuvers of the robot lacked precision, especially when avoiding certain corners. The learning process cannot improve this behavior unless we increase the number of states and actions. The next section addresses a more complex task with 16 states and 9 actions, which involves a large increment in the size of the Q-matrix: from the 16 or 20 elements of the previous tasks to a new Q-matrix of 144 elements.

Figure 5.4: Wander-2 learning implementation: summary of experiments for scenario A (* divided by FP).

Figure 5.5: Wander-2 learning implementation: summary of experiments for scenario B (* divided by FP).

Figures 5.6 and 5.7 show the robot during the learning process and exploiting the learned policy for the wander-2 task.

Figure 5.6: Wander-2 task implementation. Robot during the learning process.

Figure 5.7: Wander-2 task implementation. Robot exploiting the learned policy.

5.2 Obstacle avoidance task learning: wander-3 implementation and results

The final developed task, wander-3, defines 4 distance ranges for the ultrasonic sensor. Combining each distance range with the 4 possible combinations of the contact sensors results in 16 states. On the other hand, each wheel is able to move forward, move

backward, or remain stopped independently of the other wheel; thus the robot has 9 different actions to execute at each learning step. Our goal in this task is for the robot to learn to wander avoiding obstacles with smoother and more precise movements than those obtained in the previous tasks. For example, it will be able to turn in one direction in three ways: by moving one wheel forward, the other one backward, or both wheels at the same time. The task definition, along with the discretization of states and actions, is shown in table 5.4.

Table 5.4: Wander-3 task. Description of states and actions.
    Task objective: wandering avoiding obstacles (16 states & 9 actions)
    States:
        s1:  no contact & obstacle range 0
        s2:  left contact & obstacle range 0
        s3:  right contact & obstacle range 0
        s4:  both contacts & obstacle range 0
        s5:  no contact & obstacle range 1
        s6:  left contact & obstacle range 1
        s7:  right contact & obstacle range 1
        s8:  both contacts & obstacle range 1
        s9:  no contact & obstacle range 2
        s10: left contact & obstacle range 2
        s11: right contact & obstacle range 2
        s12: both contacts & obstacle range 2
        s13: no contact & obstacle range 3
        s14: left contact & obstacle range 3
        s15: right contact & obstacle range 3
        s16: both contacts & obstacle range 3
    Actions:
        a1: stop
        a2: left-turn (both wheels)
        a3: right-turn (both wheels)
        a4: move forward
        a5: left wheel forward
        a6: right wheel forward
        a7: left wheel backward
        a8: right wheel backward
        a9: move backward

The new distance ranges of the sonar are found in table 5.5. The rest of the parameters remain the same as in the previous wander-2 task.

Table 5.5: Wander-3 task: distance ranges considered for the ultrasonic sensor.
    Range 0: < 25 cm
    Ranges 1 and 2: two intermediate ranges between 25 cm and 75 cm
    Range 3: > 75 cm

This task has the disadvantage of being too complex to guess the optimal policy intuitively; hence we execute tests of 2000 steps and longer runs of more than one hour of learning. Regarding the implementation on the robot, besides rewriting the functions observestate() and executeaction(byte a) to include the new states and actions, it has been necessary to change the exploitation/exploration strategy so as to avoid some actions remaining unexplored after the learning process. The code in fig. 5.8 shows this modification. The modified ε-greedy strategy favors the chance of selecting the least explored action when the robot is in a given state. A simple exploration matrix, parallel to the Q-matrix, contains the information needed to implement this strategy.
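Figure 5.8 below shows the selection side of this strategy; the bookkeeping side, incrementing the counter whenever Q[s][a] is updated, is part of the main loop listed in appendix C.4 and is not reproduced in this chapter, so the following hedged C sketch (ours, with illustrative sizes and 1-based indexing as in the thesis) shows one plausible way it could look:

    #include <string.h>

    /* Hedged sketch of the exploration-matrix bookkeeping: a counter parallel
     * to the Q-matrix is incremented every time a (state, action) cell is
     * updated, so that the selection routine can later favour the least
     * explored action of the current state. */
    #define N_STATES  16
    #define N_ACTIONS  9

    long Q[N_STATES + 1][N_ACTIONS + 1];            /* index 0 unused */
    long exploration[N_STATES + 1][N_ACTIONS + 1];

    void init_tables(void)
    {
        memset(Q, 0, sizeof Q);
        memset(exploration, 0, sizeof exploration);
    }

    void record_update(int s, int a, long q_updated_value)
    {
        Q[s][a] = q_updated_value;   /* fixed-point update as in fig. 4.1 */
        exploration[s][a]++;         /* one more visit of this cell       */
    }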

    byte selectaction(byte s)
    // Input:  current state
    // Output: selected action
    {
        byte selectedaction;
        byte i;

        if(random(100) < 70)
        {
            // e-greedy: 70% probability of exploiting the current learned policy.
            selectedaction = Policy[s];
        }
        else   // 30% of exploring
        {
            if(random(100) < 60)
            {
                // Improvement to simple e-greedy: enhances exploring actions possibly
                // not visited too often. When exploring, there is a 60% probability
                // of selecting the least explored action for the current state.
                selectedaction = 1;
                for(i=2; i<N_ACTIONS+1; i++)
                    if(exploration[s][i] < exploration[s][selectedaction])
                        // exploration[state][action] is the exploration matrix
                        // used to count the number of times that the cell
                        // Q[s][a] has been explored (and thus updated).
                        selectedaction = i;
            }
            else
            {
                // When exploring, there is a 40% probability of selecting a
                // random action.
                selectedaction = random(N_ACTIONS) + 1;   // (1,2,3...N_ACTIONS)
            }
        }
        return(selectedaction);
    }

Figure 5.8: Implementation of the exploration/exploitation strategy for the wander-3 task.

Table 5.6 summarizes the input parameters used in the learning process of the wander-3 task. The complete code used in these tests can be consulted in appendix C.4.

Table 5.6: Wander-3 task implementation parameters.
    Robot speed: 50 (in a range [0, 100])
    Step time: 250 ms
    Number of steps: 2000 and a longer (> 1 hour) run
    Exploitation/exploration: ε-greedy, 30% exploration (60% of it on the least explored action)
    Discount rate γ: 0.9
    Learning rate α: 0.02
    FP: 10000 (4 decimals in our fixed-point notation)
    Q-matrix cell size: 4 bytes (long)

The experiments were carried out in three different scenarios:

- Scenario A: the 70x70 cm square enclosure used in both wander-1 and wander-2 (figure 4.10).

- Scenario B: the 105x70 cm enclosure with a 35x8 cm obstacle in the center, already used in the wander-2 task (figure 5.3).
- Scenario C: a new 200x125 cm enclosure with four 135-degree corners and two small obstacles (see figure 5.9).

Figure 5.9: Scenario C for the wander-3 task implementation.

Figure 5.10 shows the results of a test performed in scenario A for over 9000 steps (approximately 40 minutes), run in order to compare the wander-3 task with the previous tasks in the same scenario. Figure 5.11 shows the results of the three tests performed in scenario C, one with 2000 steps (approximately 8 minutes) and the others with longer runs (> 60 minutes). Data from scenario B were not collected, since the learning process turned out to be very similar to the one in scenario A.

Figure 5.10: Results of the wander-3 task implementation test in scenario A.

A total of six videos for the three tasks covered in this work, showing the evolution of the learning process and the exploitation of the learned policy, have been recorded to support further analysis of the learning methods developed. Figure 5.12 shows some scenes of the robot

exploiting the learned policy of the wander-3 task. The following conclusions are based on the study of these recordings along with the interpretation of the data displayed in figures 5.10 and 5.11.

Figure 5.11: Results of the wander-3 task implementation tests in scenario C.

Figure 5.12: Wander-3 task implementation. Robot exploiting the learned policy (top to bottom, left to right).

- Tests performed in scenario A show that the time needed to learn an accurate policy for the wander-3 task was much longer than in the two previous tasks. This behavior

was expected, since the number of states and actions is now higher. However, in the test shown in figure 5.10, some well-explored states were not able to achieve what might be considered a reasonable policy, even after 9000 steps. As an example, state s3 (right contact pressed) resulted in action a3 (right-turn), a policy that will hardly lead to high rewards in either the short or the long run.

- A special case was detected in scenario A. State s13 (no contact pressed and obstacle far) resulted in a learned action a2, i.e., left-turn. Later tests revealed that the failure to detect the wall due to the sonar angle issue, explained in the previous section, was involved in this problem: it is likely that moving forward from s13 resulted in the colliding state s11 in the 35 occurrences recorded in the exploration matrix, and thus left-turn actions represented a better choice in the long run.

- States s8, s12 and s16 did not learn an accurate action. In scenario A they were not even explored, which is reasonable since they involve both frontal bumpers colliding while the sonar distance is outside range 0. In scenario C, s12 was reached only up to 6 times in the longest runs, too low a count even when exploring all available actions.

- The learning processes of 2000 steps in scenario C generally gave better results than the previous tests in scenario A with more than 9000 steps. We can thus state that larger and more complex scenarios offer better models of learning for the wander-3 task than a minimalist simple scenario. The simple scenario A has many redundant states in the wander-3 setup, making the learning inefficient. This issue could be improved by using advanced techniques, such as state approximation, that are out of the scope of this work.

- Although a basis for a correct policy is reached in 2000 steps, figure 5.11 shows that when passing from 2000 steps to the longer runs the selected actions evolve into a more accurate policy. In particular, experiment #3 resulted in the pattern a4 (move forward), a3 (right-turn), a2 (left-turn), a1 (stop) repeated in {s5, s6, s7, s8}, {s9, s10, s11, s12} and {s13, s14, s15, s16}, all representing the states no contact, left contact, right contact, both contacts when the distance measured by the sonar belongs to ranges 1, 2 and 3 respectively. The learning process was able to establish that the robot should follow a specific policy when the obstacle was detected in range 0, and another one when it was located in ranges 1, 2 or 3. Hence, the tests result in a consistent clustering of three states into one set.

- Among the longer tests performed in scenario C, there are some minor differences in their learned policies. In particular, state s2 (object near and left bumper contacted) usually resulted in action a3 (right-turn), as in previous tasks, but sometimes led to a7 (left wheel backward), as shown in figure 5.11. This particular behavior, reproduced in several tests, also offers a consistent policy, since it leads to a non-collision state quickly, usually from s2 to s1. The robot often executes a two-wheel turn (a2 or a3) in the next step, avoiding the obstacle, and thus resulting in a strategy with a longer look-ahead sequence of steps.

- Every test achieves sequences of movements including turns and backward displacements, preventing the robot from getting stuck, which generally occurs when state s4 is reached.

- Another special case was detected when working with wander-3 in scenario B.
After 1000 steps, the robot had learned a policy that turned whenever it reached state s13, in which the measured distance belonged to range 3 (the furthest). Thus, the robot kept wandering in a small area between the walls and the obstacle. The problem here was that the robot could barely explore any action when it was oriented towards the furthest wall, because it only reached that state once or twice every 40 steps (approximately). To solve this

problem, the robot was moved to an open space where the so-far rarely explored state was frequently reached. A few minutes later, the robot had learned to move forward from state s13. The robot was then returned to scenario B, and it described a path around the whole enclosure.

- Move-forward actions usually result in anything but a straight path: the robot makes small turns to the right, and the lower the battery, the larger the drift. This could explain why, in most of the wander-2 and wander-3 experiments, the learned policies circle the scenarios clockwise, due to this irregularity of the model. The alternative anticlockwise policy would lead to sequences of actions involving more turns and less moving forward, since this drift would cause the robot to face the external walls of the scenarios more often.

- Finally, some tests performed in scenario C resulted in turning policies (a3) when the robot reached states representing sonar distances belonging to ranges 0 and 2 (s1 and s9). By turning at those distances from the obstacles, the robot sometimes avoided wandering along narrow paths between the wall and the obstacles of this scenario, locations where sonar measurement errors were frequent and some collisions were inevitable.

5.3 Summary

Taking into account all the results discussed in the previous sections, we summarize here the improvements applied to the basic Q-learning algorithm:

- Exploration-exploitation strategy: an exploration matrix parallel to the Q-matrix, which stores the number of times each Q-matrix cell is updated, has been created to favor the least explored actions.
- Learning rate (α): chapters 3 and 4 showed that the use of a fixed α = 0.02 resulted in an efficient learning process and also reduced the precision problems on the NXT robot. Besides, this value was large enough to adapt the policy to changes in the environment within relatively short periods.
- Discount rate (γ): the impact on the numerical problem of loss of precision, along with the results obtained from simulation, gives γ = 0.9 as the value that best fits this work.
- Rewards: the individual values of the rewards were normalized and tuned to avoid falling into non-optimal policies and overflows.


Chapter 6

Conclusions and future work

This master thesis has addressed the design and implementation of a learning method based on the Q-learning algorithm for an obstacle-avoidance wandering task on the educational robot Lego Mindstorms NXT. This learning method has been able to learn optimal or pseudo-optimal policies in three tasks of different complexity while maintaining the same values for the parameters. Simulations based on a frequentist model, built from the data obtained by the robot in its environment, resulted in a set of parameters that were later used successfully on the real robot. The use of a fixed-point notation system has also been evaluated and verified, since the robot is based on a microcontroller without an FPU.

The design of a very simple wandering task (wander-1) allowed us to carry out a comparative study between the simulation and the real implementation of the learning method. The limitations of our real agent regarding overflows and precision losses were identified in this study. The non-sonar experiments presented in chapter 4 were aimed at learning to wander inside small scenarios, moving forward whenever no contact with obstacles was detected, and turning in the appropriate direction after colliding, usually with the same bumper. The results show that the number of steps needed to learn the wander-1 task in the real-world scenario is, in most tests, slightly lower than that needed in simulation. Thus, we can state that the real environment offered a slightly better model of learning than the simulator. Besides, based on the studies described in chapter 4 regarding losses of precision and overflows, we can state that the time needed to implement the learning algorithm on the robot is shorter than the time needed for simulation.

The experiments including the sonar sensor described in chapter 5, wander-2 and wander-3, are aimed at maximizing the paths followed by the robot in the scenarios, especially in the wander-3 task, in which more precise movement sequences are available. The learned policies allow the robot to maximize the number of steps moving forward, thus receiving the greatest rewards, while avoiding penalties by turning whenever a move-forward action would lead to a collision.

To sum up, we have obtained a Q-learning-based method for the NXT robot that is flexible enough to be adapted to other tasks on different systems, and that has served to explore thoroughly the limitations of small robotic platforms for practical reinforcement learning. Further work could also evaluate the findings of this research on other small-scale systems.

The work presented in this master thesis could become part of a larger project oriented towards learning more complex tasks. To carry out such a proposal, approximation techniques such as neuro-dynamic programming [21] will be essential in order to approximate states; otherwise our reinforcement learning problem would become intractable due to the large number of states that could arise. This has been shown in the learning process of the final wandering task of this work (wander-3), in which, although pseudo-optimal policies were obtained, several redundant states appeared that reduced the efficiency of the learning process. Other techniques,

such as hierarchical reinforcement learning [22], could be necessary in order to learn more complex tasks. Finally, our proposal for future work includes adding two techniques which we consider would have a great impact on the learning method:

- An improved exploration-exploitation strategy: the implementation of an algorithm that governs the exploration/exploitation rate of the simple ε-greedy approach used in this work. This algorithm should be based on both the Q-matrix and the exploration matrix. Related strategies, including Boltzmann methods, should be addressed here (a minimal sketch of a Boltzmann selection rule is given after this list).
- Adding a stage at the end of the learning loop that checks for convergence or for potential changes in the model. This is a generalization of the previous strategy. The design of algorithms based on the Q-matrix and the exploration matrix could lead to greater α and higher exploration rates both at the beginning of the learning process and after detecting a relevant change in the model. On the other hand, as the learning process converges to stable policies, α and the exploration rate should be reduced.
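As a purely illustrative reference for the first point, the following C sketch (ours, not part of the thesis) implements Boltzmann (softmax) action selection: actions with higher Q-values are chosen more often, and the temperature T plays the role of the exploration rate (large T behaves like uniform exploration, small T like greedy exploitation):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hedged sketch of Boltzmann (softmax) action selection over the Q-values
     * of one state. N_ACTIONS and the Q-values below are illustrative. */
    #define N_ACTIONS 4

    int boltzmann_select(const double q[N_ACTIONS], double T)
    {
        double w[N_ACTIONS];
        double sum = 0.0;

        for (int a = 0; a < N_ACTIONS; a++) {
            w[a] = exp(q[a] / T);            /* preference weight of action a */
            sum += w[a];
        }
        double r = ((double)rand() / RAND_MAX) * sum;   /* sample in [0, sum] */
        for (int a = 0; a < N_ACTIONS; a++) {
            r -= w[a];
            if (r <= 0.0)
                return a;
        }
        return N_ACTIONS - 1;                /* numerical fallback */
    }

    int main(void)
    {
        double q[N_ACTIONS] = { 0.1, 0.9, 0.3, 0.2 };   /* example Q(s, .) */
        printf("selected action: %d\n", boltzmann_select(q, 0.5));
        return 0;
    }

Decreasing T as the Q-matrix and the exploration matrix indicate convergence would implement the reduction of the exploration rate proposed in the second point; on the NXT, without an FPU, the exponential would additionally have to be approximated in fixed point.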

Appendix A

Implementation details

Since this work involves the performance of many tests on the robot, it has been necessary to implement some techniques to support debugging and data collection for subsequent analyses. This appendix briefly describes some of these techniques, already implemented in the experiments developed in chapters 4 and 5 and listed in appendix C.

A.1 Debugging, compilation, and NXT communication

As stated in chapter 4, the code implemented on the robot has been developed in the Not eXactly C programming language [20]. A standard PC under two operating systems was employed for software development: MS Windows XP and Ubuntu. The IDE Bricx Command Center v3.3 (build ) under Windows XP has been used for debugging, compiling, exporting the .rxe binary files to the robot, and importing the resulting data files. Figure A.1 shows a screenshot of this program.

Figure A.1: Bricx Command Center v3.3.

When working under Linux, the compiler NBC r4 has been used for debugging and sending the binary files to the robot, by typing in a terminal: nbc sourcefilename.nxc -d -S=usb. In that case, the results were imported from the robot with the communication tool NeXTTool, by typing: nexttool /COM=USB0 -upload="*.log". Any test presented in this work can be reproduced on both operating systems without making any change in the source code.

A.2 Datalogs

Once all the steps of the learning process have been executed, the function void NXTsaveData(long memoryneeded) is called. This routine, after checking whether the output file has been successfully created with the required size, saves the following information in it:

0. Learning process name and number of steps executed.
1. Learned policy.
2. Resulting value function.
3. Resulting Q-matrix.
4. Exploration matrix.
5. Last step in which the policy changed into the optimal policy.

The source code implementing each of the three tasks developed in the present work contains this function. An example of a resulting data file generated by the robot is shown in fig. A.2.

    % 5 States Task NXT
    NumberOfSteps=2001;
    Policy = [ ];
    V = [ ];
    Q = [ ];
    exploration = [ ];
    % Optimal Policy learned in step: 495

Figure A.2: Data file generated by the wander-2 (five-state) task program on the robot.

A.3 Brick key events

In order to study the evolution of any learning process, three user events that interrupt the learning process by pressing the buttons of the brick have been included in the programs. These are used for:

1. Exploiting the policy learned so far, without disturbing the learning process, and with the opportunity to continue learning after another button event.
2. Pausing and resuming the learning process, which also helps us to check the information shown on the NXT display and to move the robot to a new location.

3. Ending the current process and generating the data file, as though the last learning step had been reached.

These button events are implemented in the following lines of code, placed just at the end of the main learning loop:

    if (saveandstopbutton())            // Right button (NXT_io.h)
        break;
    if (exploitepolicybutton())         // Left button (NXT_io.h)
        exploitepolicy(s);
    if (pauseresumebutton())            // Center button (NXT_io.h)
    {
        Wait(1000);
        executeaction(INITIAL_POLICY);  // INITIAL_POLICY: wheels stopped
        pausenxt();
    }

Also, the robot enters a pause state after finishing any learning process, waiting for button events. This characteristic has been employed for executing the learned policy after the whole process has finished.

A.4 Online information through the display and the loudspeaker

The function void showmessage(const string &m, byte row, bool wait), implemented in the library NXT_io.h, is called from the main code at several points to display the current state of the learning process (see fig. A.3). This function is also used in paused mode to display the available user events.

Figure A.3: NXT display with the learning information of the wander-1 task.

The complementary transmission of information through the NXT loudspeaker gives the user a better understanding of the learning process without interrupting the robot. We have defined note and note-length constants in alphabetic music notation in NXT_io.h. Preliminary tests reproduced high-pitched notes when large rewards were received, and lower-pitched notes in the case of large penalties. Any change in the learned policy was also signaled with two high-pitched notes. The use of sound emissions accelerated and improved the adjustment of rewards, speeds and distance ranges, and, above all, error detection. The code listed in this work contains a reduced


Reinforcement Learning for Appearance Based Visual Servoing in Robotic Manipulation Reinforcement Learning for Appearance Based Visual Servoing in Robotic Manipulation UMAR KHAN, LIAQUAT ALI KHAN, S. ZAHID HUSSAIN Department of Mechatronics Engineering AIR University E-9, Islamabad PAKISTAN

More information

Project from Real-Time Systems Lego Mindstorms EV3

Project from Real-Time Systems Lego Mindstorms EV3 Project from Real-Time Systems March 13, 2017 Lego Mindstorms manufactured by LEGO, http://mindstorms.lego.com extension of LEGO Technic line history: RCX, 1998 NXT, 2006; NXT 2.0, 2009 EV3, 2013 why LEGO?

More information

Master Thesis. Simulation Based Planning for Partially Observable Markov Decision Processes with Continuous Observation Spaces

Master Thesis. Simulation Based Planning for Partially Observable Markov Decision Processes with Continuous Observation Spaces Master Thesis Simulation Based Planning for Partially Observable Markov Decision Processes with Continuous Observation Spaces Andreas ten Pas Master Thesis DKE 09-16 Thesis submitted in partial fulfillment

More information

Slides credited from Dr. David Silver & Hung-Yi Lee

Slides credited from Dr. David Silver & Hung-Yi Lee Slides credited from Dr. David Silver & Hung-Yi Lee Review Reinforcement Learning 2 Reinforcement Learning RL is a general purpose framework for decision making RL is for an agent with the capacity to

More information

Pascal De Beck-Courcelle. Master in Applied Science. Electrical and Computer Engineering

Pascal De Beck-Courcelle. Master in Applied Science. Electrical and Computer Engineering Study of Multiple Multiagent Reinforcement Learning Algorithms in Grid Games by Pascal De Beck-Courcelle A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of

More information

REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION

REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION ABSTRACT Mark A. Mueller Georgia Institute of Technology, Computer Science, Atlanta, GA USA The problem of autonomous vehicle navigation between

More information

Hierarchical Reinforcement Learning for Robot Navigation

Hierarchical Reinforcement Learning for Robot Navigation Hierarchical Reinforcement Learning for Robot Navigation B. Bischoff 1, D. Nguyen-Tuong 1,I-H.Lee 1, F. Streichert 1 and A. Knoll 2 1- Robert Bosch GmbH - Corporate Research Robert-Bosch-Str. 2, 71701

More information

Predictive Autonomous Robot Navigation

Predictive Autonomous Robot Navigation Predictive Autonomous Robot Navigation Amalia F. Foka and Panos E. Trahanias Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH), Heraklion, Greece and Department of Computer

More information

ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS Summer Introduction to Artificial Intelligence Midterm ˆ You have approximately minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. ˆ Mark your answers

More information

Neuro-Dynamic Programming An Overview

Neuro-Dynamic Programming An Overview 1 Neuro-Dynamic Programming An Overview Dimitri Bertsekas Dept. of Electrical Engineering and Computer Science M.I.T. May 2006 2 BELLMAN AND THE DUAL CURSES Dynamic Programming (DP) is very broadly applicable,

More information

Abstract Path Planning for Multiple Robots: An Empirical Study

Abstract Path Planning for Multiple Robots: An Empirical Study Abstract Path Planning for Multiple Robots: An Empirical Study Charles University in Prague Faculty of Mathematics and Physics Department of Theoretical Computer Science and Mathematical Logic Malostranské

More information

Throughput Maximization for Energy Efficient Multi-Node Communications using Actor-Critic Approach

Throughput Maximization for Energy Efficient Multi-Node Communications using Actor-Critic Approach Throughput Maximization for Energy Efficient Multi-Node Communications using Actor-Critic Approach Charles Pandana and K. J. Ray Liu Department of Electrical and Computer Engineering University of Maryland,

More information

Reinforcement Learning-Based Path Planning for Autonomous Robots

Reinforcement Learning-Based Path Planning for Autonomous Robots Reinforcement Learning-Based Path Planning for Autonomous Robots Dennis Barrios Aranibar 1, Pablo Javier Alsina 1 1 Laboratório de Sistemas Inteligentes Departamento de Engenharia de Computação e Automação

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

Generalized and Bounded Policy Iteration for Interactive POMDPs

Generalized and Bounded Policy Iteration for Interactive POMDPs Generalized and Bounded Policy Iteration for Interactive POMDPs Ekhlas Sonu and Prashant Doshi THINC Lab, Dept. of Computer Science University Of Georgia Athens, GA 30602 esonu@uga.edu, pdoshi@cs.uga.edu

More information

When Network Embedding meets Reinforcement Learning?

When Network Embedding meets Reinforcement Learning? When Network Embedding meets Reinforcement Learning? ---Learning Combinatorial Optimization Problems over Graphs Changjun Fan 1 1. An Introduction to (Deep) Reinforcement Learning 2. How to combine NE

More information

Q-learning with linear function approximation

Q-learning with linear function approximation Q-learning with linear function approximation Francisco S. Melo and M. Isabel Ribeiro Institute for Systems and Robotics [fmelo,mir]@isr.ist.utl.pt Conference on Learning Theory, COLT 2007 June 14th, 2007

More information

USB. Bluetooth. Display. IO connectors. Sound. Main CPU Atmel ARM7 JTAG. IO Processor Atmel AVR JTAG. Introduction to the Lego NXT

USB. Bluetooth. Display. IO connectors. Sound. Main CPU Atmel ARM7 JTAG. IO Processor Atmel AVR JTAG. Introduction to the Lego NXT Introduction to the Lego NXT What is Lego Mindstorm? Andreas Sandberg A kit containing: A Lego NXT computer 3 motors Touch sensor Light sensor Sound sensor Ultrasonic range

More information

Attack Resilient State Estimation for Vehicular Systems

Attack Resilient State Estimation for Vehicular Systems December 15 th 2013. T-SET Final Report Attack Resilient State Estimation for Vehicular Systems Nicola Bezzo (nicbezzo@seas.upenn.edu) Prof. Insup Lee (lee@cis.upenn.edu) PRECISE Center University of Pennsylvania

More information

Design and Implementation of a Real-Time Autonomous Navigation System Applied to Lego Robots

Design and Implementation of a Real-Time Autonomous Navigation System Applied to Lego Robots ThBT1.4 Design and Implementation of a Real-Time Autonomous Navigation System Applied to Lego Robots Thi Thoa Mac, Cosmin Copot Clara M. Ionescu Dynamical Systems and Control Research Group, Department

More information

Mobile Robots: An Introduction.

Mobile Robots: An Introduction. Mobile Robots: An Introduction Amirkabir University of Technology Computer Engineering & Information Technology Department http://ce.aut.ac.ir/~shiry/lecture/robotics-2004/robotics04.html Introduction

More information

Agent-based Modeling using L S D

Agent-based Modeling using L S D Agent-based Modeling using L S D Marco VALENTE 1,2,3 1 LEM, S. Anna School of Advanced Studies, Pisa 2 University of L Aquila 3 University of Sussex, SPRU ABM: a definition Agent-based models are based

More information

Reinforcement Learning (2)

Reinforcement Learning (2) Reinforcement Learning (2) Bruno Bouzy 1 october 2013 This document is the second part of the «Reinforcement Learning» chapter of the «Agent oriented learning» teaching unit of the Master MI computer course.

More information

Why robotics in Education? DEI The University of Padova

Why robotics in Education? DEI The University of Padova Why robotics in Education? DEI The University of Padova Why robotics in Education? A Picture Is Worth A Thousand Words Why robotics in Education? Emotional engagement Interaction with physical devices

More information

An Improved Policy Iteratioll Algorithm for Partially Observable MDPs

An Improved Policy Iteratioll Algorithm for Partially Observable MDPs An Improved Policy Iteratioll Algorithm for Partially Observable MDPs Eric A. Hansen Computer Science Department University of Massachusetts Amherst, MA 01003 hansen@cs.umass.edu Abstract A new policy

More information

Lecture 13: Learning from Demonstration

Lecture 13: Learning from Demonstration CS 294-5 Algorithmic Human-Robot Interaction Fall 206 Lecture 3: Learning from Demonstration Scribes: Samee Ibraheem and Malayandi Palaniappan - Adapted from Notes by Avi Singh and Sammy Staszak 3. Introduction

More information

Final Exam Practice Fall Semester, 2012

Final Exam Practice Fall Semester, 2012 COS 495 - Autonomous Robot Navigation Final Exam Practice Fall Semester, 2012 Duration: Total Marks: 70 Closed Book 2 hours Start Time: End Time: By signing this exam, I agree to the honor code Name: Signature:

More information

CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM

CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM 20 CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM 2.1 CLASSIFICATION OF CONVENTIONAL TECHNIQUES Classical optimization methods can be classified into two distinct groups:

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

Localization, Where am I?

Localization, Where am I? 5.1 Localization, Where am I?? position Position Update (Estimation?) Encoder Prediction of Position (e.g. odometry) YES matched observations Map data base predicted position Matching Odometry, Dead Reckoning

More information

Residual Advantage Learning Applied to a Differential Game

Residual Advantage Learning Applied to a Differential Game Presented at the International Conference on Neural Networks (ICNN 96), Washington DC, 2-6 June 1996. Residual Advantage Learning Applied to a Differential Game Mance E. Harmon Wright Laboratory WL/AAAT

More information

1 Lab 5: Particle Swarm Optimization

1 Lab 5: Particle Swarm Optimization 1 Lab 5: Particle Swarm Optimization This laboratory requires the following: (The development tools are installed in GR B0 01 already): C development tools (gcc, make, etc.) Webots simulation software

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning

Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Jan Peters 1, Stefan Schaal 1 University of Southern California, Los Angeles CA 90089, USA Abstract. In this paper, we

More information

Hierarchical Average Reward Reinforcement Learning Mohammad Ghavamzadeh Sridhar Mahadevan CMPSCI Technical Report

Hierarchical Average Reward Reinforcement Learning Mohammad Ghavamzadeh Sridhar Mahadevan CMPSCI Technical Report Hierarchical Average Reward Reinforcement Learning Mohammad Ghavamzadeh Sridhar Mahadevan CMPSCI Technical Report 03-19 June 25, 2003 Department of Computer Science 140 Governors Drive University of Massachusetts

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

COMMUNICATION MANAGEMENT IN DISTRIBUTED SENSOR INTERPRETATION

COMMUNICATION MANAGEMENT IN DISTRIBUTED SENSOR INTERPRETATION COMMUNICATION MANAGEMENT IN DISTRIBUTED SENSOR INTERPRETATION A Dissertation Presented by JIAYING SHEN Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment

More information

Incremental methods for computing bounds in partially observable Markov decision processes

Incremental methods for computing bounds in partially observable Markov decision processes Incremental methods for computing bounds in partially observable Markov decision processes Milos Hauskrecht MIT Laboratory for Computer Science, NE43-421 545 Technology Square Cambridge, MA 02139 milos@medg.lcs.mit.edu

More information

Distributed and Asynchronous Policy Iteration for Bounded Parameter Markov Decision Processes

Distributed and Asynchronous Policy Iteration for Bounded Parameter Markov Decision Processes Distributed and Asynchronous Policy Iteration for Bounded Parameter Markov Decision Processes Willy Arthur Silva Reis 1, Karina Valdivia Delgado 2, Leliane Nunes de Barros 1 1 Departamento de Ciência da

More information

Autonomous Robotics 6905

Autonomous Robotics 6905 6905 Lecture 5: to Path-Planning and Navigation Dalhousie University i October 7, 2011 1 Lecture Outline based on diagrams and lecture notes adapted from: Probabilistic bili i Robotics (Thrun, et. al.)

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Other models of interactive domains Marc Toussaint University of Stuttgart Winter 2018/19 Basic Taxonomy of domain models Other models of interactive domains Basic Taxonomy of domain

More information

1 Lab + Hwk 5: Particle Swarm Optimization

1 Lab + Hwk 5: Particle Swarm Optimization 1 Lab + Hwk 5: Particle Swarm Optimization This laboratory requires the following equipment: C programming tools (gcc, make), already installed in GR B001 Webots simulation software Webots User Guide Webots

More information

OBSTACLE AVOIDANCE ROBOT

OBSTACLE AVOIDANCE ROBOT e-issn 2455 1392 Volume 3 Issue 4, April 2017 pp. 85 89 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com OBSTACLE AVOIDANCE ROBOT Sanjay Jaiswal 1, Saurabh Kumar Singh 2, Rahul Kumar 3 1,2,3

More information

Apprenticeship Learning for Reinforcement Learning. with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang

Apprenticeship Learning for Reinforcement Learning. with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang Apprenticeship Learning for Reinforcement Learning with application to RC helicopter flight Ritwik Anand, Nick Haliday, Audrey Huang Table of Contents Introduction Theory Autonomous helicopter control

More information

Skill. Robot/ Controller

Skill. Robot/ Controller Skill Acquisition from Human Demonstration Using a Hidden Markov Model G. E. Hovland, P. Sikka and B. J. McCarragher Department of Engineering Faculty of Engineering and Information Technology The Australian

More information

A Fuzzy Local Path Planning and Obstacle Avoidance for Mobile Robots

A Fuzzy Local Path Planning and Obstacle Avoidance for Mobile Robots A Fuzzy Local Path Planning and Obstacle Avoidance for Mobile Robots H.Aminaiee Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran Abstract This paper presents a local

More information

Markov Decision Processes (MDPs) (cont.)

Markov Decision Processes (MDPs) (cont.) Markov Decision Processes (MDPs) (cont.) Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University November 29 th, 2007 Markov Decision Process (MDP) Representation State space: Joint state x

More information

Fun with Java Technology on Lego Mindstorms

Fun with Java Technology on Lego Mindstorms Speaker logo centered below photo Fun with Java Technology on Lego Mindstorms Roger Glassey University of California, Berkeley Andy Shaw Sun Microsystems LEGO, the LEGO logo, MINDSTORMS, the Brick and

More information

Visually Augmented POMDP for Indoor Robot Navigation

Visually Augmented POMDP for Indoor Robot Navigation Visually Augmented POMDP for Indoor obot Navigation LÓPEZ M.E., BAEA., BEGASA L.M., ESCUDEO M.S. Electronics Department University of Alcalá Campus Universitario. 28871 Alcalá de Henares (Madrid) SPAIN

More information

A Reinforcement Learning Approach to Automated GUI Robustness Testing

A Reinforcement Learning Approach to Automated GUI Robustness Testing A Reinforcement Learning Approach to Automated GUI Robustness Testing Sebastian Bauersfeld and Tanja E. J. Vos Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia, Spain {sbauersfeld,tvos}@pros.upv.es

More information

LEGO Mindstorm EV3 Robots

LEGO Mindstorm EV3 Robots LEGO Mindstorm EV3 Robots Jian-Jia Chen Informatik 12 TU Dortmund Germany LEGO Mindstorm EV3 Robot - 2 - LEGO Mindstorm EV3 Components - 3 - LEGO Mindstorm EV3 Components motor 4 input ports (1, 2, 3,

More information

Development of a System for Teaching. Robots

Development of a System for Teaching. Robots Development of a System for Teaching CS1 in C/C++ with Lego NXT Robots A. Delman, A, Ishak, M. Kunin, L. Goetz, Y. Langsam, T. Raphan Department of Computer and Information Science, Department of Computer

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Heuristic Search Value Iteration Trey Smith. Presenter: Guillermo Vázquez November 2007

Heuristic Search Value Iteration Trey Smith. Presenter: Guillermo Vázquez November 2007 Heuristic Search Value Iteration Trey Smith Presenter: Guillermo Vázquez November 2007 What is HSVI? Heuristic Search Value Iteration is an algorithm that approximates POMDP solutions. HSVI stores an upper

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS Summer Introduction to Artificial Intelligence Midterm You have approximately minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark your answers

More information

Three-Dimensional Off-Line Path Planning for Unmanned Aerial Vehicle Using Modified Particle Swarm Optimization

Three-Dimensional Off-Line Path Planning for Unmanned Aerial Vehicle Using Modified Particle Swarm Optimization Three-Dimensional Off-Line Path Planning for Unmanned Aerial Vehicle Using Modified Particle Swarm Optimization Lana Dalawr Jalal Abstract This paper addresses the problem of offline path planning for

More information

Path Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning

Path Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning 674 International Journal Jung-Jun of Control, Park, Automation, Ji-Hun Kim, and and Systems, Jae-Bok vol. Song 5, no. 6, pp. 674-680, December 2007 Path Planning for a Robot Manipulator based on Probabilistic

More information

EV3 Programming Workshop for FLL Coaches

EV3 Programming Workshop for FLL Coaches EV3 Programming Workshop for FLL Coaches Tony Ayad 2017 Outline This workshop is intended for FLL coaches who are interested in learning about Mindstorms EV3 programming language. Programming EV3 Controller

More information

Value Iteration. Reinforcement Learning: Introduction to Machine Learning. Matt Gormley Lecture 23 Apr. 10, 2019

Value Iteration. Reinforcement Learning: Introduction to Machine Learning. Matt Gormley Lecture 23 Apr. 10, 2019 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Reinforcement Learning: Value Iteration Matt Gormley Lecture 23 Apr. 10, 2019 1

More information

Probabilistic Planning for Behavior-Based Robots

Probabilistic Planning for Behavior-Based Robots Probabilistic Planning for Behavior-Based Robots Amin Atrash and Sven Koenig College of Computing Georgia Institute of Technology Atlanta, Georgia 30332-0280 {amin, skoenig}@cc.gatech.edu Abstract Partially

More information

Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms

Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms Karan M. Gupta Department of Computer Science Texas Tech University Lubbock, TX 7949-314 gupta@cs.ttu.edu Abstract This paper presents a

More information

Symbolic LAO* Search for Factored Markov Decision Processes

Symbolic LAO* Search for Factored Markov Decision Processes Symbolic LAO* Search for Factored Markov Decision Processes Zhengzhu Feng Computer Science Department University of Massachusetts Amherst MA 01003 Eric A. Hansen Computer Science Department Mississippi

More information

Robotics Project. Final Report. Computer Science University of Minnesota. December 17, 2007

Robotics Project. Final Report. Computer Science University of Minnesota. December 17, 2007 Robotics Project Final Report Computer Science 5551 University of Minnesota December 17, 2007 Peter Bailey, Matt Beckler, Thomas Bishop, and John Saxton Abstract: A solution of the parallel-parking problem

More information

College of Sciences. College of Sciences. Master s of Science in Computer Sciences Master s of Science in Biotechnology

College of Sciences. College of Sciences. Master s of Science in Computer Sciences Master s of Science in Biotechnology Master s of Science in Computer Sciences Master s of Science in Biotechnology Department of Computer Sciences 1. Introduction\Program Mission The Program mission is to prepare students to be fully abreast

More information

Name: UW CSE 473 Midterm, Fall 2014

Name: UW CSE 473 Midterm, Fall 2014 Instructions Please answer clearly and succinctly. If an explanation is requested, think carefully before writing. Points may be removed for rambling answers. If a question is unclear or ambiguous, feel

More information

An Approach to State Aggregation for POMDPs

An Approach to State Aggregation for POMDPs An Approach to State Aggregation for POMDPs Zhengzhu Feng Computer Science Department University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Eric A. Hansen Dept. of Computer Science and Engineering

More information

Artificial Intelligence (part 4a) Problem Solving Using Search: Structures and Strategies for State Space Search

Artificial Intelligence (part 4a) Problem Solving Using Search: Structures and Strategies for State Space Search Artificial Intelligence (part 4a) Problem Solving Using Search: Structures and Strategies for State Space Search Course Contents Again..Selected topics for our course. Covering all of AI is impossible!

More information

Using Classical Mechanism Concepts to Motivate Modern Mechanism Analysis and Synthesis Methods

Using Classical Mechanism Concepts to Motivate Modern Mechanism Analysis and Synthesis Methods Using Classical Mechanism Concepts to Motivate Modern Mechanism Analysis and Synthesis Methods Robert LeMaster, Ph.D. 1 Abstract This paper describes a methodology by which fundamental concepts in the

More information

PART I - Fundamentals of Parallel Computing

PART I - Fundamentals of Parallel Computing PART I - Fundamentals of Parallel Computing Objectives What is scientific computing? The need for more computing power The need for parallel computing and parallel programs 1 What is scientific computing?

More information

3 SOLVING PROBLEMS BY SEARCHING

3 SOLVING PROBLEMS BY SEARCHING 48 3 SOLVING PROBLEMS BY SEARCHING A goal-based agent aims at solving problems by performing actions that lead to desirable states Let us first consider the uninformed situation in which the agent is not

More information

Ant Colony Optimization and its Application to Adaptive Routing in Telecommunication Networks

Ant Colony Optimization and its Application to Adaptive Routing in Telecommunication Networks UNIVERSITÉ LIBRE DE BRUXELLES FACULTÉ DES SCIENCES APPLIQUÉES Ant Colony Optimization and its Application to Adaptive Routing in Telecommunication Networks Gianni Di Caro Dissertation présentée en vue

More information