Active Learning in Motor Control


Active Learning in Motor Control

Philipp Robbel

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2005

Abstract

In this dissertation we consider how the performance of a learning control system can be improved with an efficient exploration strategy. We apply principled approaches from active learning to realise task-specific exploration with the LWPR on-line learning scheme. Our algorithm is based on the confidence in the model predictions and directs exploration to areas of high uncertainty. To the best of our knowledge, this is the first clear strategy for active learning in low-level motor control scenarios. Using two simulations, we show that learning the inverse dynamics of a movement system can benefit from an active data selection strategy. Both simulations (a Newtonian particle and a compliant two-joint robot arm) also provide an intuitive real-time visualisation of the LWPR confidence bounds and the explored space. The suggested algorithm is shown to be superior to simpler exploration schemes such as random flailing of the robot arm.

Acknowledgements

During the course of this thesis, I have been supported by a great number of people, all of whom I wish to thank in the following: Professor Sethu Vijayakumar and Dr. Marc Toussaint acted as the advisors for this thesis, accompanying me throughout the whole time. Without them, this thesis would have been impossible. They helped me with all issues encountered, and their advice, comments, and knowledge were invaluable to me. Many of the ideas presented in this thesis grew out of our insightful meetings throughout the project. I am also thankful to a number of other friends and supporters: Giorgos Petkos for his continued knowledgeable and friendly advice, Frank Walter for our invaluable discussions, and Athanasios Noulas for such a great time in Edinburgh. Finally, I would like to thank my parents and my two brothers for their support. They have always stood by my side and never ceased to amaze me. I am thankful for the great time in Edinburgh!

Philipp Robbel

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Philipp Robbel)

Table of Contents

1 Introduction
  1.1 Active Learning in Control: Motivation
  1.2 Problem Statement
  1.3 Accomplishments
  1.4 Dissertation Outline
2 Learning for Control
  2.1 Robot Control Problems
    2.1.1 Fundamentals
    2.1.2 Trajectory following
    2.1.3 Other control problems
  2.2 Learning Control
    2.2.1 Locally Weighted Regression
    2.2.2 The Role of Exploration in Learning Control
3 Active Learning
  3.1 Active Learning Implementations
    3.1.1 Supervised Learning
    3.1.2 Reinforcement Learning
  3.2 Data Generation for Learning Control
4 Exploring Exploration
  4.1 Active Learning with LWPR
    4.1.1 The Shifting Setpoint Algorithm
    4.1.2 Our Exploration Algorithm: Description
    4.1.3 Our Exploration Algorithm: Details
  4.2 Simulation
    4.2.1 Particle Simulation
    4.2.2 Robot Arm Simulation
5 Experimental Results
  5.1 Particle Simulation
    5.1.1 Observations and Discussion
  5.2 Robot Arm Simulation
    5.2.1 Observations and Discussion
6 Conclusion
Bibliography

Chapter 1

Introduction

This dissertation is about learning to control high-dimensional movement systems. An example of such a system is a multiple-joint robot arm that has to be directed along a trajectory in its configuration space (see Figure 1.1). Following such a task-specific trajectory requires the controller to send appropriate sensorimotor commands to the individual arm joints at all times. Inappropriately chosen commands may result in (potentially dangerous) movements that do not resemble the planned trajectory and prevent completion of the task altogether. Control problems of this kind can be solved if the robot is able to adapt the motor commands based on the following factors:

- The physical properties of the arm, such as joint types, lengths and their mass
- The physical properties of the environment, such as gravity or other forces that arise from the task
- Other physical phenomena, such as the signal speed between controller and the joint motors

In short, stable, accurate and timely motion requires the controller to account for the dynamics of the robot arm and to know the sensorimotor laws that apply in a given context.

Figure 1.1: A multiple-joint robot arm and a desired path in space.

A traditional approach to problems of this kind has been to solve the forward or inverse dynamics of the movement system analytically, based on the equations of motion and the properties of the controlled body. The applicability of this method to real-time control of more complex systems is limited, however, mostly due to the computational cost required to solve these mathematical models. Other deficiencies of this method are that, in order to be tractable, simplifications are commonly introduced into the model, and physical factors such as friction and ageing effects are neglected (Moore, 1990). A relatively new development in the field of computational motor control has been to apply approaches from machine learning to the control problem. This is possible since the robot arm creates a multitude of data during its movements. While data about the joint positions in external world coordinates is not directly observable, the robot arm can record the same information in proximal space, i.e., joint angles for revolute and joint lengths for prismatic joints. Based on the known motor control signals and the observed joint accelerations, one can in principle obtain a large number of training examples from which generalisations about the dynamics of the controlled device can be made.

The success of this approach to controlling complex movement systems has been demonstrated in a large body of published work; the juggling robot of Schaal and Atkeson (1994) is a frequently cited example. It has been shown in other supervised and unsupervised learning scenarios that the generalisability and speed of a learning task can greatly benefit from an efficient data selection and presentation scheme. Although sensorimotor control is a distinct scenario, as motivated in the following sections, the expectation that similar increases in efficiency carry over into this field is an appealing one and forms the main foundation of this dissertation.

1.1 Active Learning in Control: Motivation

A feature that characterises human motor control has become known as compliance. It refers to the fact that our movements are not stiff but have some degree of elasticity as we operate under a number of different contexts. As our own example shows, compliance does not necessitate slow movements: in general, humans have no problem maintaining fast and accurate movements under changing loads on the arm, for example. In order to realise motions of similar quality in humanoid and other biomimetic systems, we resort to feedforward controllers. These controllers maintain an internal model of the device's inverse dynamics to predict appropriate motor commands before issuing them. A feedback controller may then correct for the remaining discrepancies between actual and predicted outcome of the command. The stored model of the inverse dynamics is strongly dependent on the current context, and it is desirable to quickly learn and adapt the model to changing external and internal conditions. An exemplary case is a humanoid robot arm that attempts to move objects of different mass along a fixed trajectory. Without adapting the internal model to the different dynamics, the arm may lose its desired properties of precision, timeliness or compliance. As described in the previous section, automatic learning systems require data to construct a model of the world. Intuitively, it is desirable to exert some degree of control over which data is collected where, i.e., how the world should be explored.

This factor is of particular importance if we wish to acquire accurate models in short periods of time, just as in the control problem outlined above. Many machine learning problems, however, assume a passive scenario in which the learner is merely the recipient of a stream of data over which it has no control. In the active learning paradigm, on the other hand, the learner is concerned with the question of which data point to query next and may therefore direct exploration to regions where the greatest information can be expected. Active learning has been demonstrated to be successful in reducing the amount of training data for tasks as diverse as simple concept learning (Cohn et al., 1994), parameter estimation in Bayesian networks (Tong and Koller, 2000), and supervised learning with support vector machines and multi-layer neural networks (Schohn and Cohn, 2000; Sollich, 1997). The motivation for designing robot controllers with similar capabilities for active learning is first and foremost a practical one. If one aims to construct autonomous and human-like robots that can adapt their behaviour to different contexts in real-time, a method to quickly revise the world model is necessary. We consider the purposeful direction of exploration via active learning to be one step in this direction. Without such methods in place, the high dimensionality and redundancy of movement systems of this kind may otherwise render learning infeasible altogether.

1.2 Problem Statement

The problem that this dissertation addresses is the implementation of an active learning scheme for motor control. This brings about some unique challenges. Firstly, the data is generated by the movement system itself during a number of exploratory movements. This order-sensitive scenario is different from an order-free one where we can query freely from the input distribution and obtain the corresponding predictions from an external oracle. Instead, exploration to desired places in space requires a sequence of control commands to be executed.

Secondly, we are trying to learn the dynamics of the system on-line. Since active learning selects queries to maximise knowledge gain, exploration may focus on regions where we do not yet control the system, i.e., those regions where our current model of the dynamics is not accurate enough. Thirdly, exploration to states that the system will never experience during operation is wasteful, and data support may only be required for the particular task at hand. This is of particular importance for high-dimensional movement systems, where learning the inverse dynamics has to occur in a task-specific fashion in order to be feasible. These factors make the automated selection of query points a nontrivial problem. Besides simpler exploration techniques like random drifting or oscillating motor signals, a major alternative is to manually guide the robot through parts of the space for purposes of data acquisition (kinesthetic demonstration). Ideally, however, the robot would be able to set task-specific exploratory goals itself and achieve controllability through continuous improvement of its world model in these regions. The approach to active learning that we implement is theoretically grounded in the concept of statistical confidence. We exploit the current world model as long as our confidence in it is sufficiently large and revert to exploration in the opposite case. The learner continuously attempts to extend its region of familiarity in the space that a particular task dictates. This is analogous to a biological system that attempts to control its immediate surroundings before shifting its focus of exploration to more distant points in space.

1.3 Accomplishments

In this thesis we provide a review of active data selection schemes that have appeared in the literature on supervised and reinforcement learning. It is shown that many of these are not directly transferable to the motor control field as they neglect the order-sensitive nature of data generation in control. We then present and implement an active data selection method for the LWPR algorithm, which

has been shown to be suitable for on-line learning in high-dimensional spaces. Our algorithm is based on the confidence in the LWPR model predictions and directs exploration to areas of high uncertainty. To the best of our knowledge, this is the first clear strategy for active learning in on-line low-level motor control scenarios. Using two simulations developed for this dissertation, we show that learning the inverse dynamics of a movement system can benefit from an active data selection strategy. Both simulations (a Newtonian particle and a compliant two-joint robot arm) also provide an intuitive real-time visualisation of the LWPR confidence bounds and the explored space. The suggested algorithm outperforms simpler exploration schemes such as random flailing of the robot arm.

1.4 Dissertation Outline

In Chapter 2, we introduce the fundamental concepts of motor control and how machine learning techniques may find application in this field. Included is a presentation of the inverse dynamics problem, a discussion of common robot control tasks, and a description of the locally weighted projection regression (LWPR) algorithm. In Chapter 3, we describe active learning methods that have appeared in the literature. This comprises schemes in supervised and unsupervised learning, as well as analogous concepts from the field of reinforcement learning. The particularities of data generation for learning control are also outlined here. In Chapter 4, we join both topics and develop an active data selection scheme for learning in motor control. Next to the description of our algorithm, we also include an overview of the experimental simulations that were developed as part of this dissertation. In Chapter 5, we present the experimental results of our algorithm in two simulated scenarios. The first is a simulated Newtonian particle and the second is a more sophisticated planar two-joint robot arm. The results encourage the use of active data selection in control learning problems.

Chapter 2

Learning for Control

This chapter introduces some of the fundamental control problems in robotics and how machine learning algorithms can be applied to them. While issues such as perception and actuation certainly play a part in control problems like localisation, we narrow our review to the more traditional understanding of control. This comprises subjects such as kinematics and dynamics and also includes a brief outline of open-loop, feedforward, and feedback control. In the course of the chapter it is demonstrated that modelling of the kinematics or dynamics can be understood as a regression problem to which machine learning algorithms can be applied. The special nature of real-time control for high-dimensional movement systems introduces a set of requirements that these learning schemes have to fulfil. One of the algorithms that has been shown to be particularly well-suited is locally weighted projection regression (LWPR), which forms the basis for our examination of the exploration problem in the subsequent chapters.

2.1 Robot Control Problems

2.1.1 Fundamentals

Common in robot control are two different frames of reference. The first is defined by the task space that the robot is operating in, which could correspond to a two-dimensional plane for a planar robot arm or a three-dimensional working space for a more sophisticated assembly-line robot. Positions in this space are frequently described with Cartesian coordinates and are referred to as world coordinates or as coordinates in distal space. The second reference frame is described by the internal parameters of the robot, such as joint orientations, and is measured in joint angles or lengths. As this space is solely defined by the joint variables, it is also referred to as proximal space. It is a common scenario for a robot controller to convert between coordinates in distal and proximal space. This is due to the fact that a trajectory may first be specified in world coordinates before being translated into the concrete joint positions and orientations that the robot has to achieve. Depending on the direction of the conversion, we can distinguish between:

- The kinematics problem: the conversion of coordinates in proximal to distal space (forward model)
- The inverse kinematics problem: the conversion of coordinates in distal to proximal space (inverse model)

Both operations rely on our knowledge of the exact robot design, i.e., joint types and link lengths, and commonly require solving a set of homogeneous transformation equations when computed analytically. A closed-form solution to the inverse problem cannot be computed for all manipulator types, so manufacturers commonly choose the robot design to guarantee that inverse solutions can be derived. For redundant manipulator types such as the human arm, which can arrive at the same position with more than one joint configuration, further constraints may have to be introduced during the inverse model computation (McKerrow, 1991).
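To make the proximal-to-distal conversion concrete, the following is a minimal sketch of the forward kinematics of a planar two-joint arm, mapping joint angles (proximal space) to a Cartesian end-effector position (distal space). The link lengths are illustrative assumptions, not parameters taken from this thesis.

```python
import numpy as np

def forward_kinematics(q, l1=1.0, l2=1.0):
    """Proximal-to-distal conversion (forward model) for a planar two-joint
    arm: joint angles q = (q1, q2) in radians to the Cartesian end-effector
    position. Link lengths l1, l2 are illustrative values."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

# Example query: shoulder at 45 degrees, elbow at 90 degrees.
print(forward_kinematics(np.array([np.pi / 4, np.pi / 2])))
```

Even for this simple arm the inverse direction is ambiguous (elbow-up and elbow-down configurations reach the same point), which illustrates why additional constraints may be needed for redundant manipulators.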

The fields of statics and dynamics complement the study of position in robotics. They stand for the analysis of objects under the influence of forces and build upon Newton's three laws of motion. For our purposes, we will consider the second field only and, similarly to the case above, distinguish between two formulations:

- The dynamics problem: the calculation of the robot motion given a set of joint torques
- The inverse dynamics problem: the calculation of the joint torques required to achieve desired joint positions, velocities and accelerations

Even though efficient recursive formulations have been developed, it is still an unsolved problem how the dynamics equations of high-dimensional movement systems can be computed quickly enough for real-time control. Even for industrial robots the analytical solution can quickly result in complex expressions and may lead the modeller or the robot manufacturer to introduce simplifications into the model or the robot design.

2.1.2 Trajectory following

Traversing a trajectory is a fundamental task in robotics and requires the controller to realise a number of desired joint accelerations at particular points in time. Along the lines of Moore (1990) we can formally introduce this task, beginning with the definition of a system's state below.

Definition. The state of a robotic manipulator is defined by the vectors q and q̇, referring to the current joint positions and velocities, respectively. The joint acceleration vector q̈ is found as the time derivative of the velocity vector q̇.

The trajectory can then be given in two equivalent formulations, the first listing a number of desired states at given times and the second recording the desired state changes. The dependence on time is not made explicit in the following definition.

Definition. A trajectory is a temporal state sequence ((q_0, q̇_0), (q_1, q̇_1), ...) that corresponds to the joint accelerations (q̈_0, q̈_1, ...), where q̈_i = (q̇_{i+1} − q̇_i)/h. The sampling frequency follows from the time step h as 1/h.

Generating such a state sequence is a planning problem in robotics.
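As a hedged illustration of the definition above, the sketch below discretises a continuous joint trajectory into the state sequence ((q_0, q̇_0), (q_1, q̇_1), ...), with velocities and accelerations obtained by finite differences at sampling frequency 1/h. The sinusoidal test signal is an arbitrary choice, not a trajectory used in the thesis.

```python
import numpy as np

def sample_trajectory(q_of_t, h=0.01, duration=1.0):
    """Discretise q(t) into positions, velocities and accelerations,
    with qdd_i = (qd_{i+1} - qd_i) / h as in the definition above."""
    t = np.arange(0.0, duration, h)
    q = np.array([q_of_t(ti) for ti in t])   # desired joint positions q_i
    qd = np.diff(q, axis=0) / h              # velocities qd_i
    qdd = np.diff(qd, axis=0) / h            # accelerations qdd_i
    return q, qd, qdd

# A repeating sinusoidal pattern for a single joint.
q, qd, qdd = sample_trajectory(lambda t: np.array([0.5 * np.sin(2 * np.pi * t)]))
```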

Depending on the requirements of the task, we may select the desired joint angles manually, use a sinusoidal signal as the source for a repeating pattern, or compute the optimal trajectory based on an optimisation criterion. Suitable criteria for optimisation are, for example, to minimise the time to reach the destination point or to enforce minimum joint acceleration changes throughout the robot movement. The applicability of these criteria is not only determined by the task at hand but also depends on the mechanical constraints that the manipulator imposes. Once a trajectory has been selected, a controller can implement different strategies to track it. We can make a basic distinction between feedforward and feedback control, depending on whether or not an analytical or learned inverse model of the system is utilised to control the plant. The typical outline of a feedback controller is given in Figure 2.1. Here, the controller receives a desired output and compares it to the feedback signal obtained from the sensors. In our case of trajectory following, the controller may receive the desired joint position vector q_i at the ith step of the trajectory and compare it to the observed current positions q. The control signal to the plant is then computed as a function of the error between desired and actual states. It is commonly generated as the sum of any of the following three terms. First, we can calculate the term K_p(q_i − q), which is directly proportional to the position error on the trajectory. This proportional control action multiplies the error with the gain constant K_p and drives the controller output towards or back to the desired level q_i. Second, we may choose to include the derivative control term K_d d(q_i − q)/dt, which is the product of the current velocity error on the trajectory, q̇_i − q̇, and a gain constant K_d. It has the effect of achieving stability by damping the control signal over time. The third term is an integral control term, given as K_i ∫(q_i − q) dt. It accumulates the position error over time and can be used to enforce convergence to the desired output after the system has settled into steady-state behaviour. A PID controller, for example, employs all three terms and produces the torque

τ = K_p(q_i − q) + K_i ∫(q_i − q) dt + K_d(q̇_i − q̇)

at the ith trajectory step.
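A minimal sketch of the PID law above follows, with the integral term accumulated over discrete time steps of length h; the gain values in the usage line are placeholders, not tuned constants from this thesis.

```python
class PID:
    """tau = Kp*(q_i - q) + Ki*integral(q_i - q)dt + Kd*(qd_i - qd),
    with the integral approximated by a running sum."""
    def __init__(self, kp, ki, kd, h):
        self.kp, self.ki, self.kd, self.h = kp, ki, kd, h
        self.integral = 0.0

    def output(self, q_des, q, qd_des, qd):
        err = q_des - q
        self.integral += err * self.h       # accumulated position error
        return (self.kp * err               # proportional action
                + self.ki * self.integral   # integral action
                + self.kd * (qd_des - qd))  # derivative (damping) action

pid = PID(kp=50.0, ki=1.0, kd=5.0, h=0.01)
tau = pid.output(q_des=0.3, q=0.1, qd_des=0.0, qd=0.02)
```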

Contrary to the closed-loop system above, an open-loop feedforward controller (Figure 2.2) does not utilise a sensor feedback signal for its control decisions. Instead, it maintains an inverse model of the system dynamics to produce the desired states on the trajectory. Given the desired joint accelerations q̈_i, it may precompute the corresponding torques τ_i and play back the generated commands during trajectory following. It is intuitively clear that tracking via open-loop feedforward control will result in large deviations from the trajectory if the model of the inverse dynamics is incorrect. On the other hand, feedback control can suffer from delayed or noisy sensor feedback and produce a ringing signal that continuously varies above and below the desired output. Also, since feedback control does not maintain any model at all, fast movements require large gains to quickly cancel out any errors between desired and actual states. This is contrary to our low-stiffness requirement introduced in the previous chapter. A viable alternative is therefore to join feedforward and feedback controllers into a composite controller type. The design of a composite controller is shown in Figure 2.3. It is the straightforward combination of the previous two controller types, where the feedback controller only has to account for the errors between predicted and actual states. The presence of the inverse model allows two modes of operation (Moore, 1990). Similarly to the open-loop control type, we may choose to precompute the desired torques τ_i once. A combination with a PID feedback controller would then produce the output

τ_out = τ_i + K_p(q_i − q) + K_i ∫(q_i − q) dt + K_d(q̇_i − q̇)

at the ith trajectory step. Alternatively, we could compute the acceleration q̈_out to achieve the desired acceleration at the ith step of the trajectory as

q̈_out = q̈_i + K_p(q_i − q) + K_i ∫(q_i − q) dt + K_d(q̇_i − q̇)

and continuously query the inverse model for the corresponding torques. This is known as computed torque control. While the former can account for errors in the inverse model, it commonly requires tweaking of the gain constants to achieve stable behaviour with low rise times. The latter method, on the other hand, suffers from higher computational cost.
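The two composite modes can be sketched as follows, reusing the PID class from the previous example; `inverse_model` is a stand-in for the learned or analytical inverse dynamics g, not a concrete API.

```python
def composite_precomputed(tau_ff, pid, q_des, q, qd_des, qd):
    """First mode: precomputed feedforward torque tau_i plus PID
    correction, tau_out = tau_i + Kp(q_i - q) + Ki*integral + Kd(qd_i - qd)."""
    return tau_ff + pid.output(q_des, q, qd_des, qd)

def computed_torque(inverse_model, pid, q, qd, q_des, qd_des, qdd_des):
    """Second mode (computed torque control): correct the desired
    acceleration with the PID terms, then query the inverse model for
    the matching torque at every step."""
    qdd_out = qdd_des + pid.output(q_des, q, qd_des, qd)
    return inverse_model(q, qd, qdd_out)
```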

Figure 2.1: Operation of the feedback controller.

Figure 2.2: Operation of the open-loop feedforward controller.

Figure 2.3: Operation of the composite controller.

2.1.3 Other control problems

Other problems in control can frequently be reduced to the basic cases presented above. An exemplary higher-level task is catching a moving object. Mapping the object position in world space to joint angles requires solving the inverse kinematics problem. A trajectory planner may then be used to obtain a temporal sequence of desired joint positions and velocities to approach the object. Lastly, we could use a model of the robot arm's inverse dynamics to predict the torques necessary to achieve the trajectory and then track it with a composite controller. At an even lower level, the joint motors may be controlled with a simple feedback loop to achieve the desired torques transmitted by the composite controller. For this reason we focus on the basic control problems in this dissertation. In the following section we present the rationale for using machine learning algorithms to solve the inverse dynamics problem.

2.2 Learning Control

We have previously mentioned that the analytical solution of the inverse dynamics of a non-trivial movement system quickly results in complex expressions. Added to this comes the fact that, even after a mathematical model has been constructed, the high computational cost at look-up time makes it unsuitable for real-time use. Because the movement system makes a lot of data available during normal operation, machine learning algorithms can be used as an alternative to the analytical modelling approach. At each point in time the joint variables, such as joint angles or joint lengths, are in principle available to the controller. In practice, we may choose to query the system for position information at a fixed sampling frequency and compute joint velocities and accelerations as changes in position and velocity, respectively. Based on this data we can attempt to approximate the true inverse dynamics function τ = g(q, q̇, q̈), a mapping that predicts the motor torque required to achieve the desired acceleration q̈ when the system is in state (q, q̇).
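A sketch of this data collection step is given below, assuming a log of joint positions (shape (n, n_joints)) sampled at frequency 1/h together with the torques applied at each step; the alignment convention is one plausible choice, not prescribed by the thesis.

```python
import numpy as np

def inverse_dynamics_data(positions, torques, h):
    """Build supervised training pairs (q, qd, qdd) -> tau for approximating
    tau = g(q, qd, qdd), with velocities and accelerations computed as
    finite differences of the sampled positions."""
    q = np.asarray(positions)                 # shape (n, n_joints)
    qd = np.diff(q, axis=0) / h               # shape (n-1, n_joints)
    qdd = np.diff(qd, axis=0) / h             # shape (n-2, n_joints)
    inputs = np.hstack([q[:-2], qd[:-1], qdd])
    targets = np.asarray(torques)[:len(qdd)]  # torque applied in each state
    return inputs, targets
```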

In principle, there exists a variety of machine learning algorithms that are applicable to regression problems of this kind. Our domain of real-time robot control demands two properties of the algorithm, namely fast learning rates and high look-up speeds at run-time. This has to hold even when applied to high-dimensional problems if more complex systems, such as a humanoid robot arm, are to be controlled. Algorithms that have been applied include k-nearest neighbour (Moore, 1990), memory-based neural networks (Atkeson and Schaal, 1995), and different variations of memory-based or incremental locally weighted regression (for an overview, see e.g. Vijayakumar et al. (2005), Schaal and Atkeson (1998) and Atkeson et al. (1997a)). In this dissertation we focus on locally weighted projection regression (LWPR), which has been demonstrated to be well-suited for learning in high dimensions. This learning system exploits the fact that globally non-linear and high-dimensional functions can frequently be approximated by locally linear patches of reduced dimensionality. In addition, it has the desirable property of supporting on-line learning, so that the learned model can be adapted to changes in dynamics in real-time.

2.2.1 Locally Weighted Regression

Locally weighted regression (LWR) is a method from nonparametric statistics that performs a number of linear regressions in local regions of the input space. Different from parametric statistics, there is no a priori commitment to a particular model class and complexity. Instead, the complexity of the regression function varies throughout the learning process as new local models are introduced. LWR shares this least-commitment approach with other lazy learning methods such as k-nearest neighbour. For each local model, a region of validity or receptive field has to be selected that determines the data points taking part in the local regression (see Figure 2.4). In the incremental formulation of LWR, receptive fields are centred on new data points that are not yet covered by any of the other fields. The number of points associated with this local model is then defined by a distance metric, which is commonly parameterised as a Gaussian kernel, together with a cut-off threshold

to enforce a minimum kernel support (Schaal et al., 2002).

Figure 2.4: A linear model and its kernel approximating a local patch of the original one-dimensional function. Adapted from (Schaal and Atkeson, 1998).

This distance metric also determines the weight of each point included in the local regression. For the case of a Gaussian receptive field centred at c_k with distance metric D_k, we associate the following weight with a query point x:

w_k = exp(−(1/2)(x − c_k)^T D_k (x − c_k))    (2.1)

In the traditional batch formulation of linear regression, the introduction of such a weighting scheme is analogous to replacing the least-squares parameter estimate, (X^T X)^{−1} X^T Y, with the expression (X^T W X)^{−1} X^T W Y, which includes the diagonal weighting matrix W. To enable real-time learning, incremental formulations for the local parameter updates have been developed. Together with methods for automated distance metric adaptation, these are presented e.g. in (Vijayakumar et al., 2005). In comparison to other global learning schemes, the localised nature of LWR has the positive effect that it is not prone to negative interference: the addition of new local models does not affect the existing ones, even if they include data points already covered by other models (Schaal and Atkeson, 1998). Collaboration of any kind between the local models takes place only at look-up time, when the K individual model predictions ŷ_k are combined in a normalised weighted sum:

ŷ = Σ_k w_k ŷ_k / Σ_k w_k    (2.2)
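Equations (2.1) and (2.2) translate directly into code. The sketch below assumes the local model parameters (centres c_k, distance metrics D_k, local regression coefficients) have already been learned; it shows only the prediction step, not the actual LWPR implementation.

```python
import numpy as np

def lwr_predict(x, centers, metrics, betas, w_cutoff=1e-3):
    """Gaussian receptive-field weights (eq. 2.1) and the normalised
    weighted sum of local linear predictions (eq. 2.2)."""
    num, den = 0.0, 0.0
    for c_k, D_k, beta_k in zip(centers, metrics, betas):
        diff = x - c_k
        w_k = np.exp(-0.5 * diff @ D_k @ diff)   # eq. (2.1)
        if w_k < w_cutoff:                       # minimum kernel support
            continue
        y_k = beta_k[0] + beta_k[1:] @ diff      # local linear model
        num += w_k * y_k
        den += w_k
    return num / den if den > 0.0 else 0.0       # eq. (2.2)
```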

Locally weighted projection regression (LWPR) extends the presented LWR scheme by including mechanisms for learning in high-dimensional spaces. Each local model employs incremental partial least squares (PLS) as its regression method, which performs local dimensionality reduction. Unlike the traditional formulation of principal component analysis, for example, PLS takes the output dimension into account by selecting projection directions based on the input-output correlation. It has been demonstrated that this operation in joint input-output space can outperform other methods that select the projection direction based solely on the inputs. This is particularly true in the presence of input dimensions that are irrelevant for the regression (Vijayakumar et al., 2005).

2.2.2 The Role of Exploration in Learning Control

Learning the inverse dynamics of a complex movement system (an example is a humanoid robot with 30 degrees of freedom) requires the approximation of a high-dimensional function. In the case of the humanoid robot above, the learning problem consists of modelling the continuous inverse dynamics function g: R^90 → R^30, mapping each joint's position, velocity and desired acceleration to a corresponding joint motor torque. It is intuitively clear that the number of training examples required to accurately model such a huge space in its entirety is too high to be feasible. Even if the data is distributed over locally lower-dimensional manifolds, as is the case for data recorded from human movement for example, this space generally remains intractably large for complete exploration. A possible way out of this curse of dimensionality is to introduce task-specific learning and to focus exploration on the subspace where a particular task requires control. The question of how exploratory movements of the manipulator should be directed in order to support the task-specific collection of data is therefore of great interest for high-dimensional problems of this kind. As simple approaches such as random drifting or oscillating motor signals are both inefficient and infeasible in

higher-dimensional problems, a common approach is to rely on a (human) teacher to guide the robot to places where data support needs to be acquired. Human guidance and imitation learning have been shown to speed up motor learning by reducing the state space that the robot needs to explore to areas around the demonstrated trajectory (Schaal, 1999). In the literature, different approaches have appeared under the notion of active learning that automate exploration towards regions where the prediction or model uncertainty is maximally reduced. Since learning the dynamics of a movement system could benefit from similar directed exploration strategies, we give an overview of the field in the following chapter.

Chapter 3

Active Learning

Supervised learning problems in motor control rely on a set of labelled training data which is generated by the movement system during a number of exploratory movements. As in problems of a similar nature, the quality of the function approximation, as well as the time required to learn it, is to a large extent determined by the training data that was used to construct it. To obtain a reliable approximation, one is generally interested in representative data samples that cover the task-specific state space of the system. Furthermore, an efficient order of presentation of the data to the learner can lead to faster convergence of the result. This is due to the fact that randomly chosen training examples may fall into the space in which the learner has already gained sufficient familiarity and where the learner's generalisability is unlikely to increase. This problem worsens as learning proceeds and the average amount of novel information per query decreases (Hasenjäger, 2000). Methods to optimise data selection for most efficient learning have appeared in the literature as particular exploration strategies in model-based reinforcement learning or under the notion of active data selection in supervised and unsupervised learning. In the following, we present a selection of the work in this field and comment on its applicability to our domain of generating exploratory movements for sensorimotor systems.

3.1 Active Learning Implementations

Approaches to active data selection can be based on heuristics or on the optimisation of an objective function (Vijayakumar and Ogawa, 1999). The first group works with learner-specific expressions of uncertainty and directs exploration to regions of highest uncertainty. For a supervised classification problem this would be equivalent to selecting samples in input space that are close to the current decision boundary. A perceptron learning task, for example, could benefit from selecting those candidate training examples whose predictions lie around 0.5 in the [0, 1] output interval (Hasenjäger, 2000). An analogous heuristic for sampling close to the dividing hyperplane of a support vector machine is derived in (Schohn and Cohn, 2000). Function optimisation formulates uncertainty as an objective function. One can derive objective functions for both prediction uncertainty and model parameter uncertainty. In the Bayesian formulation of learning, for example, the uncertainty about the model parameters could be based on the entropy of the posterior distribution of the model's parameter vector (MacKay, 1992). Using entropy- or variance-based objective functions is closely related to the D-optimality criterion in optimal experimental design (Fedorov, 1972). Minimising exploration in reinforcement learning can also be understood as an active learning problem. One generally distinguishes between directed and undirected exploration methods in this context. While the former use knowledge about the learner (e.g., a model of its prediction confidence) to direct exploration, the latter rely on a randomised switching mechanism between exploitation and exploration, such as the Boltzmann exploration rule (Thrun, 1992a). In the following, we distinguish between exploration in supervised and reinforcement learning and introduce data selection methods for each in more detail.

3.1.1 Supervised Learning

Cohn et al. (1994) present a method which selects queries that result in the

smallest average expected variance of the learner. The authors demonstrate this approach with neural networks, mixtures of Gaussians and locally weighted regression. In the following outline of the query algorithm, we denote the learner's output given a query x and training data D as ŷ(x; D) and the true value as y(x). The general idea is to query at a point x̃ (and, therefore, to add (x̃, y(x̃)) to the training set D) if and only if inclusion of this point in the training set reduces the variance of the learner's output the most. Choosing the variance as a minimisation criterion is arbitrary; just as the learner's bias, it represents an approximation to the true error. Later work by the same author selects points so as to reduce the bias (Cohn, 1996). Given the learner's output ŷ at a point x, we obtain an expression for the output variance at that point as usual:

σ²_ŷ(x) = E_D[(ŷ(x; D) − E_D[ŷ(x; D)])²]    (3.1)

E_D is the expectation with respect to all possible training sets D of fixed size (in theory). With inclusion of a newly resolved training example (x̃, y(x̃)) in D, we can expect the learner's output variance at points x to change. As stated informally above, we seek the query point x̃ (without yet knowing the true query outcome y(x̃)) that minimises the expected value of σ²_ŷ(x) when integrated over all x. The following procedure is suggested:

1. Assume we have an estimate of σ²_ŷ(x) before the query.

2. Assume we can estimate the outcome of a query at x̃ by approximating the conditional distribution P(y(x̃) | x̃) (the exact procedure for estimating this distribution is specific to the learner).

3. Obtain the expected learner output variance at points x when choosing to query at x̃, i.e.,

   E_D[σ²_ŷ(x) | x̃] = ∫ σ²_ŷ(x) P(ỹ | x̃) dỹ    (3.2)

   where ỹ denotes y(x̃), the true query outcome at x̃.

4. Integrate the newly obtained expected variance over all x (e.g. via Monte Carlo approximation with 64 reference points drawn from the input distribution).

5. Repeat for different candidates x̃ and select the one resulting in minimal expected variance (on a Sparc 10, 64 candidate x̃ evaluations according to the procedure above take about 0.3 seconds for locally weighted regression).

In order for this method to match the true generalisation error, the learner has to be bias-free. Despite the fact that this is generally not true, the authors demonstrate for the case of a simulated 2-degree-of-freedom robot arm that the number of experiments needed to learn the device's kinematics can be reduced significantly compared to random sampling. Further improvements are reported for the case of all-bias query selection (Cohn, 1996). For the case of neural networks, Vijayakumar et al. (1998) describe a scheme that reduces both bias and variance in a two-stage process. MacKay (1992) describes an approach for active learning with neural networks that is grounded in Bayesian statistics. Active learning is guided by the amount of information that a query conveys about the model parameters θ. Expected information is measured by the change in entropy between prior and posterior distributions of θ. In the course of the paper it is discussed that maximising information over the whole input space may result in the undesired effect that data points are continuously sampled at the edges of the space, where the expected gain in information is largest. An objective function is suggested that restricts the information gain criterion to particular regions of interest. Similar to the case of (Cohn et al., 1994), there exists an implicit unbiasedness assumption, i.e., that the chosen model and probability distributions are correct. Both methods also rely on the search over and comparison of candidate queries x̃, which may be a computationally demanding task in high-dimensional learning problems in itself. Sollich (1997) describes a practical implementation of an active learning scheme for binary classification problems with neural networks. Here, the query by committee algorithm is used to encode expected information gain as the amount of disagreement between a set of learned neural networks. Given a set of training data, the committee is initialised with 2k networks that agree with the given data. Maximum information gain is achieved for successive training samples on which k networks predict the first class and the remaining k models the other class.
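A compact sketch of the committee idea: among a pool of unlabelled candidates, query the one on which the committee splits most evenly. The binary `predict` interface returning values in {0, 1} is an assumption for illustration, not Sollich's actual implementation.

```python
def query_by_committee(candidates, committee):
    """Select the candidate with maximal committee disagreement; for a
    committee of 2k members, disagreement peaks at a k-vs-k split."""
    def disagreement(x):
        votes = sum(member.predict(x) for member in committee)
        return min(votes, len(committee) - votes)
    return max(candidates, key=disagreement)
```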

3.1.2 Reinforcement Learning

Reinforcement learning differs from supervised learning in that it learns, through the optimisation of a scalar reward function, a task-specific policy for an agent that interacts with an environment, commonly represented by a Markov decision process. A formulation of the task of learning control along a desired trajectory as a reinforcement learning problem is as follows: the state matches our earlier definition and is given by the continuous position, velocity, and desired acceleration vectors. The action space is also continuous and corresponds to the torque vectors that can be applied to the robot. The reward is given by how closely the manipulator movements resemble the desired trajectory. Throughout learning, we maintain a (LWPR) world model that is trained from experience in a supervised fashion as before. We can now choose to follow the LWPR predictions greedily or to explore random or randomly perturbed actions in order to improve both world model and policy. While executing the trajectory-following trials (episodes), we may want to reduce exploration over time as the model becomes more accurate. This formulation is merely theoretical, as research into continuous state and action spaces in reinforcement learning is still ongoing (Smith, 2002). Nonetheless, it illustrates the point that our problem of learning task-specific control could benefit from exploration methods in reinforcement learning. Thrun (1992b) distinguishes between directed and undirected exploration methods. Undirected methods perform randomised action selection, frequently relying on the current world model to favour actions for which a higher reward is predicted. A possible analogy in our domain of motor control would be to maintain a LWPR forward model and evaluate a number of candidate motor torques for their closeness to the desired trajectory. A random action selector would then favour motor torques with closer outcomes over the other candidates.
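Such an undirected selector might look as follows: candidate torques are scored by the predicted closeness of their one-step outcome to the desired state and sampled via a Boltzmann distribution. `forward_model` is an assumed one-step predictor (e.g. a learned LWPR forward model), not a fixed API.

```python
import numpy as np

def boltzmann_torque(candidates, forward_model, state, x_desired, temp=1.0):
    """Sample a candidate torque with probability increasing in the
    predicted closeness of its outcome to the desired state."""
    scores = np.array([-np.linalg.norm(forward_model(state, tau) - x_desired)
                       for tau in candidates])
    p = np.exp((scores - scores.max()) / temp)   # stabilised softmax
    p /= p.sum()
    return candidates[np.random.choice(len(candidates), p=p)]
```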

Directed exploration employs heuristics or other principled approaches to maximise the expected knowledge gain with each query. For finite domains, more selection weight could be given to those actions that have been selected less frequently or, in the case of changing environments, less recently (Wiering, 1999). In our continuous context, another heuristic based on the prediction error is more practical: Moore (1990) describes a way to approximate the prediction confidence for the case of the k-nearest neighbour learning algorithm. Exploration is then directed to regions of lowest learner confidence. Similarly, Schmidhuber (1991) and Thrun and Möller (1991) encode incompetence about the environment in a neural network. This network is operated in parallel to the world model, predicts the accuracy of the world model, and is used to direct exploration to regions of low predicted accuracy.

3.2 Data Generation for Learning Control

We have seen that the attempt to maximise knowledge gain through query selection is a recurring theme in active learning and directed exploration. Knowledge gain was equated with a reduction in uncertainty about either the parameters or the generalisation performance of the model. The following factors require consideration when we wish to develop an analogous scheme for learning the inverse dynamics of a high-dimensional movement system:

- The Bayesian formulation of active learning is rooted in parametric statistics and assumes that the chosen model class is correct. Nonparametric learners like LWPR, however, encode a large family of model classes, and the number of parameters grows as local models are created. Due to the lack of a Bayesian formulation of LWPR, it is not intuitively clear how methods from this field could find application in our problem domain. It is conceivable to use Bayesian data selection methods for each local linear model. As noted in (MacKay, 1992) for the example of a straight line, however, this may only confirm the natural result that querying at the largest possible x will result in maximum information gain about the linear model.

- Given a formulation of the output variance of a LWPR model, it is feasible to adapt the variance-minimisation method of Cohn et al. (1994) to our scenario while neglecting the bias altogether. Query candidates x̃ could be constrained to lie on the trajectory, and the expected output variance computed only along points x on the trajectory. The main problem in our scenario is the difficulty of implementation and the computational cost. The authors' reference implementation requires 0.3 seconds for 64 candidate x̃ evaluations in a two-dimensional toy problem (with Monte Carlo integral evaluation at another 64 reference points). The high dimensionality of our domain may make the search for good candidate solutions even more computationally demanding. A more viable solution for real-time control may be to use a simpler method but evaluate more candidates at the same time.

- A particular requirement in our case is that we want to be able to restrict learning of the target function to a desired input subspace. As detailed in (MacKay, 1992), the active learner may otherwise suggest points at the uncertain (albeit uninteresting) extreme edges of the input space. This is possible with an alternative objective function or by restricting the variance calculation in (Cohn et al., 1994) to points of the desired input space.

- A major complication lies in the fact that training data is generated by the movement system itself in our problem scenario. The statistically optimal query selection techniques above assume an oracle that can provide an immediate response to a selected candidate query x̃. For the inverse dynamics learning problem this is equivalent to providing the torque to achieve a desired acceleration from any given system state. This is unrealistic, as one of the prerequisites is being able to reach the query state, which might in turn require more exploration. Along the same lines, a lunar robot that is supposed to explore the far side of the moon has to arrive there first (Thrun, 1995). Hence, directed exploration methods from reinforcement learning may appear more suitable for our problem.

- Reinforcement learning achieves convergence to an optimal policy through

the execution of numerous trials. This may cause the manipulator to deviate far from the intended trajectory during early trials. In such a case we have to make sure that the controlled system cannot enter dangerous or otherwise unwanted states.

In summary, our active data selection scheme should be sufficiently fast for real-time operation, should support a nonparametric model like LWPR, and should be able to decide when more exploration is required to reach a desired query location. In the following chapter we detail our implementation of an active learning scheme for learning the inverse dynamics with LWPR. As described in the literature on directed exploration in reinforcement learning, we exploit the learner's knowledge of its own knowledge and use confidence bounds around the learned LWPR model to estimate when the controller should switch from exploitation of the current model to exploration. We also describe the software applications that were developed as part of this thesis to collect the experimental results described in a later chapter.

Chapter 4

Exploring Exploration

This chapter details our approach to exploring the exploration problem for learning control. We first give a description of the exploration scheme that we developed to address the requirements outlined in the previous chapter. We then go on to describe two testbed implementations that demonstrate real-time learning with LWPR and our exploration method. The first problem is to learn the inverse dynamics mapping g: R³ → R¹ for a simulated Newtonian particle on a non-linear surface. The second is the analogous problem for the case g: R⁶ → R², in order to move a simulated planar two-joint robot arm along a desired trajectory. As part of this dissertation, two OpenGL simulations were developed in the C++ programming language in order to visualise the effects of exploration in the two test scenarios above. Both have been designed as a framework to allow comparisons of different learner and exploration components in the respective scenarios.

4.1 Active Learning with LWPR

Inspired by the active learning approaches outlined in the previous chapter, we treat exploration as a necessity that should be minimised to achieve task-specific control. Concretely, we adapt methods for dealing with the exploitation-exploration trade-off in reinforcement learning to our problem scenario. Rather than relying on a separately trained competence network, however, we estimate

the learner's ignorance directly via the predictive variance of the current LWPR model. Based on this variance calculation, the generalisation error of the learner is approximated by the size of the confidence intervals around a query point, and a threshold value is used to decide where in input space the current LWPR model has lost its validity. The algorithm is based on the notion of setpoints, which denote fixed points in space around which data is aggregated in order to reduce the size of the confidence intervals. Switching to exploration based on estimated model uncertainty can be thought of as a deterministic controller with a cautionary term in dual control (Schneider, 1996). Our approach is directly inspired by the shifting setpoint algorithm in (Atkeson et al., 1997b), which is presented and contrasted against our approach in the following section.

4.1.1 The Shifting Setpoint Algorithm

The shifting setpoint algorithm (SSA) realises task-specific data collection for a resettable robot trained via memory-based locally weighted regression (LWR). The outline of the algorithm is as follows (Schaal and Atkeson, 1994):

1. Given the current state x_i = (x_i, ẋ_i) of the robot and a goal x_goal for that state, we execute and record the results of a single random control action and reset the robot afterwards to the initial position. These one-step random actions are repeated for a specified time.

2. We train a corresponding one-step LWR forward model to predict states x_{i+1} from the initial state x_i and the recorded torques τ.

3. The setpoint for the initial stage is defined to be the tuple (x_s, τ_s, x_{s+1}) that is predicted with the narrowest prediction interval among all other tuples (x_i, τ, x_{i+1}).

4. We construct and store an optimal LQ controller τ = K(x − x_s) + τ_s that returns the optimal command τ to get from an initial position x to the point

x_{s+1}. Using this controller, we execute further one-step actions to increase the data cloud around the setpoint.

5. We now shift the setpoint's output x_{s+1} towards the current goal x_goal via a gradient algorithm. At the same time, we update the LWR model and the LQ controller associated with the current setpoint. We terminate shifting as soon as the confidence in our predictions falls below a threshold.

6. As soon as this is the case, we execute further one-step actions with the current LQ controller to increase the data support around the termination point. This is iterated with step 5 until we get sufficiently close to the current goal state x_goal.

7. We then update the goal to the next desired goal state (e.g., on a trajectory) and start over with step 1.

Once we reach the overall goal, we have a number of LQ controllers that may be used to move from the first to the final goal state. Since the robot is reset after each random exploratory command, there is never more than one random action taking place in an area where only insufficient data has been collected. This reduces the risk of entering unsafe states and focuses exploration on close regions around the goal trajectory.

4.1.2 Our Exploration Algorithm: Description

Similarly to the case above, our exploration scheme is goal-directed and driven by uncertainty. As detailed in the previous chapter, goal-directed exploration and task-specific control learning are necessary for real-time sensorimotor learning of high-dimensional movement systems. This case is illustrated in (Schaal, 1999) for a humanoid robot with 30 degrees of freedom (DOFs) that expects a control signal for each DOF at every instant in time: even when the control signals are discretised into only three possibilities (such as forward, backward and none), there are still 3³⁰ possible signals in each state, a space whose complete exploration is infeasible.

Unlike SSA, however, we do not construct a series of LQ controllers but rely on the predictions from the single LWPR world model to drive the manipulator to the goal. This model is trained incrementally during movements: all data that is generated during exploratory or controlled motions is used to continuously revise the model. The incremental update rules for the LWPR learning algorithm can be found e.g. in (Vijayakumar et al., 2005). It is conceivable that data collection at every instant in time is not possible in a real-time robot implementation due to computational constraints. In such a case, a lower sampling frequency would be selected and data collection carried out at the supported rate. In order to avoid getting stuck with a partially learned inverse dynamics model, our controller has to decide when further exploration is required. As in the SSA case, our decisions are based on the learner's confidence in its own predictions, its knowledge of its own knowledge. Concretely:

- At every time step, we determine the prediction and the prediction confidence for the current query point x_q. If the confidence is above a certain threshold, we apply the resolved torques to all joints.

- Alternatively, we define a setpoint at x_{q−1}, i.e., the last point where we trusted our own predictions. If we are in exploration mode, we execute a number of directed exploratory actions around the current setpoint x_{q−1}.

For the exploratory actions there exist multiple possibilities. First, the number of actions can either be fixed or variable until the confidence rises above the required threshold. The former does not require constant monitoring of the confidence bounds but potentially collects more data points than the latter method. Second, we may distinguish the type of exploration that we perform. One option is to collect a random data cloud around the setpoint x_{q−1}. Similarly to SSA, we have to guarantee that these random actions do not cause too large manipulator deviations into unexplored space. Accordingly, we restrict the magnitude of each random action in order to cause only small displacements from the setpoint, and follow all random control actions with the attempt to move the manipulator back to the current setpoint.

For robots that cannot be reset directly, this can be achieved with PID control or with a simple controller type based solely on the closest local linear LWPR model in input space. We employ this exploration model in the simulation of the Newtonian particle. Rather than collecting a random data cloud around the setpoint, the availability of a PID controller allows another option to direct exploration. As discussed in Chapter 1, high-gain PID control is generally not suitable for use with humanoid robots, as it is directly averse to the compliance requirement for these systems. However, once we arrive at a setpoint where the low quality of the inverse model makes further accurate, compliant movements impossible, we can decide to use high-gain PID control to direct exploration along the rest of the trajectory. The rationale for this option is that we are continuously aware of our own incompetence: as soon as we reach a setpoint we may issue a warning that the following exploratory motions are not compliant and use a high-gain PID controller to complete the desired trajectory. Completing the trajectory has the advantage that data collection is focused more on the task-specific requirements: instead of a random cloud of points around the setpoint, we sample points on (or close to) the desired trajectory, which may result in faster convergence of the LWPR inverse dynamics approximation for this task. We employ this exploration model for the simulated planar two-joint robot arm. Both SSA and our exploration algorithm share a conservative approach that avoids exploration in irrelevant or potentially dangerous locations in space. In the following section we give more details of the algorithm, including a derivation of the LWPR confidence bounds.

4.1.3 Our Exploration Algorithm: Details

A single learning trial of our algorithm can be summarised by the following six steps; a structural sketch follows the list. Trials are repeated until the normalised mean squared error (nMSE) measure falls below a desired threshold.

Both SSA and our exploration algorithm share a conservative approach to exploration that avoids exploring irrelevant or potentially dangerous regions of the input space. In the following section we give more details of the algorithm, including a derivation of the LWPR confidence bounds.

4.1.3 Our Exploration Algorithm: Details

A single learning trial of our algorithm can be summarised by the following six steps; a code sketch of one trial is given after the list. Trials are repeated until the normalised mean squared error (nmse) measure falls below a desired threshold.

1. Given an initial state x_0 = (x_0, ẋ_0) and a final goal x_goal of the robot, generate a random data cloud of k points around x_0 to initialise the LWPR inverse dynamics model g.

2. Given the current state x_i of the robot and a desired acceleration ẍ_i, perform the model look-up τ = g(x_i, ẋ_i, ẍ_i).

3. Calculate the model prediction variance σ_pred and the confidence interval I_c = τ ± σ_pred.

4. If the interval I_c is smaller than a threshold θ, apply τ, update the LWPR model with the new experience of the outcome of τ's application, and continue with step 6.

5. Otherwise, set the current exploration setpoint s = x_{i-1}. Generate a random data cloud of fixed or variable size around s. Alternatively, complete the trajectory from x_i to x_goal with PID control. In both cases, update the LWPR model with the gathered experiences (x, ẋ, ẍ, τ).

6. As long as we have not reached x_goal, continue with step 2.
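The trial can be summarised in code as follows. This is a schematic sketch only: `planner.desired_accel`, `plant.at_goal` and the `predict_with_conf` call (returning a prediction and its one-standard-deviation confidence) are hypothetical interfaces standing in for the corresponding steps, and the threshold test is written for a scalar confidence.

    import numpy as np

    def run_trial(plant, lwpr_model, planner, x_goal, theta=2.0, k_init=100):
        """One learning trial (steps 1-6) of the exploration algorithm (sketch)."""
        explore_around_setpoint(plant, lwpr_model, plant.state(),
                                n_actions=k_init)                 # step 1: initial data cloud
        prev_state = plant.state()
        while not plant.at_goal(x_goal):                          # step 6: goal test
            x, xdot = plant.state()
            xddot_des = planner.desired_accel((x, xdot), x_goal)  # step 2: desired acceleration
            tau, conf = lwpr_model.predict_with_conf(
                np.hstack([x, xdot, xddot_des]))                  # step 3: tau and sigma_pred
            if conf < theta:                                      # step 4: I_c narrow enough
                obs = plant.step(tau)
                lwpr_model.update(np.hstack(obs), tau)
                prev_state = (x, xdot)                            # last trusted point
            else:                                                 # step 5: explore at setpoint
                explore_around_setpoint(plant, lwpr_model, prev_state)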

The derivation of the LWPR confidence bounds in step 3 is detailed in the following section. The presentation is along the lines of (Vijayakumar et al., 2005).

4.1.4 Derivation of the LWPR Confidence Bounds

For each local linear regression model k and query point x_q, we postulate the following data-generating process:

y_{q,k} = y_q + \epsilon_1 + \epsilon_{2,k}    (4.1)

This corresponds to the weighted linear regression model in (Gelman et al., 1995) with two separate noise processes \epsilon_1 \sim N(0, \sigma^2 / w_k) and \epsilon_{2,k} \sim N(0, \sigma^2_{pred,k} / w_k), and is depicted in Figure 4.1.

Figure 4.1: Heteroscedastic variances from two different sources for each local model. Adapted from (Vijayakumar, 2003).

We can show that this noise model is consistent with the formulation of the LWPR prediction ŷ in equation (2.2) on page 16. In order to do so, we use a heteroscedastic average to derive the mean prediction for a query x_q under our noise model. Model predictions are labelled with the hat (ˆ) symbol:

\hat{y}_q = \frac{\sum_k \frac{w_k}{\sigma^2 + \sigma^2_{pred,k}} \, y_{q,k}}{\sum_k \frac{w_k}{\sigma^2 + \sigma^2_{pred,k}}} \approx \frac{\sum_k w_k \hat{y}_{q,k}}{\sum_k w_k}    (4.2)

where the approximation assumes that (\sigma^2 + \sigma^2_{pred,k}) is approximately constant across models. The result is indeed equivalent to equation (2.2). We can now derive the LWPR predictive variance at the point x_q as follows:

\begin{aligned}
\sigma^2_{pred} &= E\{y_q^2\} - (E\{y_q\})^2 = E\Big\{\Big(\frac{\sum_k w_k y_{q,k}}{\sum_k w_k}\Big)^2\Big\} - (E\{y_q\})^2 \\
&= \frac{1}{(\sum_k w_k)^2} E\Big\{\Big(\sum_k w_k y_q\Big)^2 + \Big(\sum_k w_k \epsilon_1\Big)^2 + \Big(\sum_k w_k \epsilon_{2,k}\Big)^2\Big\} - \hat{y}_q^2 \\
&= \frac{1}{(\sum_k w_k)^2} E\Big\{\Big(\sum_k w_k \epsilon_1\Big)^2 + \Big(\sum_k w_k \epsilon_{2,k}\Big)^2\Big\}
\end{aligned}    (4.3)

Using the fact that E\{x^2\} = (E\{x\})^2 + \mathrm{var}(x) and that the noise processes \epsilon_1 and \epsilon_{2,k} have zero mean, we obtain:

\begin{aligned}
\sigma^2_{pred} &= \frac{1}{(\sum_k w_k)^2} \mathrm{var}\Big(\sum_k w_k \epsilon_1\Big) + \frac{1}{(\sum_k w_k)^2} \mathrm{var}\Big(\sum_k w_k \epsilon_{2,k}\Big) \\
&= \frac{1}{(\sum_k w_k)^2} \Big(\sum_k w_k^2 \frac{\sigma^2}{w_k} + \sum_k w_k^2 \frac{\sigma^2_{pred,k}}{w_k}\Big) = \frac{\sum_k w_k \sigma^2}{(\sum_k w_k)^2} + \frac{\sum_k w_k \sigma^2_{pred,k}}{(\sum_k w_k)^2}
\end{aligned}    (4.4)
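The final line of equation (4.4) translates directly into code. The sketch below assumes the activations w_k, the global variance estimate and the local predictive variances are already available from the model.

    import numpy as np

    def lwpr_predictive_variance(w, sigma2, sigma2_pred_k):
        """Predictive variance of the combined prediction, last line of eq. (4.4)."""
        w = np.asarray(w, dtype=float)
        W = w.sum()   # sum of activations over all local models
        return ((w * sigma2).sum() + (w * np.asarray(sigma2_pred_k)).sum()) / W**2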

We have now obtained an expression for the predictive variance of the LWPR model. It includes \sigma^2_{pred,k}, i.e., the predictive variance of each local model k, and \sigma^2, the global variance over all models. Vijayakumar et al. (2005) approximate the latter as \sigma^2 = \sum_k w_k (\hat{y}_q - \hat{y}_{k,q})^2 / \sum_k w_k and derive an incremental update rule for \sigma^2_{pred,k}; the exact expression can be found in the paper. The confidence interval that we employ in our implementation is based on one standard deviation and follows intuitively from equation (4.4) as:

I_c = \hat{y}_q \pm \sigma_{pred}    (4.5)

The implementations of the Newtonian particle and the planar two-joint robot arm use this formulation of the confidence around the LWPR model predictions to realise the exploration algorithm presented earlier in this chapter. In the following section, we give more details about both simulated scenarios and their implementation.

4.2 Simulation

4.2.1 Particle Simulation

The particle that we simulate has a pre-defined mass and sits on a non-linear surface with simulated viscous drag. The controller may exert a thrust in the horizontal direction to make the particle reach a designated goal position x_goal and goal velocity ẋ_goal. Accordingly, the simulated movement system has one degree of freedom and its dynamics are given by the differential equation

m\ddot{x} = -\frac{df}{dx} - c\dot{x} + u    (4.6)

where f denotes the surface function, m the particle mass, c a coefficient of drag and u the thrust (or torque) generated by the controller. Our implementation uses a simple Euler integrator to solve this differential equation at every instant of simulation time; a sketch of one integration step is given below.
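A single Euler step of equation (4.6) can be sketched as follows; the mass, drag coefficient and step size are illustrative values rather than our configuration, and `f_prime` is the derivative of the surface function.

    def euler_step(x, xdot, u, f_prime, m=1.0, c=0.1, dt=0.01):
        """Advance the particle by one Euler step of m*xddot = -df/dx - c*xdot + u."""
        xddot = (-f_prime(x) - c * xdot + u) / m   # solve (4.6) for the acceleration
        return x + dt * xdot, xdot + dt * xddot, xddot

Each step yields one training tuple (x, ẋ, ẍ, u) with which the LWPR model can be updated.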

Figure 4.2: The non-linear particle surface.

For the non-linear surface we arbitrarily choose the function depicted in Figure 4.2, which is given by the equation

f(x) = 4 \exp(-(1 - x)^2) + \sin x    (4.7)

Given a goal state (x_goal, ẋ_goal), the current state of the particle (x, ẋ) and a desired duration t, we generate a simple bang-bang trajectory that achieves the goal state from the current state in t time steps, as shown in Figure 4.3. The characteristic of this trajectory type is that it produces a constant acceleration for a fixed amount of time and then a negative acceleration of the same magnitude for the remaining time steps. The maximum magnitude of the acceleration on this trajectory is lowest among all possible trajectories that reach (x_goal, ẋ_goal) from (x, ẋ) in t time steps (Moore, 1990). Because of the low computational cost of generating a bang-bang trajectory, we recalculate it at every time step to account for possible tracking errors; a sketch of the computation is given below. This means that we do not pre-compute the required torques once but instead continuously query the LWPR model for the command that achieves the current desired acceleration on the current trajectory. As detailed in chapter 1, this procedure is one implementation of open-loop computed torque control.
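The bang-bang profile is determined by a switch time s and an acceleration magnitude a. A closed form is given in (Moore, 1990); the sketch below instead finds the pair by a simple grid search over s, which is enough to illustrate the idea (the degenerate case ẋ_goal = ẋ, where s = t/2, is not handled).

    import numpy as np

    def bang_bang(x0, v0, xg, vg, T, n_grid=1000):
        """Find (a, s) such that accelerating with +a on [0, s] and -a on [s, T]
        drives the state from (x0, v0) to approximately (xg, vg) (sketch)."""
        best = (np.inf, None, None)
        for s in np.linspace(1e-6, T - 1e-6, n_grid):
            if abs(2 * s - T) < 1e-9:
                continue                 # final-velocity condition degenerate here
            a = (vg - v0) / (2 * s - T)  # from v(T) = v0 + a*(2s - T) = vg
            x_T = x0 + v0 * T + a * (2 * s * T - s**2 - T**2 / 2.0)  # end position
            if abs(x_T - xg) < best[0]:
                best = (abs(x_T - xg), a, s)
        return best[1], best[2]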

Figure 4.3: An example bang-bang trajectory (position x(t), velocity v(t) and acceleration a(t)) to achieve the goal (x_goal, ẋ_goal) = (60.8, 2.2) from (x, ẋ) = (0, 1) in twenty time steps. Adapted from (Moore, 1990).

It is elementary algebra to derive the analytical solution to the inverse dynamics problem g : (x, ẋ, ẍ) → u by rearranging and solving for u in equation (4.6). The procedure that we use in this simulation, however, is to collect training points during manipulator movements and to update the LWPR approximation of g with them. Data collection is directed by our exploration algorithm as detailed in the previous section: as soon as the confidence bounds grow larger than a threshold, we collect a random data cloud of fixed size around the setpoint and then continue approaching the target. The aim of this simulation can be stated as learning the inverse dynamics sufficiently well, with limited exploration, to be able to reach the goal state. A picture of the simulation is given in Figure 4.5.

4.2.2 Robot Arm Simulation

For the simulated planar two-joint robot arm we extend the control space to a non-trivial size of six dimensions. At every simulation time step, the controller obtains the suggested torques from the learned LWPR inverse dynamics mapping g : (θ_1, θ̇_1, θ̈_1, θ_2, θ̇_2, θ̈_2) → (τ_1, τ_2), where θ, θ̇ and θ̈ refer to joint angle, angular velocity and desired angular acceleration, respectively.

Figure 4.4: The OpenGL planar robot arm simulation.

Figure 4.5: The OpenGL particle simulation. The left exploration window shows the explored points in position-velocity-acceleration space together with the confidence bounds in real-time.

Unlike the simulated particle, however, we do not apply the obtained torques τ_1 and τ_2 in an open-loop fashion but instead use a more realistic composite controller with a low-gain PID feedback element; a sketch is given below. The aim of this simulation differs from the previous one in that we pre-compute the desired accelerations along a single, fixed trajectory before we initiate any movements. We plan this trajectory with a bell-shaped velocity profile, which is computationally more complex than generating the bang-bang trajectory described previously. Rather than recomputing the trajectory at every time step, we attempt to track the original trajectory as closely as possible. As in the previous simulation, we still use a form of computed torque control, since we continuously improve our inverse dynamics approximation throughout the experiment.

The exploration strategy for this experiment corresponds to the second incarnation of our algorithm described in the previous section. Rather than collecting a random data cloud around a setpoint, we finish the trajectory with high-gain PID control as soon as the confidence bounds grow too large. From this we hope to collect training points that are more relevant for the successive attempts at tracking the trajectory.

The simulation is built using the Open Dynamics Engine (ODE)¹ to model the arm as two rigid bodies connected by a hinge joint. ODE employs a more sophisticated numerical integrator than the Euler method to simulate the movement of the arm in an accurate and stable fashion. A picture of the running simulation is given in Figure 4.4 on the previous page.

¹ For more details, refer to
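The composite control law can be sketched as the sum of the learned feedforward torque and a low-gain PD correction. The gains are illustrative, the integral term is omitted for brevity, and `predict` is assumed to be the model's look-up call.

    import numpy as np

    def composite_control(lwpr_model, q, qdot, q_des, qdot_des, qddot_des,
                          kp=5.0, kd=1.0):
        """Learned feedforward torque plus low-gain PD feedback (sketch)."""
        tau_ff = lwpr_model.predict(np.hstack([q, qdot, qddot_des]))  # inverse dynamics look-up
        tau_fb = kp * (q_des - q) + kd * (qdot_des - qdot)            # small corrective term
        return tau_ff + tau_fb

The low gains keep the arm compliant; once the model is accurate, the feedforward term does the bulk of the work.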

Chapter 5

Experimental Results

In this chapter we present our evaluation of the exploration algorithm detailed in the previous chapter. The two scenarios, the simulated Newtonian particle and the planar two-joint robot arm, are discussed in individual sections, as the focus of each experiment is slightly different.

The first experiments, with the Newtonian particle, are intended to demonstrate the operation of the algorithm. As the input space is only three-dimensional, we can give intuitive visualisations of the results. The focus of these experiments is to reach a goal state with as few trials as possible while at the same time reducing the number of explored data points. Performance comparisons are made between a simple random exploration scheme and our exploration algorithm.

The second class of experiments, with the simulated robot arm, takes place in a non-trivial six-dimensional input space. Unlike the case above, we pre-compute a single trajectory and demonstrate the tracking performance of our algorithm. Due to the higher dimensionality of the input space, focussing exploration on the right areas becomes even more relevant.

The LWPR algorithm requires a number of initialisation parameters, most importantly an initial value for the distance metric D (cf. equation (2.1) on page 15). In our experiments, the shape and size of each receptive field is revised during learning by optimising the local leave-one-out cross-validation error (Schaal and Atkeson, 1998). Different initialisations of D largely affect the convergence properties and the number of local models that are generated during learning.

For both experiments we give a brief overview of our parameter choices for the respective LWPR inverse dynamics models.

5.1 Particle Simulation

In the example scenario that we detail in this section, an open-loop controller based solely on the inverse dynamics model has to move the particle over the non-linear surface to the target spot. We arbitrarily select x = -6 as the start and x = 6 as the target position on the surface depicted in Figure 4.2. In the following, we present the workings of our algorithm on a typical and representative simulation run¹.

For the configuration of the LWPR on-line learning algorithm we select the following parameters and keep the remaining ones at their default values (illustrated in the sketch below):

- The initial value of the distance metric is set to 1.0. This was found empirically to give a good compromise between the number of constructed models and the performance of the system.
- We enable adaptation of the distance metric during learning and select second-order meta learning for faster convergence. The distance metric learning rate alpha is set to
- The learning system starts with two local projections but may add more based on a mean-squared-error criterion.
- Normalisation by dividing each input dimension by a constant value is disabled.

With these parameter choices, the learned controller is able to navigate the particle to the goal state in a single learning trial. We visualise the outcome of this experimental run in a variety of ways on the following page.

¹ This run can be reproduced by setting the seed property in the configuration file to a value of
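As an illustration, the parameter choices above might be expressed as follows with the Python bindings of the LWPR reference implementation. The attribute names follow that library's documentation as we recall it and should be treated as assumptions; the commented-out line marks the learning rate whose value is set in our configuration file.

    import numpy as np
    from lwpr import LWPR           # Python bindings of the LWPR reference implementation

    model = LWPR(3, 1)              # 3 inputs (x, xdot, xddot), 1 output (u)
    model.init_D = 1.0 * np.eye(3)  # initial distance metric set to 1.0
    model.update_D = True           # adapt the distance metric during learning
    model.meta = True               # second-order meta learning
    # model.init_alpha = ...        # distance metric learning rate (value from the config file)
    model.norm_in = np.ones(3)      # input normalisation effectively disabled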

Figure 5.1: A density plot of the explored position-velocity space (left) and a visualisation of the particle position and local model number over time (right).

Figure 5.2: The learned inverse dynamics model g : (x, ẋ, ẍ = 0) → τ after reaching the goal state (left) and the true dynamics over the whole input space (right).

Figure 5.3: Non-infinite confidence bounds around the learned inverse dynamics model g : (x, ẋ, ẍ = 0) → τ (left) and the nmse of the model predictions, for all actions and for controlled actions only (right).

5.1.1 Observations and Discussion

In order to initialise the LWPR model, the controller sends 100 random thrust signals (drawn from a Gaussian random variable with mean zero and standard deviation one) to the particle. It can be seen from Figure 5.1b that these signals cause the particle to slip down the surface, away from the goal position. After this random series of commands, we switch to an open-loop controller based on the current LWPR model of the particle dynamics. This approximation consists of four local models and must now attempt to stabilise and direct the particle to the goal position. Figure 5.1a details the exploratory phases of this movement towards the goal as a density plot in position-velocity space: darker regions signify more exploration than brighter ones.

The LWPR approximation is continuously revised during both controlled and exploratory movements. New training experiences are sampled after advancing the system state via Euler integration at discrete time increments ∆t. A confidence threshold θ = 2.0 is used to decide when a switch between controlled and exploratory actions should be performed.

With this strategy, the number of local models reaches its first peak around x = -4, when 9 models have been constructed.

Between x = -1 and x = 3, the system undergoes most exploration in this learning trial. As a result, 20 local models are constructed in this interval alone. An intuitive explanation can be given when we analyse the particle dynamics in more detail. Figure 5.4 shows the derivative of the surface function depicted in Figure 4.2.

Figure 5.4: The derivative of the surface function, df/dx = (8 - 8x) \exp(-(1 - x)^2) + \cos x.

To solve the particle dynamics analytically, we rearrange equation (4.6) to obtain:

u = m\ddot{x} + \frac{df}{dx} + c\dot{x}    (5.1)

We can see that the terms denoting velocity and acceleration only cause a translation of the derivative in the y-direction (cf. Figure 5.2b). Therefore, if we naively assume constant velocity and acceleration within an interval, the non-linearities of the particle dynamics in that interval follow directly from Figure 5.4. We can see that the derivative is highly non-linear in our interval [-1, 3]. Combined with the implementation of our exploration algorithm, which adds 100 random points around each setpoint, it is intuitively clear that a large number of local models will be fit in this region of input space to prevent oversmoothing. In our simulation run, no further exploration beyond this interval is initiated by our algorithm on the way to the goal state x = 6.

The inverse dynamics approximation that we converge to after the single learning trial is displayed in Figure 5.2a, together with the confidence intervals in Figure 5.3a.
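Equation (5.1), together with the surface derivative shown in Figure 5.4, gives the analytic ground truth against which the learned model can be compared. A sketch (mass and drag coefficient are illustrative, not our simulation values):

    import numpy as np

    def true_inverse_dynamics(x, xdot, xddot, m=1.0, c=0.1):
        """Analytic inverse dynamics of the particle, eq. (5.1)."""
        df_dx = (8.0 - 8.0 * x) * np.exp(-(1.0 - x) ** 2) + np.cos(x)  # derivative of eq. (4.7)
        return m * xddot + df_dx + c * xdot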

Note that in order to display the function g : R³ → R, we fix the last input dimension to 0. Of the 4380 training examples that were recorded during particle movement, 900 are due to directed exploration initiated by our exploration algorithm (corresponding to a total of 9 setpoints between start and goal state).

Figure 5.3b uses the normalised mean squared error (nmse) criterion (sketched below)

\mathrm{nmse} = \frac{1}{N \sigma_y^2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2    (5.2)

to give an evaluation of the learned inverse dynamics. Although our task was ultimately to reach the goal state while reducing learning trials and explored points, we can compare the model predictions against the analytical solution of the inverse dynamics problem. Our evaluation computes the nmse measure only for those task-specific points that actually occurred during the movement from the start to the goal state. We give two separate measures in Figure 5.3b, the first including the 900 exploratory actions and the second considering only the remaining controlled actions. It is conceivable that the performance over the entire input space is much worse than the reported results. This is an intended side effect of our algorithm, which restricts exploration to a task-specific subspace of the complete control space.
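The criterion itself is a one-liner: the sketch below divides the mean squared prediction error by the variance of the recorded targets, exactly as in equation (5.2).

    import numpy as np

    def nmse(y_true, y_pred):
        """Normalised mean squared error, eq. (5.2)."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.mean((y_true - y_pred) ** 2) / np.var(y_true)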

Figure 5.5: The explored position-velocity space after 10 simulation runs.

Figure 5.5 substantiates our claim that the single simulation run above holds up well as a representative example of our algorithm's workings in the particle scenario. In this diagram we depict the exploration density within position-velocity space after executing ten independent simulation runs. The results show a similar pattern as before: besides the 100 initial exploratory actions, the main focus of exploration lies in regions of high non-linearity, where most local models are aggregated.

We now make a brief attempt to compare the performance of our algorithm against a random exploration scheme. The random implementation is similar to the simplest scheme outlined in (Moore, 1991), with alternating perform and experiment trials. For our scenario, we choose to make a single exploratory action (drawn from a Gaussian random variable with mean zero and standard deviation one) after every third controlled action. We also introduce the following modification: whenever we observe that the particle is stuck on the surface or overshoots the target position, we manually retrain the model with 100 points in that area of the input space and invoke another trial.

Figure 5.6: A density plot of the randomly explored position-action space.

Figure 5.6 shows the explored position-action space after four learning trials. In total, 3760 random control signals were issued and 57 local models constructed before the particle reaches the goal state (x, ẋ) = (6, 0). The following observa-
