
Reinforcement Control via Heuristic Dynamic Programming

K. Wendy Tang and Govardhan Srikant
wtang@ee.sunysb.edu and gsrikant@ee.sunysb.edu
Department of Electrical Engineering
SUNY at Stony Brook, Stony Brook, NY 11794-2350

The authors acknowledge and appreciate discussions with Paul Werbos. This research was supported by the National Science Foundation under Grant No. ECS-966655.

ABSTRACT

Heuristic Dynamic Programming (HDP) is the simplest kind of Adaptive Critic, which is a powerful form of reinforcement control [1]. It can be used to maximize or minimize any utility function of a system over time, such as total energy or trajectory error, in a noisy environment. Unlike supervised learning, adaptive critic design does not require the desired control signals to be known. Instead, feedback is obtained from a critic network which learns the relationship between a set of control signals and the corresponding strategic utility function. It is an approximation of dynamic programming [1]. A simple Heuristic Dynamic Programming (HDP) system involves two subnetworks, the Action network and the Critic network. Each of these networks includes a feedforward and a feedback component. A flow chart for the interaction of these components is included. To further illustrate the algorithm, we use HDP for the control of a simple, 2-D planar robot.

1 Introduction

Neurocontrol is defined as the use of neural networks to emit control signals for dynamic systems. There are five basic schemes: supervised control, direct inverse control, neural adaptive control, Backpropagation of Utility, and Adaptive Critic Designs. Werbos [2] provided a detailed summary of the five schemes, including the pros and cons of each method. Of these techniques, both Backpropagation of Utility and Adaptive Critic are capable of maximizing a utility function. In an earlier report [3], we demonstrated the application of Backpropagation of Utility. In this paper, we concentrate on Adaptive Critic Design. Adaptive Critic is more powerful than Backpropagation of Utility, as it is capable of optimal control in a noisy, nonlinear, or nonstationary environment [4]. Werbos [5] recently showed how a critic network can be used to solve the generalized maze problem, a difficult optimization problem, using a recurrent critic design.

Within the scope of Adaptive Critic Designs, there are different levels of applications, including Heuristic Dynamic Programming (HDP) and the more advanced Dual Heuristic Programming (DHP) [4]. In a recent article, Prokhorov and Wunsch [6] demonstrated applications of the more advanced DHP and its variations. In this paper, our objective is to illustrate the application of Heuristic Dynamic Programming (HDP). We show that this simpler design is a cost-efficient alternative for less complex problems.

This article is organized as follows: Section 2 is a description of Heuristic Dynamic Programming (HDP). Section 3 illustrates how the algorithm is used for the control of a 2-D planar robot. Finally, conclusions and a summary are included in Section 4.

2 Heuristic Dynamic Programming - An Adaptive Critic Design

The objective of the Adaptive Critic design is to provide a set of control signals to a dynamic system to maximize a utility function over time. The utility function can be total energy, cost-efficiency, smoothness of a trajectory, and so on.

For expository convenience, we use the notation u(t) for the control signal and U(t) for the utility function, which is usually a function of the system state. The simplest form of Adaptive Critic design is Heuristic Dynamic Programming (HDP). A simple HDP system is composed of two subnetworks, an Action network and a Critic network. Such a system is also sometimes referred to as Action Dependent HDP, or simply ADHDP.

The Action network is responsible for providing the control signal that maximizes or minimizes the utility function. This goal is achieved through adaptation of the internal weights of the Action network, which is accomplished through backpropagation of various signals. Each iteration has a feedforward and a feedback component. In the feedforward mode, the Action network outputs a series of control signals, u(t), t = 1, ..., T, whereas adaptation of the internal weights is accomplished in the feedback mode.

The Critic network provides the relationship between a given set of control signals and the strategic utility function, J(t). Its function is twofold: (i) in the feedforward mode, it predicts the strategic utility function for a given set of control signals from the Action network; and (ii) in the feedback mode, it provides feedback to the Action network on how to change its control signals to achieve maximum or minimum utility. The basic idea is that once the Critic network learns the relationship between a given set of control signals and the corresponding strategic utility function, it can be used to provide proper feedback to the Action network to generate the desired control signals. Figure 1 shows a block-diagram representation of the system. The dashed lines represent the feedback mode, or derivative calculations. In other words, the application of HDP involves repeated iterations of two steps: (i) training the Critic network to learn the relationship between a given set of control signals and the corresponding strategic utility function; and (ii) training the Action network to generate a set of control signals. The interaction of these components is described in the flow chart in Figure 8.

For practical application of HDP, it is often necessary to start with some initial control signals. These can be generated by a classical PID controller or through Backpropagation of Utility [3]. Once such initial control signals are generated, an Action network is trained to provide these initial signals and a utility function is computed from them. The HDP algorithm then proceeds with repeated iterations of several components, as presented in Figure 8 and described in the next paragraph. These iterations repeat until the utility function is less than or greater than a threshold value, depending on whether we are minimizing or maximizing the utility function.

Referring to Figure 8, for each iteration of HDP, the forward component of the Action network is used to generate a set of control signals, u(t), t = 0, 1, ..., T-1. A detailed description of the forward component can be found in [3, 7] and is not repeated here. These control signals are used to compute the current utility and to create a target strategic utility function J for the Critic network. More specifically,

$$ J(t) = U(t) + m\,J(t+1), \qquad t = T-2, \ldots, 0, \tag{1} $$

where m < 1 is a scalar, U(t) is the utility at time t, and J(T-1) = U(T-1). Once a target J is created, the Critic network is trained to match its output with J.
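As a concrete illustration (not part of the original paper), the short NumPy sketch below computes the target strategic utility of Equation (1) from a sequence of per-step utilities; the example utility values and the discount factor are arbitrary.

```python
import numpy as np

def target_strategic_utility(U, m):
    """Backward sweep of Equation (1): J(t) = U(t) + m*J(t+1), with J(T-1) = U(T-1).

    U : utilities U(0), ..., U(T-1) computed from the current control signals.
    m : scalar discount factor, m < 1.
    """
    J = np.zeros(len(U))
    J[-1] = U[-1]                       # terminal condition J(T-1) = U(T-1)
    for t in range(len(U) - 2, -1, -1):
        J[t] = U[t] + m * J[t + 1]      # accumulate utility backward in time
    return J

# Example: a tracking-error utility to be minimized (arbitrary values).
U = np.array([0.9, 0.6, 0.4, 0.2, 0.1])
J_target = target_strategic_utility(U, m=0.5)
# The Critic network is then trained so that, given the control signals u(t),
# its output matches J_target(t); its backpropagated derivative dJ/du is what
# the Action network later uses as feedback (step (ii) of the iteration).
```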
When such training is complete, the Critic network can be used to provide the derivatives of J through backpropagation. This is achieved through a simple modification of the backward components; pseudo-code for this feedback is included in [3, 7] and is not repeated here. The feedback of the derivative of J is then used in the backward components of the Action network to find the weight gradient. Once the weight gradient is computed, standard weight adaptation algorithms such as Delta-Bar-Delta [8] or RPROP [9] can be used to update the internal weights of the Action network. The Action network can then be used to generate a new set of control signals, and the algorithm repeats until the utility is less than (or greater than) a prescribed threshold value.
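For illustration, the following sketch shows one generic form of an RPROP-style update that could be applied to the Action-network weight gradient. It is a standard textbook variant under the usual RPROP assumptions, not the authors' implementation; the Delta-Bar-Delta rule [8] would be substituted analogously.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One generic RPROP-style update of the Action-network weights w.

    Each weight has its own step size: the step grows while the gradient keeps
    its sign and shrinks when the sign flips (after a flip the update is skipped).
    """
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)   # suppress the update after a sign flip
    w = w - np.sign(grad) * step                  # move each weight against its gradient sign
    return w, grad, step

# Typical use across HDP iterations: carry (prev_grad, step) along with the weights.
w = np.zeros(4)
prev_grad = np.zeros(4)
step = np.full(4, 0.1)
grad = np.array([0.3, -0.2, 0.0, 0.5])            # weight gradient from the Critic's feedback
w, prev_grad, step = rprop_step(w, grad, prev_grad, step)
```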

3 An Example: 2-D Planar Robot Control

As an example, we consider a simple planar manipulator with two rotational joints (Figure 2). We assume, without loss of generality, that the robot links can be represented as point masses concentrated at the ends of the links. The two links have equal point masses (m1 = m2) and equal lengths d1 = d2 = 1 m. This simple dynamic system is governed by the dynamic equations

$$
\begin{aligned}
\tau_{d1}(t) = {}& \bigl[(m_1+m_2)d_1^2 + m_2 d_2^2 + 2 m_2 d_1 d_2 \cos\theta_2(t)\bigr]\,\ddot\theta_1(t)
  + \bigl[m_2 d_2^2 + m_2 d_1 d_2 \cos\theta_2(t)\bigr]\,\ddot\theta_2(t) \\
& - 2 m_2 d_1 d_2 \sin\theta_2(t)\,\dot\theta_1(t)\,\dot\theta_2(t)
  - m_2 d_1 d_2 \sin\theta_2(t)\,\dot\theta_2^2(t) \\
& + (m_1+m_2)\,g\,d_1 \sin\theta_1(t) + m_2\,g\,d_2 \sin\bigl(\theta_1(t)+\theta_2(t)\bigr) \\[6pt]
\tau_{d2}(t) = {}& \bigl[m_2 d_2^2 + m_2 d_1 d_2 \cos\theta_2(t)\bigr]\,\ddot\theta_1(t) + m_2 d_2^2\,\ddot\theta_2(t)
  - m_2 d_1 d_2 \sin\theta_2(t)\,\dot\theta_1(t)\,\dot\theta_2(t) \\
& - m_2 d_1 d_2 \sin\theta_2(t)\,\dot\theta_1^2(t)
  + m_2\,g\,d_2 \sin\bigl(\theta_1(t)+\theta_2(t)\bigr)
\end{aligned}
\tag{2}
$$

where τd1 and τd2 are the desired joint forces needed to drive the manipulator along the prescribed trajectory, and g = 9.8 m/s² is the gravitational constant. We consider that initially, at time t = 0 seconds, the state of the manipulator is θ1(0) = θ̇1(0) = θ̈1(0) = θ2(0) = θ̇2(0) = θ̈2(0) = 0.

The HDP's task is to generate a series of control signals u1(t) = τ1(t), u2(t) = τ2(t), t = Δt, 2Δt, ..., tf = TΔt = 5 seconds, that drive the manipulator from the initial configuration θ = 0 for both joints to θf = θ(tf) = 60° and 90° for θ1 and θ2, respectively. The desired trajectory for both joints is specified by the quintic polynomials [10]:

$$
\begin{aligned}
\theta_d(t) &= \theta_0 + 10(\theta_f-\theta_0)(t/t_f)^3 - 15(\theta_f-\theta_0)(t/t_f)^4 + 6(\theta_f-\theta_0)(t/t_f)^5 \\
\dot\theta_d(t) &= 30(\theta_f-\theta_0)(t^2/t_f^3) - 60(\theta_f-\theta_0)(t^3/t_f^4) + 30(\theta_f-\theta_0)(t^4/t_f^5) \\
\ddot\theta_d(t) &= 60(\theta_f-\theta_0)(t/t_f^3) - 180(\theta_f-\theta_0)(t^2/t_f^4) + 120(\theta_f-\theta_0)(t^3/t_f^5)
\end{aligned}
\tag{3}
$$

Figures 3-5 show the desired trajectories for both joints. We used Backpropagation of Utility [3] to generate a set of initial forces. These initial forces (τ1, τ2) and the corresponding desired forces (τd1, τd2) are shown in Figure 6. We emphasize that the desired forces are used here for comparison only; their values are not used in HDP learning. After a modest number of iterations, the final forces obtained from HDP become very close to the desired values. Figure 7 plots the absolute errors of these forces, |τ1 - τd1| and |τ2 - τd2|.

For further investigation, we applied the HDP algorithm to a second trajectory that drives the two joints to θf = 30° and 60°, respectively. Again, the initial forces are generated by Backpropagation of Utility. Figures 9 and 10 show the initial and desired forces and the final force error after HDP learning.

In our implementation of HDP, we found it helpful to use multiple Critic networks for different time horizons. Each Critic network learns ten different sampling points, with two additional overlapping points at the beginning and end of its time period. Furthermore, in our implementation of the target J function (Equation 1), we set m = 0. Werbos has suggested that for more complex systems, it is necessary to set m = 0 initially and then gradually relax m to a non-zero value [11].
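To make the example concrete, here is a small NumPy sketch (not from the paper) that generates the quintic desired trajectories of Equation (3) and evaluates the joint torques as written in Equation (2). The link masses m1 and m2 are illustrative assumptions; d1 = d2 = 1 m and g = 9.8 m/s² follow the text.

```python
import numpy as np

def quintic_trajectory(theta0, thetaf, tf, t):
    """Desired position, velocity, and acceleration of Equation (3)."""
    d = thetaf - theta0
    s = t / tf
    pos = theta0 + d * (10 * s**3 - 15 * s**4 + 6 * s**5)
    vel = d * (30 * t**2 / tf**3 - 60 * t**3 / tf**4 + 30 * t**4 / tf**5)
    acc = d * (60 * t / tf**3 - 180 * t**2 / tf**4 + 120 * t**3 / tf**5)
    return pos, vel, acc

def joint_torques(th1, th2, dth1, dth2, ddth1, ddth2,
                  m1=1.0, m2=1.0, d1=1.0, d2=1.0, g=9.8):
    """Joint torques of the two-link point-mass arm, following Equation (2).

    m1 and m2 are illustrative assumptions; d1 = d2 = 1 m and g = 9.8 m/s^2
    are taken from the example in the text.
    """
    c2, s2 = np.cos(th2), np.sin(th2)
    tau1 = ((m1 + m2) * d1**2 + m2 * d2**2 + 2 * m2 * d1 * d2 * c2) * ddth1 \
         + (m2 * d2**2 + m2 * d1 * d2 * c2) * ddth2 \
         - 2 * m2 * d1 * d2 * s2 * dth1 * dth2 \
         - m2 * d1 * d2 * s2 * dth2**2 \
         + (m1 + m2) * g * d1 * np.sin(th1) + m2 * g * d2 * np.sin(th1 + th2)
    tau2 = (m2 * d2**2 + m2 * d1 * d2 * c2) * ddth1 + m2 * d2**2 * ddth2 \
         - m2 * d1 * d2 * s2 * dth1 * dth2 \
         - m2 * d1 * d2 * s2 * dth1**2 \
         + m2 * g * d2 * np.sin(th1 + th2)
    return tau1, tau2

# Desired torques along the first trajectory (joint 1: 0 -> 60 deg, joint 2: 0 -> 90 deg).
tf = 5.0
t = np.linspace(0.0, tf, 21)
q1, dq1, ddq1 = quintic_trajectory(0.0, np.deg2rad(60.0), tf, t)
q2, dq2, ddq2 = quintic_trajectory(0.0, np.deg2rad(90.0), tf, t)
tau_d1, tau_d2 = joint_torques(q1, q2, dq1, dq2, ddq1, ddq2)
```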

4 Conclusions

Among the five forms of neuro-control schemes, Backpropagation of Utility and Adaptive Critic Designs (ACDs) are potentially powerful techniques, as they are capable of maximizing or minimizing a utility function over time. Heuristic Dynamic Programming (HDP) is the simplest ACD. In our previous work [3], we demonstrated how to use Backpropagation of Utility for a simple control problem. Its main drawback, however, is that it requires exact emulation of the dynamic system by a neural network, the model network [3]. Adaptive Critic Designs, as an approximation of dynamic programming, are capable of optimal control in a noisy environment.

In this paper we showed that the simple HDP, combined with Backpropagation of Utility, is a cost-effective alternative for less complex systems. Our simple HDP design, also known as the "Action Dependent" variation of HDP or ADHDP, is composed of two subnetworks, the Action net and the Critic net. While the former generates control signals for the system, the latter provides a judgement, or feedback, to evaluate how effective the generated control signals are. Such feedback, in the form of the derivative of the judgement, is used to update the internal weights of the Action net and eventually drive the control signals to their desired values.

As an illustration, we used the HDP design for control of a 2-D planar robot. Albeit a simple case, the 2-D example involves various dynamic effects and coupling of joint forces. In our previous attempt at using Backpropagation of Utility to control this system, we were unable to adequately emulate the system without prolonged training of the model network. In this paper, we demonstrated that even though the model net used by Backpropagation of Utility is not adequate, the HDP design is capable of driving the control signals to the desired values through proper feedback of reward and punishment, the strategic utility function J (Equation 1). A flow chart is provided to illustrate the interaction of the various components of the HDP design.

References

[1] Paul J. Werbos. Approximate Dynamic Programming for Real-Time Control and Neural Modeling. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control, pages 493-525. Van Nostrand Reinhold, 1992.

[2] Paul J. Werbos. Neurocontrol and Supervised Learning: An Overview and Evaluation. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control, pages 65-89. Van Nostrand Reinhold, 1992.

[3] K. W. Tang and Girish Pingle. "Neuro-Remodeling via Backpropagation of Utility." Journal of Mathematical Modeling and Scientific Computing, May 1996. Accepted for publication.

[4] Paul J. Werbos. A Menu of Designs for Reinforcement Learning Over Time. In W. T. Miller, III, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control, pages 67-95. MIT Press, 1990.

[5] Paul J. Werbos and X. Pang. "Generalized Maze Navigation: SRN Critics Solve What Feedforward or Hebbian Nets Cannot." In World Congress on Neural Networks, pages 88-93, San Diego, CA, September 1996.

[6] D. V. Prokhorov and D. C. Wunsch II. "Advanced Adaptive Critic Designs." In World Congress on Neural Networks, San Diego, CA, September 1996.

[7] Paul J. Werbos. "Backpropagation Through Time: What It Does and How to Do It." Proceedings of the IEEE, 78(10):1550-1560, October 1990.

[8] Robert A. Jacobs. "Increased Rates of Convergence Through Learning Rate Adaptation." Neural Networks, 1:295-307, 1988.

[9] M. Riedmiller and H. Braun. "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm." In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, April 1993.

[10] J. J. Craig. Introduction to Robotics: Mechanics and Control. Addison-Wesley Publishing Co., New York, NY, 1986.

[11] Paul J. Werbos. Verbal communication, 1996.

[Figure 1: A Simple HDP System -- block diagram of the Action network (producing the control signal u(t) from the system state) and the Critic network (producing J(t)), with the error/feedback paths between them.]

[Figure 2: A Two-Link Manipulator -- two point masses with joint angles and desired joint angles.]

[Figure 3: Desired Position Trajectory for Joints 1 and 2 (joint position in degrees vs. time in seconds).]

[Figure 4: Desired Velocity Trajectory for Joints 1 and 2 (joint velocity in degrees/sec vs. time in seconds).]

[Figure 5: Desired Acceleration Trajectory for Joints 1 and 2 (joint acceleration in degrees/sec^2 vs. time in seconds).]

[Figure 7: Resultant Force Error After HDP (force error in Newtons vs. time in seconds).]

[Figure 6: Initial and Desired Forces -- initial forces from Backpropagation of Utility and the desired forces for the first trajectory (force in Newtons vs. time in seconds).]

[Figure 8: Flow Chart for HDP Training -- START; initialization with the initial force; if utility < threshold, STOP; otherwise: forward pass of the Action network to generate the control signal u(t); compute the utility U(t); create the target J(t) = U(t) + m J(t+1); train the Critic to match J; backward pass of the Action network using the Critic's feedback to obtain the weight gradient F_W; update the Action weights with the Delta-Bar-Delta rule; repeat.]

[Figure 9: Initial and Desired Forces for Second Trajectory (force in Newtons vs. time in seconds).]

[Figure 10: Force Error for Second Trajectory (force error in Newtons vs. time in seconds).]