Neuro-Dynamic Programming: An Overview

1 Neuro-Dynamic Programming: An Overview
Dimitri Bertsekas
Dept. of Electrical Engineering and Computer Science, M.I.T.
May 2006

2 BELLMAN AND THE DUAL CURSES
Dynamic Programming (DP) is very broadly applicable, but it suffers from:
- Curse of dimensionality
- Curse of modeling
We address complexity by using approximations (based loosely on parametric/neural architectures)
Unlimited applications in planning, resource allocation, stochastic control, and discrete optimization
Application is an art, but it is guided by substantial theory

3 OUTLINE
- Main NDP framework
- Discussion of two classes of methods:
  - Actor-critic methods / LSPE
  - Rollout algorithms
- Connection between rollout and Model Predictive Control (MPC)
Book references:
- Neuro-Dynamic Programming (Bertsekas and Tsitsiklis)
- Reinforcement Learning (Sutton and Barto)
- Dynamic Programming, 3rd Edition (Bertsekas)
Papers can be downloaded from http://web.mit.edu/dimitrib/www/home.html

4 DYNAMIC PROGRAMMING / DECISION AND CONTROL
Main ingredients:
- Dynamic system; state evolving in discrete time
- Decision/control applied at each time
- Cost is incurred at each time
- There may be noise and model uncertainty
- There is state feedback used to determine the control
[Diagram: decision/control applied to the system; the system state is returned to the decision maker through a feedback loop]

5 ESSENTIAL TRADEOFF CAPTURED BY DP
Decisions are made in stages
The decision at each stage:
- Determines the present stage cost
- Affects the context within which future decisions are made
At each stage we must trade off:
- Low present stage cost
- Undesirability of high future costs

6 KEY DP RESULT: BELLMAN'S EQUATION
The optimal decision at the current state minimizes the expected value of:
current stage cost + future stages cost starting from the next state (using an optimal policy)
Extensive mathematical methodology
Applies to both discrete and continuous systems (and hybrids)
Dual curses of dimensionality/modeling
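For concreteness, one standard discounted-cost form of Bellman's equation is shown below; the notation (state x, control u, disturbance w, dynamics f, stage cost g, discount factor alpha) is assumed here rather than taken from the slides.

\[
J^*(x) \;=\; \min_{u \in U(x)} \; \mathbb{E}_w\!\left[\, g(x,u,w) \;+\; \alpha\, J^*\big(f(x,u,w)\big) \right]
\]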

7 KEY NDP IDEA
Use one-step lookahead with an approximate cost
At the current state, select the decision that minimizes the expected value of:
current stage cost + approximate future stages cost starting from the next state
Important issues:
- How to construct the approximate cost of a state
- How to understand and control the effects of approximation
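A minimal sketch of the one-step lookahead rule in Python; all function names (controls, expectation, stage_cost, next_state, J_tilde) are hypothetical placeholders introduced for illustration, not part of the original slides.

```python
import math

def one_step_lookahead(x, controls, expectation, stage_cost, next_state, J_tilde, alpha=1.0):
    """One-step lookahead with an approximate cost-to-go J_tilde.

    controls(x)          -> iterable of feasible controls at state x
    expectation(x, u, h) -> E_w[ h(w) ] over the disturbance w at (x, u)
    stage_cost(x, u, w)  -> g(x, u, w)
    next_state(x, u, w)  -> f(x, u, w)
    J_tilde(x)           -> approximate cost-to-go of state x
    """
    best_u, best_q = None, math.inf
    for u in controls(x):
        # Q-factor of (x, u): expected stage cost plus approximate cost of the next state
        q = expectation(x, u, lambda w: stage_cost(x, u, w) + alpha * J_tilde(next_state(x, u, w)))
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```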

8 METHODS TO COMPUTE AN APPROXIMATE COST
Parametric approximation algorithms (off-line):
- Use a functional approximation to the optimal cost function
- Select the weights of the approximation - connection with neural networks
- One possibility: hand-tuning, and trial and error
- Systematic DP-related policy and value iteration methods [TD(λ), Q-learning, LSPE, LSTD, etc.] - simulation and least squares fit
Rollout algorithms (on-line):
- Simulate the system under some (good heuristic) policy starting from the state of interest
- Use the cost of the heuristic (or a lower bound) as the cost approximation
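As a sketch of the "simulation and least squares fit" idea, the snippet below fits a linear-in-features cost approximation to Monte Carlo cost estimates; the names (features, simulate_cost) and the plain regression setup are assumptions for illustration, not a specific method from the talk.

```python
import numpy as np

def fit_linear_cost_approximation(states, features, simulate_cost, num_rollouts=20):
    """Fit J_tilde(x) = phi(x)' r by least squares on simulated costs.

    states        : list of sample states
    features(x)   : feature vector phi(x) as a 1-D numpy array
    simulate_cost : returns one sampled cost-to-go from state x under the
                    policy being evaluated (a simulator, not a model)
    """
    Phi = np.array([features(x) for x in states])                 # sample feature matrix
    targets = np.array([np.mean([simulate_cost(x) for _ in range(num_rollouts)])
                        for x in states])                         # Monte Carlo cost estimates
    r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)             # least squares fit of the weights
    return lambda x: float(features(x) @ r)                       # the fitted approximation
```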

9 SIMULATION AND LEARNING
Simulation (learning by experience), used to compute the (approximate) cost-to-go, is a key distinctive aspect of NDP
Important advantage: a detailed model of the system is not necessary - use a simulator instead
In the case of parametric approximation: off-line learning
In the case of a rollout algorithm: on-line learning is used (we learn only the cost values needed by the on-line simulation)

10 PARAMETRIC APPROXIMATION: CHESS PARADIGM
Chess-playing computer programs
State = board position
Score of position: important features, appropriately weighted
[Diagram: position evaluator - feature extraction (material balance, mobility, safety, etc.) feeds a scoring function that produces the score of the position]
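A sketch of such a feature-weighted position evaluator; the feature methods (material_balance, mobility, king_safety) and the weight values are purely hypothetical illustrations of the architecture in the diagram.

```python
import numpy as np

def extract_features(position):
    # Hypothetical feature extractor: maps a board position to a feature vector
    return np.array([
        position.material_balance(),   # material balance
        position.mobility(),           # mobility
        position.king_safety(),        # safety
    ])

def score(position, weights):
    """Linear scoring function: weighted sum of the extracted features."""
    return float(extract_features(position) @ weights)

# Example: hand-tuned weights, as in classical chess programs
weights = np.array([1.0, 0.1, 0.5])
```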

11 TRAINING
In chess: weights are hand-tuned
In more sophisticated methods: weights are determined by simulation-based training algorithms
- TD(λ), Q-learning, Least Squares Policy Evaluation (LSPE), Least Squares Temporal Differences (LSTD), extended Kalman filtering, etc.
All of these methods are based on the DP ideas of policy iteration and value iteration

12 POLICY IMPROVEMENT PRINCIPLE
Given a current policy, define a new policy as follows: at each state, minimize
current stage cost + cost-to-go of the current policy (starting from the next state)
Policy improvement result: the new policy has improved performance over the current policy
If the cost-to-go is approximate, the improvement is approximate
- Oscillation around the optimal; error bounds are available
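In the same assumed notation as above, with mu the current policy and J_mu its cost-to-go, the improvement step and the resulting guarantee can be written as:

\[
\bar{\mu}(x) \;\in\; \arg\min_{u \in U(x)} \; \mathbb{E}_w\!\left[\, g(x,u,w) \;+\; \alpha\, J_{\mu}\big(f(x,u,w)\big) \right],
\qquad
J_{\bar{\mu}}(x) \;\le\; J_{\mu}(x) \ \ \text{for all } x .
\]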

13 ACTOR/CRITIC SYSTEMS
Metaphor for policy improvement/evaluation:
- Actor implements the current policy
- Critic evaluates the performance; passes feedback to the actor
- Actor changes the policy
[Diagram: the actor (decision generator) applies decisions to the system simulator; the simulation data go to the critic (performance evaluation), which updates the approximate scoring function and feeds it back to the actor]

14 POLICY EVALUATION BY VALUE ITERATION
Value iteration to evaluate the cost of a fixed policy:
$J_{t+1} = T(J_t)$, where $T$ is the DP mapping
Value iteration with linear function approximation:
$\Phi r_{t+1} = \Pi T(\Phi r_t)$
where $\Phi$ is a matrix of basis functions/features and $\Pi$ is the projection with respect to the steady-state distribution norm
Remarkable fact: $\Pi T$ is a contraction for discounted and other problems
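A minimal model-based sketch of the projected iteration $\Phi r_{t+1} = \Pi T(\Phi r_t)$ for a finite-state chain under a fixed policy; the inputs (transition matrix P, expected stage costs g, feature matrix Phi) are assumed for illustration, and the simulation-based version appears on the next slide.

```python
import numpy as np

def projected_value_iteration(P, g, Phi, alpha=0.95, iters=200):
    """Projected value iteration Phi r_{t+1} = Pi T(Phi r_t) for a fixed policy.

    P   : (n, n) transition matrix of the chain under the policy
    g   : (n,)   expected one-stage costs under the policy
    Phi : (n, k) matrix of basis functions/features
    Pi is the projection onto span(Phi), weighted by the steady-state
    distribution xi of P (the weighting under which Pi T is a contraction).
    """
    # Steady-state distribution: left eigenvector of P for eigenvalue 1
    w, V = np.linalg.eig(P.T)
    xi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    xi = xi / xi.sum()
    Xi = np.diag(xi)

    M = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # (Phi' Xi Phi)^{-1} Phi' Xi
    r = np.zeros(Phi.shape[1])
    for _ in range(iters):
        TJ = g + alpha * P @ (Phi @ r)                  # T applied to the current approximation
        r = M @ TJ                                      # xi-weighted projection onto span(Phi)
    return r                                            # Phi @ r approximates the policy's cost
```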

15 LSPE: SIMULATION-BASED IMPLEMENTATION
Simulation-based implementation of $\Phi r_{t+1} = \Pi T(\Phi r_t)$ with an infinitely long trajectory and least squares:
$\Phi r_{t+1} = \Pi T(\Phi r_t) + \text{diminishing simulation noise}$
Interesting convergence theory (see papers at the www site)
Use of the steady-state distribution norm is critical
Optimal convergence rate; much better than TD(λ)
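A minimal LSPE(0)-style sketch along a single simulated trajectory, under the usual assumptions (discounted cost, fixed policy); the function names (simulate_step, features) and the small ridge term are assumptions added for illustration, not details from the talk.

```python
import numpy as np

def lspe0(simulate_step, features, x0, alpha=0.95, num_steps=10000):
    """LSPE(0)-style policy evaluation along one simulated trajectory.

    simulate_step(x) -> (cost g(x, x'), next state x') under the fixed policy
    features(x)      -> feature vector phi(x), a 1-D numpy array
    Iteration: r <- r + B^{-1} * sum_t phi_t * (g_t + alpha*phi_{t+1}'r - phi_t'r),
    with B = sum_t phi_t phi_t' (a simulation-based projected value iteration).
    """
    phi = features(x0)
    k = len(phi)
    B = 1e-3 * np.eye(k)      # running sum of phi phi' (small ridge term for invertibility)
    A = np.zeros((k, k))      # running sum of phi (alpha*phi_next - phi)'
    c = np.zeros(k)           # running sum of phi * g
    r = np.zeros(k)
    x = x0
    for _ in range(num_steps):
        g, x_next = simulate_step(x)
        phi_next = features(x_next)
        B += np.outer(phi, phi)
        A += np.outer(phi, alpha * phi_next - phi)
        c += phi * g
        r = r + np.linalg.solve(B, c + A @ r)    # least squares update of the weights
        x, phi = x_next, phi_next
    return r
```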

16 SUMMARY OF ACTOR-CRITIC SYSTEMS
A lot of mathematical analysis, insight, and practical experience are now available
There is solid theory for:
- Methods with exact (lookup table) cost representations
- Policy evaluation methods with linear function approximation [TD(λ), LSPE, LSTD]
In approximate policy iteration, typically, improved policies are obtained early, then the method oscillates
On-line computation is small
Training is challenging and time-consuming
Less suitable when the problem data change frequently

17 ROLLOUT POLICIES: BACKGAMMON PARADIGM
On-line (approximate) cost-to-go calculation by simulation of some base policy (heuristic)
Rollout: select the action with the best simulation results
Rollout is one-step policy iteration
[Diagram: from the current position, each possible move is scored by its average score under Monte Carlo simulation]
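A sketch of rollout action selection by Monte Carlo simulation of a base heuristic; all names (actions, sample_transition, base_policy, simulate) and the default sample counts are hypothetical placeholders.

```python
import math

def rollout_action(x, actions, sample_transition, base_policy, simulate,
                   num_sims=100, horizon=200):
    """Choose an action by Monte Carlo rollout of a base heuristic policy.

    actions(x)                   -> feasible actions at state x
    sample_transition(x, u)      -> (stage cost, sampled next state)
    base_policy(x)               -> action chosen by the base heuristic at x
    simulate(x, policy, horizon) -> sampled cost of following `policy` from x
    """
    best_u, best_cost = None, math.inf
    for u in actions(x):
        est = []
        for _ in range(num_sims):
            g, x_next = sample_transition(x, u)
            # cost of the candidate action plus simulated cost of the base heuristic thereafter
            est.append(g + simulate(x_next, base_policy, horizon))
        avg = sum(est) / len(est)
        if avg < best_cost:
            best_u, best_cost = u, avg
    return best_u
```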

18 COST IMPROVEMENT PROPERTY
Generic result: rollout improves on the base heuristic
- A special case of policy iteration/policy improvement
Extension to multiple base heuristics:
- From each next state, run multiple heuristics
- Use as the value of the next state the best heuristic value
- Cost improvement: the rollout algorithm performs at least as well as each of the base heuristics
Interesting fact: the classical open-loop feedback control policy is a special case of rollout (the base heuristic is the optimal open-loop policy)
In practice, substantial improvements over the base heuristic(s) have been observed
Major drawback: extensive Monte Carlo simulation

19 STOCHASTIC PROBLEMS
Major issue: the computational burden of Monte Carlo simulation
Motivation to use approximate Monte Carlo
Approximate Monte Carlo by certainty equivalence: assume future unknown quantities are fixed at some typical values
Advantage: a single simulation run per next state, but some loss of optimality
Extension to multiple scenarios (see Bertsekas and Castanon, 1997)

20 ROLLOUT ALGORITHM PROPERTIES
Forward looking (the heuristic runs to the end)
Self-correcting (the heuristic is reapplied at each time step)
Suitable for on-line use
Suitable for replanning
Suitable for situations where the problem data are a priori unknown
Substantial positive experience with many types of optimization problems, including combinatorial ones (e.g., scheduling)

21 DETERMINISTIC PROBLEMS
ONLY ONE simulation trajectory needed
Use heuristic(s) for the approximate cost-to-go calculation
At each state, consider all possible next states, and run the heuristic(s) from each
Select the next state with the best heuristic cost
Straightforward to implement
Cost improvement results are sharper (Bertsekas, Tsitsiklis, and Wu, 1997; Bertsekas, 2005)
Extension to constrained problems
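For the deterministic case, the rollout step simplifies to a single heuristic run per candidate next state (no averaging); a minimal sketch, with all names (actions, next_state, stage_cost, heuristic_cost) assumed for illustration:

```python
import math

def deterministic_rollout_step(x, actions, next_state, stage_cost, heuristic_cost):
    """One rollout step for a deterministic problem: one heuristic run per next state.

    actions(x)        -> feasible controls at x
    next_state(x, u)  -> successor state under control u
    stage_cost(x, u)  -> cost of applying u at x
    heuristic_cost(x) -> cost of running the base heuristic from x to the end
    """
    best_u, best_cost = None, math.inf
    for u in actions(x):
        total = stage_cost(x, u) + heuristic_cost(next_state(x, u))
        if total < best_cost:
            best_u, best_cost = u, total
    return best_u, best_cost
```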

22 MODEL PREDICTIVE CONTROL
Motivation: deal with state/control constraints
Basic MPC framework:
- Deterministic discrete-time system $x_{k+1} = f(x_k, u_k)$
- Control constraint $u_k \in U$, state constraint $x_k \in X$
- Quadratic cost per stage: $x'Qx + u'Ru$
MPC operation, at the typical state x:
- Drive the state to 0 in m stages with minimum quadratic cost, while observing the constraints
- Use the 1st component of the m-stage optimal control sequence, discard the rest
- Repeat at the next state
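A minimal sketch of this MPC loop, assuming linear dynamics x_{k+1} = A x_k + B u_k and box constraints standing in for the sets U and X (the slides' f need not be linear), and assuming the cvxpy package is available to solve the m-stage problem.

```python
import numpy as np
import cvxpy as cp   # assumed dependency for the constrained m-stage problem

def mpc_control(x0, A, B, Q, R, m, u_max, x_max):
    """Solve the m-stage constrained problem from state x0 and return the first control."""
    n, p = B.shape
    x = cp.Variable((n, m + 1))
    u = cp.Variable((p, m))
    cost = 0
    constraints = [x[:, 0] == x0, x[:, m] == 0]          # start at x0, drive the state to 0
    for k in range(m):
        cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)   # x'Qx + u'Ru per stage
        constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],     # dynamics
                        cp.abs(u[:, k]) <= u_max,                     # control constraint
                        cp.abs(x[:, k]) <= x_max]                     # state constraint
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[:, 0]          # apply only the first control, then replan at the next state
```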

23 ADVANTAGES OF MPC
It can deal explicitly with state and control constraints
It can be implemented using standard deterministic optimal control methodology
Key result: the resulting (suboptimal) closed-loop system is stable (under a constrained controllability assumption - Keerthi/Gilbert, 1988)
Connection with infinite-time reachability
Extension to problems with a set-membership description of uncertainty

24 CONNECTION OF MPC AND ROLLOUT
MPC <==> rollout with a suitable base heuristic
Heuristic: apply the (m-1)-stage policy that drives the state to 0 with minimum cost
Stability of MPC <==> cost improvement of rollout
Base heuristic stable ==> rollout policy is also stable

25 EXTENSIONS
The relation with rollout suggests more general MPC schemes:
- Nontraditional control and/or state constraints
- Set-membership disturbances
The success of MPC should encourage the use of rollout

26 RESTRICTED STRUCTURE POLICIES
General suboptimal control scheme. At each time step:
- Impose restrictions on future information or control
- Optimize the future under these restrictions
- Use the 1st component of the restricted policy
- Recompute at the next step
Special cases:
- Rollout, MPC: restrictions on future control
- Open-loop feedback control: restrictions on future information
Main result for the suboptimal policy so obtained: it has better performance than the restricted policy

27 CONCLUDING REMARKS
NDP is a broadly applicable methodology; it addresses optimization problems that are intractable in other ways
Many off-line and on-line methods to choose from
Interesting theory
No need for a detailed model; a simulator suffices
Computational requirements are substantial
Successful application is an art
Rollout has been the most consistently successful methodology
Rollout has interesting connections with other successful methodologies, such as MPC and open-loop feedback control