Planning and Control: Markov Decision Processes

CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes

Planning (diagram): an agent interacting with its Environment through Percepts and Actions, asking "What action next?" Dimensions of the problem: static vs. dynamic; predictable vs. unpredictable; fully vs. partially observable; perfect vs. noisy percepts; deterministic vs. stochastic, discrete vs. continuous outcomes; full vs. partial satisfaction.

Classical Planning (diagram): static, predictable environment; fully observable; perfect percepts; discrete, deterministic outcomes; full satisfaction. What action next?

Stochastic Planning (diagram): static but unpredictable environment; fully observable; perfect percepts; discrete, stochastic outcomes; full satisfaction. What action next?

Deterministic, fully observable

Stochastic, Fully Observable

Stochastic, Partially Observable

Markov Decision Process (MDP): S: a set of states; A: a set of actions; Pr(s'|s,a): transition model; C(s,a,s'): cost model; G: set of goals; s0: start state; γ: discount factor; R(s,a,s'): reward model.
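
To make the tuple concrete, here is a minimal sketch of one way to hold these components in Python (the class and field names are illustrative, not part of the slides):

```python
# Minimal, illustrative container for the MDP tuple <S, A, Pr, C/R, G, s0, gamma>.
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list        # S: set of states
    actions: dict       # A: applicable actions per state, e.g. {s: [a1, a2, ...]}
    transitions: dict   # Pr(s'|s,a): {(s, a): [(s_next, prob), ...]}
    rewards: dict       # R(s,a,s') (or negated costs C): {(s, a, s_next): value}
    gamma: float = 0.95                        # discount factor
    start: object = None                       # s0: start state
    goals: set = field(default_factory=set)    # G: goal states (goal-directed MDPs)
```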

Role of Discount Factor (γ): keeps the total reward/total cost finite; useful for infinite-horizon problems, and sometimes for indefinite-horizon problems if there are dead ends. Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ r2 + γ² r3 + …; total cost: c1 + γ c2 + γ² c3 + …

Objective of a Fully Observable MDP: find a policy π: S → A which optimises (minimises expected cost to reach a goal, maximises expected reward, or maximises expected reward minus cost), discounted or undiscounted, given a finite, infinite, or indefinite horizon, assuming full observability.

Examples of MDPs: Goal-directed, indefinite-horizon, cost-minimisation MDP <S, A, Pr, C, G, s0>. Infinite-horizon, discounted-reward-maximisation MDP <S, A, Pr, R, γ>, with reward = Σ_t γ^t r_t. Goal-directed, finite-horizon, probability-maximisation MDP <S, A, Pr, G, s0, T>.

Bellman Equations for MDP 1: <S, A, Pr, C, G, s0>. Define J*(s) (optimal cost) as the minimum expected cost to reach a goal from this state. J* should satisfy the following equation: J*(s) = 0 if s ∈ G, and otherwise J*(s) = min_a Q*(s,a), where Q*(s,a) = Σ_{s'} Pr(s'|s,a) [C(s,a,s') + J*(s')].

Bellman Equations for MDP 2: <S, A, Pr, R, s0, γ>. Define V*(s) (optimal value) as the maximum expected discounted reward from this state. V* should satisfy the following equation: V*(s) = max_a Σ_{s'} Pr(s'|s,a) [R(s,a,s') + γ V*(s')].

Bellman Backup: given an estimate of the V* function (say Vn), backing up Vn at state s calculates a new estimate V{n+1}: V{n+1}(s) = max_a Q{n+1}(s,a), where Q{n+1}(s,a) = Σ_{s'} Pr(s'|s,a) [R(s,a,s') + γ Vn(s')] is the value/cost of the strategy "execute action a in s, execute πn subsequently". The greedy action is a_greedy = argmax_{a ∈ Ap(s)} Q{n+1}(s,a).

Bellman Backup (worked example, diagram): state s0 with actions a1, a2, a3 and successor states s1 (V0 = 20), s2 (V0 = 2), s3 (V0 = 3). Q1(s0,a1) = 20 + 5 = 25; Q1(s0,a2) = 20 + 0.9·2 + 0.1·3; Q1(s0,a3) = 4 + 3. The greedy action is a1 and V1(s0) = 25.

Value iteration [Bellman 57]: assign an arbitrary value V0 to each non-goal state; repeat: for all states s, compute V{n+1}(s) by a Bellman backup at s; until max_s |V{n+1}(s) − Vn(s)| < ε. The quantity |V{n+1}(s) − Vn(s)| is the residual at s, and the stopping test is ε-convergence.
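
A minimal value-iteration sketch over the dictionary-style MDP container above (illustrative names, assuming a reward-maximisation formulation); each sweep applies a Bellman backup at every non-goal state and stops at ε-convergence:

```python
def value_iteration(mdp, epsilon=1e-6):
    """Iterate Bellman backups until the largest residual falls below epsilon."""
    V = {s: 0.0 for s in mdp.states}                 # arbitrary initial assignment V0
    while True:
        residual, V_new = 0.0, {}
        for s in mdp.states:
            if s in mdp.goals:                       # goal states keep value 0
                V_new[s] = 0.0
                continue
            # Bellman backup: V_{n+1}(s) = max_a sum_s' Pr(s'|s,a)[R(s,a,s') + gamma V_n(s')]
            V_new[s] = max(
                sum(p * (mdp.rewards.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                    for s2, p in mdp.transitions[(s, a)])
                for a in mdp.actions[s])
            residual = max(residual, abs(V_new[s] - V[s]))
        V = V_new
        if residual < epsilon:                       # epsilon-convergence
            return V
```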

Complexity of value iteration: one iteration takes O(|A||S|²) time; the number of iterations required is poly(|S|, |A|, 1/(1−γ)). Overall, the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.

Policy Computation: the optimal policy is stationary and time-independent for infinite/indefinite-horizon problems; it can be extracted greedily from V*, π*(s) = argmax_a Q*(s,a). Policy Evaluation: for a fixed policy, the value function is the solution of a system of linear equations in |S| variables.
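
For a fixed policy π, V^π(s) = Σ_{s'} Pr(s'|s,π(s)) [R(s,π(s),s') + γ V^π(s')] for every s, which is exactly a linear system in |S| unknowns. A sketch of solving it directly with numpy (illustrative names, same MDP container as above):

```python
import numpy as np

def policy_evaluation(mdp, policy):
    """Solve (I - gamma * P_pi) V = r_pi, one linear equation per state."""
    idx = {s: i for i, s in enumerate(mdp.states)}
    n = len(mdp.states)
    P, r = np.zeros((n, n)), np.zeros(n)
    for s in mdp.states:
        a = policy.get(s)
        if a is None:                      # e.g. goal/terminal state: value fixed at 0
            continue
        for s2, p in mdp.transitions[(s, a)]:
            P[idx[s], idx[s2]] += p
            r[idx[s]] += p * mdp.rewards.get((s, a, s2), 0.0)
    V = np.linalg.solve(np.eye(n) - mdp.gamma * P, r)
    return {s: V[idx[s]] for s in mdp.states}
```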

Markov Decision Process (MDP) example (diagram): five states s1 (r=0), s2 (r=1), s3 (r=20), s4 (r=0), s5 (r=−10) connected by stochastic transitions.

Value Function and Policy: value residual and policy residual.

Changing the Search Space. Value Iteration: search in value space; compute the resulting policy. Policy Iteration [Howard 60]: search in policy space; compute the resulting value.

Policy iteration [Howard 60]: assign an arbitrary policy π0 to each state; repeat: compute V{n+1}, the evaluation of πn; for all states s compute π{n+1}(s) = argmax_{a ∈ Ap(s)} Q{n+1}(s,a); until π{n+1} = πn. Advantage: searching in a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is often faster; all other properties follow. Modified Policy Iteration: exact policy evaluation is costly (O(n³)), so approximate it by value iteration using the fixed policy.
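
A compact policy-iteration sketch built on the policy_evaluation helper above (illustrative; it uses exact evaluation rather than the modified, value-iteration-based approximation):

```python
def policy_iteration(mdp):
    # pi_0: an arbitrary initial policy
    policy = {s: mdp.actions[s][0] for s in mdp.states if s not in mdp.goals}
    while True:
        V = policy_evaluation(mdp, policy)           # evaluate pi_n (the costly step)
        improved = {
            s: max(mdp.actions[s],
                   key=lambda a: sum(p * (mdp.rewards.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                                     for s2, p in mdp.transitions[(s, a)]))
            for s in policy}
        if improved == policy:                       # pi_{n+1} == pi_n: done
            return policy, V
        policy = improved
```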

LP Formulation: minimise Σ_{s ∈ S} V*(s) under the constraints: for every s, a: V*(s) ≥ R(s) + γ Σ_{s' ∈ S} Pr(s'|s,a) V*(s'). A big LP, so other tricks are used to solve it!
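
A small sketch of that LP using scipy.optimize.linprog (illustrative; here the immediate reward is taken as the expected R(s,a,s') under the transition model, and each constraint is rewritten in the ≤ form linprog expects):

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(mdp):
    idx = {s: i for i, s in enumerate(mdp.states)}
    n = len(mdp.states)
    c = np.ones(n)                                   # minimise sum_s V(s)
    A_ub, b_ub = [], []
    for s in mdp.states:
        for a in mdp.actions[s]:
            # Constraint: V(s) >= E[R(s,a,s')] + gamma * sum_s' Pr(s'|s,a) V(s')
            row, rhs = np.zeros(n), 0.0
            row[idx[s]] -= 1.0
            for s2, p in mdp.transitions[(s, a)]:
                row[idx[s2]] += mdp.gamma * p
                rhs += p * mdp.rewards.get((s, a, s2), 0.0)
            A_ub.append(row)                         # rewritten as row . V <= -rhs
            b_ub.append(-rhs)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return {s: res.x[idx[s]] for s in mdp.states}
```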

Hybrid MDPs. Hybrid Markov decision process: Markov state = (n, x), where n is the discrete component (set of fluents) and x is the continuous component. Bellman's equation: V_n^{t+1}(x) = max_{a ∈ A_n(x)} Σ_{n' ∈ N} Pr(n' | n, x, a) ∫_{x' ∈ X} Pr(x' | n, x, a, n') [ R_{n'}(x') + V_{n'}^t(x') ] dx'.

Convolutions: discrete-discrete; constant-discrete [Feng et al. 04]; constant-constant [Li & Littman 05].

Result of convolutions, probability density function (rows) vs. value function (columns: discrete, constant, linear):
discrete pdf → discrete, constant, linear
constant pdf → constant, linear, quadratic
linear pdf → linear, quadratic, cubic

Value Iteration for Motion Planning (assumes knowledge of the robot's location)

Frontier-based Exploration Every unknown location is a target point.

Manipulator Control Arm with two joints Configuration space

Manipulator Control Path State space Configuration space

Collision Avoidance via Planning Potential field methods have local minima Perform efficient path planning in the local perceptual space Path costs depend on length and closeness to obstacles [Konolige, Gradient method]

Paths and Costs: a path is a list of points P = {p1, p2, …, pk}; pk is the only point in the goal set. The cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next: F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1}). The adjacency cost is typically Euclidean distance; the intrinsic cost is typically occupancy or distance to the nearest obstacle.
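
A tiny sketch of that separable cost (illustrative helper names; the intrinsic and adjacency cost functions are passed in, with Euclidean distance as the default adjacency cost):

```python
import math

def path_cost(path, intrinsic, adjacency=None):
    """F(P) = sum_i I(p_i) + sum_i A(p_i, p_{i+1})."""
    if adjacency is None:
        adjacency = lambda p, q: math.dist(p, q)     # Euclidean distance
    return (sum(intrinsic(p) for p in path) +
            sum(adjacency(p, q) for p, q in zip(path, path[1:])))
```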

Navigation Function: an assignment of a potential field value to every element in configuration space [Latombe, 91]; the goal set is always downhill, so there are no local minima. The navigation function of a point is the cost of the minimal-cost path that starts at that point: N_k = min_{P_k} F(P_k).

Computation of the Navigation Function. Initialization: points in the goal set get cost 0; all other points get infinite cost; the active list is initialized to the goal set. Repeat: take a point from the active list and update its neighbors; if a neighbor's cost changes, add that neighbor to the active list. Until the active list is empty.
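
A sketch of that computation on a 2D occupancy grid (illustrative; unit move cost, so the active-list update reduces to a Dijkstra-style wavefront expanding from the goal set):

```python
import heapq

def navigation_function(grid, goal_cells):
    """grid[r][c] == 1 marks an obstacle; returns cost-to-goal N for every cell."""
    rows, cols = len(grid), len(grid[0])
    N = {(r, c): float("inf") for r in range(rows) for c in range(cols)}
    active = []                                      # the active list (priority queue)
    for cell in goal_cells:                          # initialization: goal set has cost 0
        N[cell] = 0.0
        heapq.heappush(active, (0.0, cell))
    while active:                                    # repeat until the active list is empty
        cost, (r, c) = heapq.heappop(active)
        if cost > N[(r, c)]:
            continue                                 # stale entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1.0                # unit adjacency cost
                if new_cost < N[(nr, nc)]:           # cost changed: (re)activate neighbor
                    N[(nr, nc)] = new_cost
                    heapq.heappush(active, (new_cost, (nr, nc)))
    return N
```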

Challenges Where do we get the state space from? Where do we get the model from? What happens when the world is slightly different? Where does reward come from? Continuous state variables Continuous action space

How to solve larger problems? If the problem is deterministic, use Dijkstra's algorithm. If there are no back-edges, use backward Bellman updates; prioritize Bellman updates to maximize information flow. If the initial state is known, use dynamic programming + heuristic search (LAO*, RTDP and variants). Divide an MDP into sub-MDPs and solve the hierarchy. Aggregate states with similar values. Relational MDPs.

Approximations: n-step lookahead. n = 1: greedy, π1(s) = argmax_a R(s,a). n-step lookahead: πn(s) = argmax_a Qn(s,a), the greedy action with respect to the n-step value Vn.

Approximation: incremental approaches (diagram): deterministic relaxation → deterministic planner → plan → stochastic simulation → identify weakness → solve/merge, and repeat.

Approximations: planning and replanning (diagram): deterministic relaxation → deterministic planner → plan → execute the action → send the state reached back to the planner.

CSE-571 AI-based Mobile Robotics Planning and Control: (1) Reinforcement Learning (2) Partially Observable Markov Decision Processes

Reinforcement Learning: we still have an MDP and are still looking for a policy π. New twist: we don't know Pr and/or R, i.e., we don't know which states are good or what the actions do, so we must actually try out actions and states to learn.

Model-based methods: visit different states, perform different actions, and estimate Pr and R; once the model is built, do planning using value iteration or other methods. Cons: requires huge amounts of data.

Model-free methods: TD learning directly learns the Q*(s,a) values. From an observed transition (s, a, r, s'), form the sample = R(s,a,s') + γ max_{a'} Qn(s',a') and nudge the old estimate towards it: Q{n+1}(s,a) ← (1 − α) Qn(s,a) + α · sample.
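
A minimal sketch of that tabular update (illustrative names); one call processes one observed transition (s, a, r, s'):

```python
from collections import defaultdict

Q = defaultdict(float)         # unseen (s, a) pairs start at 0

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.95):
    """Nudge Q(s,a) toward the sample r + gamma * max_a' Q(s',a')."""
    sample = r + gamma * max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```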

Properties: converges to the optimal Q* if you explore enough and if you make the learning rate (α) small enough, but do not decrease it too quickly.

Exploration vs. Exploitation: ε-greedy. At each time step flip a coin: with probability ε take a random action; with probability 1 − ε take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
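
A sketch of ε-greedy selection over the Q table above (illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:                    # with probability epsilon: explore
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])     # with probability 1 - epsilon: exploit
```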

Q-learning problems: too many states to visit during learning; Q(s,a) is a BIG table; we want to generalize from a small set of training examples. Solutions: value function approximators, policy approximators, hierarchical reinforcement learning.

Task Hierarchy: MAXQ Decomposition [Dietterich 00] (tree diagram): a Root task with subtasks Fetch and Deliver; below them subtasks Take, Navigate(loc), Give; below those Extend-arm, Grab, Release; and primitive actions Move s, Move w, Move e, Move n. Children of a task are unordered.

MAXQ Decomposition: augment the state s by adding the subtask i: [s,i]. Define C([s,i],j) as the reward received in i after j finishes. Then Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr)), i.e. the reward received while navigating plus the reward received after navigation. Express V in terms of C; learn C instead of learning Q.

MAXQ Decomposition (contd) State Abstraction Finding irrelevant actions Finding funnel actions

POMDPs: Recall example

Partially Observable Markov Decision Processes

POMDPs: In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the state under consideration. POMDPs compute a value function over belief space.

Problems: Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution. This is problematic, since probability distributions are continuous. Additionally, we have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.

An Illustrative Example (diagram): two states x1 and x2; terminal actions u1, u2 with payoffs r(x1,u1) = −100, r(x1,u2) = +100, r(x2,u1) = +100, r(x2,u2) = −50; a sensing action u3 that changes the state with probability 0.8 and leaves it unchanged with probability 0.2; and measurements z1, z2 with p(z1|x1) = 0.7, p(z2|x1) = 0.3, p(z1|x2) = 0.3, p(z2|x2) = 0.7.

The Parameters of the Example: The actions u1 and u2 are terminal actions. The action u3 is a sensing action that potentially leads to a state transition. The horizon is finite and γ = 1.

Payoff in POMDPs: In MDPs, the payoff (or return) depended on the state of the system. In POMDPs, however, the true state is not exactly known. Therefore, we compute the expected payoff by integrating over all states: r(b,u) = E_x[r(x,u)] = ∫ r(x,u) b(x) dx.

Payoffs in Our Example (1): If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100. If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100. In between, it is the linear combination of the extreme values weighted by the probabilities: r(b,u1) = −100 b(x1) + 100 b(x2).

Payoffs in Our Example (2)
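
A sketch that reproduces these payoff lines numerically (the reward values are the ones recovered from the example diagram above, i.e. r(x1,u1) = -100, r(x2,u1) = +100, r(x1,u2) = +100, r(x2,u2) = -50; the sensing action u3 is left out here since its line is pruned away anyway):

```python
import numpy as np

# Assumed payoffs of the two-state example (reconstructed from the diagram above).
r = {("x1", "u1"): -100.0, ("x2", "u1"): 100.0,
     ("x1", "u2"):  100.0, ("x2", "u2"): -50.0}

def expected_payoff(p1, u):
    """r(b, u) = p1 * r(x1, u) + (1 - p1) * r(x2, u), with p1 = b(x1)."""
    return p1 * r[("x1", u)] + (1 - p1) * r[("x2", u)]

p1 = np.linspace(0.0, 1.0, 101)
V1_grid = np.maximum(expected_payoff(p1, "u1"),
                     expected_payoff(p1, "u2"))      # upper envelope of the payoff lines
```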

The Resulting Policy for T=1: Given we have a finite POMDP with T=1, we would use V1(b) to determine the optimal policy. In our example, the optimal policy for T=1 picks, for each belief, the action with the higher expected payoff; this is the upper thick graph in the diagram.

Piecewise Linearity, Convexity: The resulting value function V1(b) is the maximum of the three payoff functions at each point; it is piecewise linear and convex.

Pruning: If we carefully consider V1(b), we see that only the first two components contribute. The third component can therefore safely be pruned away from V1(b).

Increasing the Time Horizon: Assume the robot can make an observation before deciding on an action. (Diagram: V1(b).)

Increasing the Time Horizon: Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1 we update the belief using Bayes rule: p'1 = p(z1|x1) p1 / p(z1) = 0.7 p1 / p(z1), p'2 = 0.3 (1 − p1) / p(z1), with p(z1) = 0.7 p1 + 0.3 (1 − p1) = 0.4 p1 + 0.3.
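
A sketch of that Bayes-rule update for the example (measurement model as on the slide: p(z1|x1) = 0.7, p(z1|x2) = 0.3):

```python
def belief_update(p1, p_z_given_x1, p_z_given_x2):
    """Return (posterior b'(x1), normaliser p(z)) after observing z, given prior p1 = b(x1)."""
    p_z = p_z_given_x1 * p1 + p_z_given_x2 * (1 - p1)
    return p_z_given_x1 * p1 / p_z, p_z

# Observing z1: p'(x1) = 0.7 p1 / (0.4 p1 + 0.3)
p1_post, p_z1 = belief_update(0.6, 0.7, 0.3)   # e.g. prior b(x1) = 0.6
```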

Value Function (diagram): V1(b), the updated belief b'(b|z1), and V1(b|z1).

Increasing the Time Horizon: Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1 we update the belief using Bayes rule. Thus V1(b|z1) is given by evaluating V1 at the updated belief b'(b|z1).

Expected Value after Measuring: Since we do not know in advance what the next measurement will be, we have to compute the expected belief: \bar{V}1(b) = E_z[V1(b|z)] = Σ_{i=1..2} p(z_i) V1(b|z_i) = Σ_{i=1..2} p(z_i) V1(p(z_i|x1) p1 / p(z_i)).

Expected Value after Measuring: Since we do not know in advance what the next measurement will be, we have to compute the expected belief.
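
A sketch evaluating that expectation for the example, reusing expected_payoff and belief_update from above (assumed measurement model: p(z1|x1) = 0.7, p(z2|x1) = 0.3):

```python
def V1(p1):
    """T=1 value at belief p1 = b(x1): max over the two terminal actions."""
    return max(expected_payoff(p1, "u1"), expected_payoff(p1, "u2"))

def expected_value_after_measuring(p1):
    """Vbar1(b) = sum_i p(z_i) * V1(b | z_i)."""
    vbar = 0.0
    for pz_x1, pz_x2 in ((0.7, 0.3), (0.3, 0.7)):    # measurement z1, then z2
        p1_post, p_z = belief_update(p1, pz_x1, pz_x2)
        vbar += p_z * V1(p1_post)
    return vbar
```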

Resulting Value Function: The four possible combinations yield the following function which then can be simplified and pruned.

Value Function (diagram): p(z1) V1(b|z1), p(z2) V1(b|z2), and \bar{V}1(b).

State Transitions (Prediction): When the agent selects u3 its state potentially changes. When computing the value function, we have to take these potential state changes into account.

Resulting Value Function after executing u3: Taking the state transitions into account, we finally obtain the value function shown in the diagram.

Value Function after executing u3 (diagram): \bar{V}1(b) and \bar{V}1(b|u3).

Value Function for T=2: Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain (after pruning) the value function V2(b) shown in the diagram.

Graphical Representation of V2(b) (diagram): a region where u1 is optimal, a region where u2 is optimal, and an unclear region where the outcome of measuring is important.

Deep Horizons and Pruning: We have now completed a full backup in belief space. This process can be applied recursively. The value functions for T=10 and T=20 are shown in the following diagrams.

Deep Horizons and Pruning

Why Pruning is Essential: Each update introduces additional linear components to V, and each measurement squares the number of linear components. Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions, and at T=30 we have 10^561,012,337 linear functions. The pruned value function at T=20, in comparison, contains only 12 linear components. This combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.

POMDP Summary: POMDPs compute the optimal action in partially observable, stochastic domains. For finite horizon problems, the resulting value functions are piecewise linear and convex. In each iteration the number of linear constraints grows exponentially. POMDPs so far have only been applied successfully to very small state spaces with small numbers of possible observations and actions.

POMDP Approximations: Point-based value iteration; QMDPs; AMDPs.

Point-based Value Iteration: Maintains a set of example beliefs; only considers constraints that maximize the value function for at least one of the examples.

Point-based Value Iteration (diagram): value functions for T=30, exact value function vs. PBVI.

Example Application

QMDPs: QMDPs only consider state uncertainty in the first step; after that, the world is assumed to become fully observable.

QMDP value and action selection: Q(x_i, u) = r(x_i, u) + Σ_{j=1..N} V(x_j) p(x_j | u, x_i); the action is chosen as arg max_u Σ_{i=1..N} p_i Q(x_i, u).
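
A minimal sketch of QMDP action selection matching these formulas (illustrative names; Q is assumed to hold the Q-values of the underlying fully observable MDP, e.g. obtained by value iteration):

```python
def qmdp_action(belief, Q, actions):
    """belief: {state: probability}; Q: {(state, action): MDP Q-value}.
    Picks argmax_u sum_i b(x_i) Q(x_i, u); uncertainty only enters this first step."""
    return max(actions, key=lambda u: sum(belief[x] * Q[(x, u)] for x in belief))
```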

Augmented MDPs: Augmentation adds an uncertainty component to the state space, e.g. b̄ = (argmax_x b(x), H_b(x)), where H_b(x) = −∫ b(x) log b(x) dx is the entropy of the belief. Planning is performed by an MDP in the augmented state space. The transition, observation and payoff models have to be learned.
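
A sketch of the augmented-state compression itself (illustrative; a discrete belief, given as a numpy array, is summarized by its most likely state and its entropy):

```python
import numpy as np

def augmented_state(belief):
    """Map a discrete belief b to (argmax_x b(x), H_b), the AMDP state."""
    b = np.asarray(belief, dtype=float)
    b = b / b.sum()
    nz = b[b > 0]
    entropy = float(-np.sum(nz * np.log(nz)))        # H_b = -sum_x b(x) log b(x)
    return int(np.argmax(b)), entropy
```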