Partially Observable Markov Decision Processes. Mausam (slides by Dieter Fox)


Stochastic Planning: MDPs. The environment is static and stochastic, actions are instantaneous, the state is fully observable, and percepts are perfect. What action next?

Partially Observable MDPs. The environment is static and stochastic, actions are instantaneous, the state is only partially observable, and percepts are noisy. What action next?

Stochastic, Fully Observable

Stochastic, Partially Observable

POMDPs. In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space:

V_T(b) = max_a [ r(b, a) + γ ∫ V_{T-1}(b') p(b' | a, b) db' ]

POMDPs. Each belief is a probability distribution, so the value function is a function of an entire probability distribution. This is problematic, since probability distributions are continuous, and we also have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and observation spaces and finite horizons, however, we can represent the value functions by piecewise linear functions.
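
As an illustration of this piecewise linear representation, here is a minimal Python sketch that stores a value function as a set of linear components ("alpha vectors") and evaluates it as their pointwise maximum. The numbers are the payoff lines of the two-state example introduced a few slides below; the array contents and helper names are ours, not part of the slides.

```python
import numpy as np

# A piecewise linear, convex value function over beliefs can be stored as a set
# of linear components ("alpha vectors"): V(b) = max_k alpha_k . b.
alphas = np.array([
    [-100.0, 100.0],   # payoff line of terminal action u1: r(x1,u1), r(x2,u1)
    [ 100.0, -50.0],   # payoff line of terminal action u2
    [  -1.0,  -1.0],   # payoff line of the sensing action u3
])

def value(belief, alphas):
    """Evaluate the piecewise linear value function at belief b = [p1, p2]."""
    return float(np.max(alphas @ belief))

print(value(np.array([0.4, 0.6]), alphas))   # -> 20.0 (the u1 line wins here)
```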

Applications: robotic control (helicopter maneuvering, autonomous vehicles); Mars rover path planning and oversubscription planning; elevator planning; game playing (backgammon, tetris, checkers); neuroscience; computational finance and sequential auctions; assisting the elderly in simple tasks; spoken dialog management; communication networks (switching, routing, flow control); war planning and evacuation planning.

An Illustrative Example. A two-state POMDP with states x1 and x2, measurements z1 and z2, and actions u1, u2, u3. Payoffs: r(x1, u1) = -100, r(x2, u1) = +100, r(x1, u2) = +100, r(x2, u2) = -50, and r(x1, u3) = r(x2, u3) = -1 for the sensing action. State transitions under u3: p(x1' | x1, u3) = 0.2, p(x2' | x1, u3) = 0.8, p(x1' | x2, u3) = 0.8, p(x2' | x2, u3) = 0.2. Measurement model: p(z1 | x1) = 0.7, p(z2 | x1) = 0.3, p(z1 | x2) = 0.3, p(z2 | x2) = 0.7.

The Parameters of the Example. The actions u1 and u2 are terminal actions. The action u3 is a sensing action that potentially leads to a state transition. The horizon is finite and T=1.

Payoff in POMDPs. In MDPs, the payoff (or return) depended on the state of the system. In POMDPs, however, the true state is not exactly known. Therefore, we compute the expected payoff by integrating over all states:

r(b, u) = E_x[ r(x, u) ] = ∫ r(x, u) b(x) dx

Payoffs in Our Example (1). If we are totally certain that we are in state x1 and execute action u1, we receive a reward of -100. If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100. In between, it is the linear combination of the extreme values weighted by the probabilities.
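
A minimal Python sketch of this linear interpolation, using the example's payoffs (the function name r_b is ours):

```python
# Expected payoff of a terminal action as a linear combination of the two
# extreme rewards, weighted by the belief; p1 denotes b(x1).
def r_b(p1, r_x1, r_x2):
    return r_x1 * p1 + r_x2 * (1.0 - p1)

print(r_b(1.0, -100, 100))   # certain to be in x1, execute u1 -> -100
print(r_b(0.0, -100, 100))   # certain to be in x2, execute u1 -> +100
print(r_b(0.4, -100, 100))   # in between: linear interpolation -> 20.0
```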

Payoffs in Our Example (2). With p1 = b(x1) and p2 = 1 - p1, this gives r(b, u1) = -100 p1 + 100 p2, r(b, u2) = 100 p1 - 50 p2, and r(b, u3) = -1.

The Resulting Policy for T=1. Given we have a finite POMDP with T=1, we would use V1(b) to determine the optimal policy. In our example, the optimal policy for T=1 is to execute u1 if p1 ≤ 3/7 and u2 otherwise. This is the upper thick graph in the diagram.
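
A quick sketch of this T=1 policy in Python, derived from the example's payoff lines (the helper name policy_T1 is ours):

```python
# The T=1 policy: pick the terminal action with the larger expected payoff.
# The two payoff lines cross at p1 = 3/7.
def policy_T1(p1):
    r_u1 = -100 * p1 + 100 * (1 - p1)
    r_u2 = 100 * p1 - 50 * (1 - p1)
    return "u1" if r_u1 >= r_u2 else "u2"

print(policy_T1(0.2), policy_T1(0.8))   # -> u1 u2
```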

Piecewise Linearity, Convexity. The resulting value function V1(b) is the maximum of the three payoff functions at each point. It is piecewise linear and convex.

Pruning. If we carefully consider V1(b), we see that only the first two components contribute. The third component can therefore safely be pruned away from V1(b).

Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. The diagram shows V1(b).

Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3. Given the observation z1 we update the belief using Bayes rule:

p1' = p(z1 | x1) p1 / p(z1) = 0.7 p1 / p(z1)
p2' = p(z1 | x2) p2 / p(z1) = 0.3 (1 - p1) / p(z1)
p(z1) = 0.7 p1 + 0.3 (1 - p1) = 0.4 p1 + 0.3
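
A small Python sketch of this belief update, with the example's measurement model p(z1|x1)=0.7 and p(z1|x2)=0.3 (the function name is ours):

```python
# Belief update after perceiving z1, following the Bayes rule above.
def update_belief(p1, p_z_x1=0.7, p_z_x2=0.3):
    p_z = p_z_x1 * p1 + p_z_x2 * (1.0 - p1)   # p(z1) = 0.4 p1 + 0.3
    return p_z_x1 * p1 / p_z, p_z             # (p1', p(z1))

print(update_belief(0.5))   # -> (0.7, 0.5): observing z1 shifts belief toward x1
```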

Value Function. The diagram shows V1(b), the updated belief b'(b|z1), and V1(b|z1).

Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3. Given the observation z1 we update the belief using Bayes rule. Thus V1(b|z1) is given by

V1(b|z1) = max{ -100 p1' + 100 p2', 100 p1' - 50 p2' }

with p1' and p2' the updated belief from the previous slide.

Expected Value after Measuring. Since we do not know in advance what the next measurement will be, we have to compute the expected belief:

\bar{V}1(b) = E_z[ V1(b|z) ] = Σ_{i=1..2} p(z_i) V1(b|z_i) = Σ_{i=1..2} p(z_i) V1( p(z_i|x1) p1 / p(z_i), p(z_i|x2) p2 / p(z_i) )

Expected Value after Measuring (continued). Because V1 is piecewise linear, the factor p(z_i) cancels against the normalizer in the updated belief, so each term of the sum can be evaluated directly on the unnormalized beliefs (0.7 p1, 0.3 p2) for z1 and (0.3 p1, 0.7 p2) for z2.
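
Putting the last two slides together, here is a short Python sketch of the expectation over measurements for the two-state example (helper names are ours; V1 is the pruned two-component value function from above):

```python
# Expected value after measuring: E_z[V1(b|z)] = sum_i p(z_i) V1(b|z_i).
def V1(p1):
    return max(-100 * p1 + 100 * (1 - p1),    # payoff line of u1
                100 * p1 -  50 * (1 - p1))    # payoff line of u2

def expected_value_after_measuring(p1):
    total = 0.0
    for p_z_x1, p_z_x2 in [(0.7, 0.3), (0.3, 0.7)]:   # models for z1 and z2
        p_z = p_z_x1 * p1 + p_z_x2 * (1 - p1)
        total += p_z * V1(p_z_x1 * p1 / p_z)          # weight by p(z_i)
    return total

print(expected_value_after_measuring(0.6))   # -> ~46.0
```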

Resulting Value Function. The four possible combinations yield the following function, which can then be simplified and pruned.

Value Function. The diagram shows p(z1) V1(b|z1), p(z2) V1(b|z2), the updated belief b'(b|z1), and the expectation \bar{V}1(b).

State Transitions (Prediction). When the agent selects u3, its state potentially changes. When computing the value function, we have to take these potential state changes into account.

State Transitions (Prediction). When the agent selects u3, its state potentially changes; the predicted belief follows from the transition model:

b'(x1) = 0.2 p1 + 0.8 p2
b'(x2) = 0.8 p1 + 0.2 p2
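
A one-line Python sketch of this prediction step, using the example's transition probabilities under u3 (the function name is ours):

```python
# Prediction step when the sensing action u3 is selected: push the belief
# through the transition model p(x1'|x1,u3)=0.2, p(x1'|x2,u3)=0.8.
def predict_belief(p1):
    return 0.2 * p1 + 0.8 * (1.0 - p1)

print(predict_belief(1.0))   # -> 0.2: if we were surely in x1, u3 likely moved us to x2
print(predict_belief(0.5))   # -> 0.5
```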

Resulting Value Function after Executing u3. Taking the state transitions into account, we finally obtain \bar{V}1(b|u3), i.e. \bar{V}1 evaluated at the predicted belief.

Value Function after Executing u3. The diagram shows \bar{V}1(b) and \bar{V}1(b|u3).

Value Function for T=2. Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain (after pruning)

V2(b) = max{ r(b, u1), r(b, u2), r(b, u3) + \bar{V}1(b|u3) }
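
Combining the payoff, prediction, and measurement pieces, the following Python sketch evaluates V2 for the example; the -1 payoff of the sensing action is the value used in the reconstructed example above, and all helper names are ours:

```python
# V2 for the two-state example: either act immediately (u1 or u2), or sense
# with u3 (payoff -1), predict the transition, measure, then act at T=1.
def V1(p1):
    return max(-100 * p1 + 100 * (1 - p1), 100 * p1 - 50 * (1 - p1))

def value_of_sensing(p1):
    p1_pred = 0.2 * p1 + 0.8 * (1 - p1)               # prediction under u3
    total = -1.0                                       # payoff of u3 itself
    for p_z_x1, p_z_x2 in [(0.7, 0.3), (0.3, 0.7)]:
        p_z = p_z_x1 * p1_pred + p_z_x2 * (1 - p1_pred)
        total += p_z * V1(p_z_x1 * p1_pred / p_z)      # measure, then act
    return total

def V2(p1):
    return max(V1(p1), value_of_sensing(p1))

print(V2(0.5))   # -> 46.5: sensing pays off when the belief is uncertain
```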

Graphical Representation of V2(b). The diagram shows the region where u1 is optimal, the region where u2 is optimal, and an intermediate region where the choice is unclear and the outcome of measuring is important.

Deep Horizons and Pruning. We have now completed a full backup in belief space. This process can be applied recursively. The value functions for T=10 and T=20 are shown in the following plots.

Deep Horizons and Pruning. The plots show the value functions for T=10 and T=20.

Why Pruning is Essential. Each update introduces additional linear components to V. Each measurement squares the number of linear components. Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions. At T=30 we have 10^561,012,337 linear functions. The pruned value function at T=20, in comparison, contains only 12 linear components. The combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
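
To make the pruning idea concrete, here is a small Python sketch that keeps only the linear components that attain the maximum somewhere on a grid of beliefs. Exact solvers perform this test with a linear program rather than a grid; the function name and grid size are ours.

```python
import numpy as np

# Simple pruning for the two-state case: keep a linear component only if it
# attains the maximum at some belief on a fine grid over p1.
def prune(alphas, n_grid=1001, tol=1e-9):
    p1 = np.linspace(0.0, 1.0, n_grid)
    beliefs = np.stack([p1, 1.0 - p1], axis=1)    # all grid beliefs [p1, p2]
    values = beliefs @ alphas.T                    # value of each component
    best = values.max(axis=1, keepdims=True)
    useful = (values >= best - tol).any(axis=0)    # attains the max somewhere?
    return alphas[useful]

alphas = np.array([[-100.0, 100.0], [100.0, -50.0], [-1.0, -1.0]])
print(prune(alphas))   # the constant -1 line (action u3) never attains the max
```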

POMDP Summary. POMDPs compute the optimal action in partially observable, stochastic domains. For finite horizon problems, the resulting value functions are piecewise linear and convex. In each iteration the number of linear constraints grows exponentially. POMDPs so far have only been applied successfully to very small state spaces with small numbers of possible observations and actions.