Markov Decision Processes (Slides from Mausam)

Markov Decision Processes are used across many fields: Machine Learning, Operations Research, Graph Theory, Control Theory, Economics, Robotics, Artificial Intelligence, and Neuroscience/Psychology all use them to model the sequential decision making of a rational agent.

A Statistician's View of MDPs. A Markov chain is a sequential, autonomous process: it models state transitions. Decision theory is a one-step process: it models choice and maximizes utility. A Markov Decision Process combines the two (Markov chain + choice, or decision theory + sequentiality): a sequential process that models state transitions, models choice, and maximizes utility.

A Planning View. An environment varies along several dimensions: static vs. dynamic, predictable vs. unpredictable, fully vs. partially observable, deterministic vs. stochastic actions, perfect vs. noisy percepts, instantaneous vs. durative actions. In every case the agent must answer: what action next?

Classical Planning assumes a static, predictable, fully observable environment with perfect percepts and deterministic, instantaneous actions.

Stochastic Planning (MDPs) keeps the environment static and fully observable, with perfect percepts and instantaneous actions, but actions are stochastic, which makes the environment unpredictable.

Markov Decision Process (MDP)
S: a set of states (a factored state space gives a Factored MDP)
A: a set of actions
Pr(s'|s,a): transition model
C(s,a,s'): cost model
G: a set of goals (absorbing or non-absorbing)
s_0: start state
γ: discount factor
R(s,a,s'): reward model
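
The following is a minimal sketch, not from the slides, of how these components might be represented in Python; the states, actions, probabilities, and rewards below are made up purely for illustration.

```python
# A minimal MDP container: states, actions, transition model, reward
# model, start state, and discount factor. The concrete numbers are
# illustrative only.
MDP = {
    "S": ["s0", "s1", "s2"],
    "A": ["a1", "a2"],
    # Pr[(s, a)] -> list of (s_next, probability) pairs
    "Pr": {
        ("s0", "a1"): [("s1", 0.8), ("s2", 0.2)],
        ("s0", "a2"): [("s2", 1.0)],
        ("s1", "a1"): [("s1", 1.0)],
        ("s1", "a2"): [("s2", 1.0)],
        ("s2", "a1"): [("s2", 1.0)],
        ("s2", "a2"): [("s2", 1.0)],
    },
    # R[(s, a, s_next)] -> immediate reward (0 if unlisted)
    "R": {
        ("s0", "a1", "s1"): 5.0,
        ("s0", "a1", "s2"): 1.0,
        ("s0", "a2", "s2"): 2.0,
    },
    "s0": "s0",
    "gamma": 0.9,
}
```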

Objective of an MDP: find a policy π: S → A which optimizes the chosen criterion (minimizes expected cost to reach a goal, maximizes expected reward, or maximizes expected reward minus cost), discounted or undiscounted, over a finite, infinite, or indefinite horizon, assuming full observability.

Role of the Discount Factor (γ): it keeps the total reward/total cost finite, which is useful for infinite-horizon problems. Intuition (economics): money today is worth more than money tomorrow. Total reward: r_1 + γ r_2 + γ^2 r_3 + ... Total cost: c_1 + γ c_2 + γ^2 c_3 + ...

Examples of MDPs:
Goal-directed, Indefinite Horizon, Cost Minimization MDP: <S, A, Pr, C, G, s_0>. Most often studied in the planning and graph theory communities.
Infinite Horizon, Discounted Reward Maximization MDP: <S, A, Pr, R, γ>. Most often studied in the machine learning, economics, and operations research communities.
Goal-directed, Finite Horizon, Probability Maximization MDP: <S, A, Pr, G, s_0, T>. Also studied in the planning community (most popular).
Oversubscription Planning (non-absorbing goals, Reward Maximization MDP): <S, A, Pr, G, R, s_0>. A relatively recent model.

Bellman Equations for MDP_2 = <S, A, Pr, R, s_0, γ>. Define V*(s), the optimal value, as the maximum expected discounted reward obtainable from state s. V* should satisfy the following equation:
V*(s) = max_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Bellman Backup (MDP_2). Given an estimate of the V* function (say V_n), backing up the V_n function at state s calculates a new estimate V_{n+1}:
Q_{n+1}(s,a) = Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V_n(s') ]
V_{n+1}(s) = max_{a ∈ Ap(s)} Q_{n+1}(s,a)
Here Q_{n+1}(s,a) is the value/cost of the strategy: execute action a in s, then execute π_n subsequently, where π_n = argmax_{a ∈ Ap(s)} Q_n(s,a).
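
Below is a minimal sketch of a single Bellman backup at one state, written against the dictionary-based MDP layout assumed in the earlier snippet (the representation is an assumption, not the slides' notation).

```python
def q_value(mdp, V, s, a):
    """Q_{n+1}(s,a): one-step lookahead through the current estimate V."""
    return sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
               for s2, p in mdp["Pr"][(s, a)])

def bellman_backup(mdp, V, s):
    """Return (V_{n+1}(s), greedy action) from a single backup at s."""
    q = {a: q_value(mdp, V, s, a) for a in mdp["A"]}
    best_a = max(q, key=q.get)
    return q[best_a], best_a
```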

Bellman Backup (example). From state s_0 there are three actions: a_1 reaches s_1 (V_0 = 0) with immediate reward 2; a_2 reaches s_2 (V_0 = 1) with probability 0.9 and s_3 (V_0 = 2) with probability 0.1, with immediate reward 5; a_3 reaches s_3 with immediate reward 4.5. With γ ≈ 1: Q_1(s_0,a_1) = 2 + γ·0, Q_1(s_0,a_2) = 5 + γ(0.9·1 + 0.1·2), Q_1(s_0,a_3) = 4.5 + γ·2, so V_1(s_0) = 6.5 and a_greedy = a_3.

Value Iteration [Bellman 57]:
assign an arbitrary value V_0 to each state
repeat (iteration n+1)
  for all states s: compute V_{n+1}(s) by a Bellman backup at s
until max_s |V_{n+1}(s) - V_n(s)| < ε
The quantity |V_{n+1}(s) - V_n(s)| is the residual at s; the stopping test is called ε-convergence.
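
Here is a minimal value-iteration sketch under the dictionary-based MDP representation assumed above; eps plays the role of the ε in the stopping rule, and the function name is my own.

```python
def value_iteration(mdp, eps=1e-6):
    V = {s: 0.0 for s in mdp["S"]}           # arbitrary initial V_0
    while True:
        V_new = {}
        for s in mdp["S"]:                   # Bellman backup at every state
            V_new[s] = max(
                sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                    for s2, p in mdp["Pr"][(s, a)])
                for a in mdp["A"])
        residual = max(abs(V_new[s] - V[s]) for s in mdp["S"])
        V = V_new
        if residual < eps:                   # epsilon-convergence
            return V
```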

Comments. Value iteration is a decision-theoretic algorithm: dynamic programming, a fixed-point computation, and a probabilistic version of the Bellman-Ford algorithm for shortest-path computation (MDP_1 is the Stochastic Shortest Path problem). Time complexity: one iteration is O(|S|^2 |A|); the number of iterations is poly(|S|, |A|, 1/(1-γ)). Space complexity: O(|S|). Factored MDPs: exponential space, exponential time.

Convergence Properties. V_n → V* in the limit as n → ∞. ε-convergence: the V_n function is within ε of V*. Optimality: the current greedy policy is within 2εγ/(1-γ) of optimal. Monotonicity: if V_0 ≤_p V* then V_n ≤_p V* (V_n is monotonic from below); if V_0 ≥_p V* then V_n ≥_p V* (V_n is monotonic from above); otherwise V_n is non-monotonic.

Policy Computation. The optimal policy π* is stationary and time-independent for infinite/indefinite horizon problems:
π*(s) = argmax_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Policy Evaluation: the value V^π of a fixed policy π satisfies
V^π(s) = Σ_{s'} Pr(s'|s,π(s)) [ R(s,π(s),s') + γ V^π(s') ]
which is a system of linear equations in |S| variables.
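
A sketch of exact policy evaluation by solving that linear system, (I - γ P_π) V = R_π, with NumPy; the dictionary MDP layout and the function name are assumptions for illustration.

```python
import numpy as np

def policy_evaluation(mdp, policy):
    """Solve (I - gamma * P_pi) V = R_pi for the value of a fixed policy."""
    states = mdp["S"]
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))   # P[s, s'] = Pr(s' | s, policy(s))
    R = np.zeros(n)        # R[s]     = expected immediate reward under policy
    for s in states:
        a = policy[s]
        for s2, p in mdp["Pr"][(s, a)]:
            P[idx[s], idx[s2]] = p
            R[idx[s]] += p * mdp["R"].get((s, a, s2), 0.0)
    V = np.linalg.solve(np.eye(n) - mdp["gamma"] * P, R)
    return {s: V[idx[s]] for s in states}
```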

Changing the Search Space. Value iteration searches in value space and then computes the resulting policy. Policy iteration searches in policy space and computes the resulting value.

Policy Iteration [Howard 60]:
assign an arbitrary policy π_0 to each state
repeat
  Policy Evaluation: compute V_{n+1}, the evaluation of π_n (costly: O(n^3))
  Policy Improvement: for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a)
until π_{n+1} = π_n
Advantage: searching in a finite (policy) space, as opposed to an uncountably infinite (value) space, gives faster convergence, and all other properties follow. Modified Policy Iteration: approximate the evaluation step by value iteration with the policy held fixed.
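
Below is a policy-iteration sketch that reuses the policy_evaluation helper from the previous snippet; as before, the representation and names are illustrative assumptions rather than the slides' pseudocode.

```python
def policy_iteration(mdp):
    policy = {s: mdp["A"][0] for s in mdp["S"]}      # arbitrary initial pi_0
    while True:
        V = policy_evaluation(mdp, policy)           # costly evaluation step
        new_policy = {}
        for s in mdp["S"]:                           # greedy improvement step
            new_policy[s] = max(
                mdp["A"],
                key=lambda a: sum(
                    p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                    for s2, p in mdp["Pr"][(s, a)]))
        if new_policy == policy:                     # pi_{n+1} == pi_n
            return policy, V
        policy = new_policy
```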

Modified Policy Iteration:
assign an arbitrary policy π_0 to each state
repeat
  Policy Evaluation: compute V_{n+1}, an approximate evaluation of π_n (a few value-iteration sweeps with the policy held fixed)
  Policy Improvement: for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a)
until π_{n+1} = π_n
Advantage: probably the most competitive synchronous dynamic programming algorithm.

Asynchronous Value Iteration. States may be backed up in any order instead of iteration by iteration. As long as all states are backed up infinitely often, asynchronous value iteration converges to the optimal value function.

Asynchronous VI: Prioritized Sweeping. Why back up a state if the values of its successors are the same as before? Prefer backing up a state whose successors changed the most. Keep a priority queue of (state, expected change in value), back up states in order of priority, and after backing up a state, update the priority queue for all of its predecessors.
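
The following is a rough prioritized-sweeping sketch; the priority measure (the size of the last backup's change) and the data layout are simplified assumptions, not the exact scheme on the slide.

```python
import heapq

def prioritized_sweeping(mdp, sweeps=1000, theta=1e-6):
    V = {s: 0.0 for s in mdp["S"]}
    preds = {s: set() for s in mdp["S"]}         # predecessor map
    for (s, a), outcomes in mdp["Pr"].items():
        for s2, _ in outcomes:
            preds[s2].add(s)

    def backup(s):                               # Bellman backup at s
        return max(sum(p * (mdp["R"].get((s, a, s2), 0.0)
                            + mdp["gamma"] * V[s2])
                       for s2, p in mdp["Pr"][(s, a)])
                   for a in mdp["A"])

    # seed the queue with each state's current Bellman residual
    heap = [(-abs(backup(s) - V[s]), s) for s in mdp["S"]]
    heapq.heapify(heap)
    for _ in range(sweeps):
        if not heap:
            break
        _, s = heapq.heappop(heap)               # highest-priority state
        new_v = backup(s)
        change = abs(new_v - V[s])
        V[s] = new_v
        if change > theta:                       # predecessors may now be stale
            for pred in preds[s]:
                heapq.heappush(heap, (-change, pred))
    return V
```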

Reinforcement Learning

Reinforcement Learning. We still have an MDP and we are still looking for a policy π. The new twist: we do not know Pr and/or R, i.e., we do not know which states are good or what the actions do, so we must actually try out actions to learn.

Model-based methods: visit different states and perform different actions, estimate Pr and R from the observed transitions, and once the model is built, plan using value iteration or other methods. Con: they require huge amounts of data.
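
A sketch of maximum-likelihood model estimation from experience tuples (s, a, r, s_next); the counts-based estimates and the function name are standard practice assumed here, not something spelled out on the slide.

```python
from collections import defaultdict

def estimate_model(experience):
    """Estimate Pr and R from a list of (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    reward_sum = defaultdict(float)                  # (s,a,s') -> total reward
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r
    Pr, R = {}, {}
    for (s, a), succ in counts.items():
        total = sum(succ.values())
        Pr[(s, a)] = [(s2, c / total) for s2, c in succ.items()]
        for s2, c in succ.items():
            R[(s, a, s2)] = reward_sum[(s, a, s2)] / c
    return Pr, R
```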

Model-free methods: directly learn the Q*(s,a) values. On observing a transition from s to s' under action a, form
sample = R(s,a,s') + γ max_{a'} Q_n(s',a')
and nudge the old estimate towards the new sample:
Q_{n+1}(s,a) ← (1-α) Q_n(s,a) + α · sample
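
A minimal tabular Q-learning update matching the formula above; Q is a dict keyed by (state, action), and the names alpha and gamma for the learning rate and discount factor are my own.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Nudge Q(s,a) towards the one-step sample r + gamma * max_a' Q(s',a')."""
    sample = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```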

Properties. Q-learning converges to the optimal values if you explore enough and you make the learning rate (α) small enough, but do not decrease it too quickly:
Σ_i α(s,a,i) = ∞ and Σ_i α^2(s,a,i) < ∞
where i is the number of visits to (s,a).

Model-based vs. Model-free RL. Model-based: estimates O(|S|^2 |A|) parameters, requires relatively more data for learning, and can make use of background knowledge easily. Model-free: estimates O(|S| |A|) parameters and requires relatively less data for learning.

Exploration vs. Exploitation. Exploration: choose actions that visit new states in order to obtain more data for better learning. Exploitation: choose actions that maximize the reward given the currently learnt model. ε-greedy: at each time step flip a coin; with probability ε take a random action, and with probability 1-ε take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
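
A sketch of ε-greedy action selection over a tabular Q, plus one possible decay schedule; the function names and the particular schedule are illustrative assumptions.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With prob. epsilon explore randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Lower epsilon over time to shift from exploration to exploitation."""
    return max(eps_min, eps_start * decay ** step)
```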

Q-learning Problems. There are too many states to visit during learning, and Q(s,a) is still a big table; we want to generalize from a small set of training examples. Techniques: value function approximators, policy approximators, hierarchical reinforcement learning.

Partially Observable Markov Decision Processes

Partially Observable MDPs: the environment is static but unpredictable, only partially observable with noisy percepts, and actions are stochastic and instantaneous.

POMDPs. In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the agent's belief about the current state. POMDPs compute a value function over belief space:
V(b) = max_a [ r(b,a) + γ Σ_{b'} Pr(b'|b,a) V(b') ]
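
A sketch of the Bayes-filter belief update that underlies this: b'(s') is proportional to Pr(o|s',a) Σ_s Pr(s'|s,a) b(s). The nested-dictionary layout for the transition and observation models is an assumption for illustration.

```python
def belief_update(b, a, o, Pr_trans, Pr_obs, states):
    """Update belief b after taking action a and observing o.

    Pr_trans[(s, a)] -> {s': prob} and Pr_obs[(s', a)] -> {o: prob}
    are assumed layouts, not notation from the slides.
    """
    b_new = {}
    for s2 in states:
        predicted = sum(Pr_trans.get((s, a), {}).get(s2, 0.0) * b[s] for s in states)
        b_new[s2] = Pr_obs.get((s2, a), {}).get(o, 0.0) * predicted
    total = sum(b_new.values())
    if total > 0:
        b_new = {s: p / total for s, p in b_new.items()}   # normalize
    return b_new
```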

POMDPs. Each belief is a probability distribution, so the value function is a function of an entire probability distribution. This is problematic, since probability distributions are continuous, and we also have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and observation spaces and finite horizons, however, we can represent the value functions by piecewise linear functions.

Applications:
Robotic control: helicopter maneuvering, autonomous vehicles
Mars rover: path planning, oversubscription planning
Elevator planning
Game playing: backgammon, Tetris, checkers
Neuroscience
Computational finance, sequential auctions
Assisting the elderly in simple tasks
Spoken dialog management
Communication networks: switching, routing, flow control
War planning, evacuation planning