CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes
Planning: What action next?
Environment dimensions: Static vs. Dynamic; Fully vs. Partially Observable; Deterministic vs. Stochastic outcomes; Discrete vs. Continuous; Predictable vs. Unpredictable; Perfect vs. Noisy percepts; Full vs. Partial goal satisfaction. Percepts and Actions connect the agent to its Environment.
Classical Planning: Static, Fully Observable, Deterministic, Discrete environment with Perfect percepts. What action next?
Stochastic Planning: Static, Fully Observable, Stochastic, Discrete environment with Perfect percepts. What action next?
Deterministic, fully observable
Stochastic, Fully Observable
Stochastic, Partially Observable
Markov Decision Process (MDP)
- S: a set of states
- A: a set of actions
- Pr(s'|s,a): transition model
- C(s,a,s'): cost model
- G: set of goals
- s0: start state
- γ: discount factor
- R(s,a,s'): reward model
Role of Discount Factor (γ)
Keeps the total reward/total cost finite
- useful for infinite-horizon problems
- sometimes for indefinite horizon: if there are dead ends
Intuition (economics): money today is worth more than money tomorrow.
Total reward: r1 + γ r2 + γ² r3 + ...
Total cost: c1 + γ c2 + γ² c3 + ...
Objective of a Fully Observable MDP
Find a policy π: S → A which
- minimises expected cost to reach a goal, or
- maximises expected reward, or
- maximises expected (reward − cost),
discounted or undiscounted, given a finite, infinite, or indefinite horizon, assuming full observability.
Examples of MDPs
- Goal-directed, Indefinite Horizon, Cost Minimisation MDP: <S, A, Pr, C, G, s0>
- Infinite Horizon, Discounted Reward Maximisation MDP: <S, A, Pr, R, γ>; Reward = Σ_t γ^t r_t
- Goal-directed, Finite Horizon, Prob. Maximisation MDP: <S, A, Pr, G, s0, T>
Bellman Equations for MDP 1
<S, A, Pr, C, G, s0>
Define J*(s) (optimal cost) as the minimum expected cost to reach a goal from this state. J* should satisfy the following equation:
J*(s) = 0 if s ∈ G
J*(s) = min_a Q*(s,a) otherwise, where Q*(s,a) = Σ_{s'} Pr(s'|s,a) [C(s,a,s') + J*(s')]
Bellman Equations for MDP 2
<S, A, Pr, R, s0, γ>
Define V*(s) (optimal value) as the maximum expected discounted reward from this state. V* should satisfy the following equation:
V*(s) = max_a Σ_{s'} Pr(s'|s,a) [R(s,a,s') + γ V*(s')]
Bellman Backup
Given an estimate of the V* function (say V_n), backup the V_n function at state s to calculate a new estimate (V_{n+1}):
Q_{n+1}(s,a): value/cost of the strategy: execute action a in s, then execute π_n subsequently,
where π_n = argmax_{a ∈ Ap(s)} Q_n(s,a) (greedy action)
Bellman Backup (example)
[Figure: state s0 with actions a1, a2, a3 leading to successors s1 (V0 = 20), s2 (V0 = 2), s3 (V0 = 3); Q1(s,a1) = 20 + 5, Q1(s,a2) = 20 + 0.9×2 + 0.1×3, Q1(s,a3) = 4 + 3; the greedy action is a_greedy = a1, giving V1 = 25.]
Value iteration [Bellman 57]
- assign an arbitrary assignment V0 to each non-goal state.
- repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s.
- until max_s |V_{n+1}(s) − V_n(s)| < ε
The quantity |V_{n+1}(s) − V_n(s)| is the residual at s; the stopping test is called ε-convergence.
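The loop above can be sketched in a few lines of Python. The 3-state MDP below (its states, actions, probabilities, and rewards) is a hypothetical example invented for illustration, not one from the slides:

```python
# Value iteration on a tiny reward-maximisation MDP (hypothetical example).

GAMMA = 0.9
EPS = 1e-6

# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 10.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},          # absorbing goal state
}

def value_iteration(P, gamma=GAMMA, eps=EPS):
    V = {s: 0.0 for s in P}                # arbitrary initial assignment V0
    while True:
        V_new = {}
        for s, actions in P.items():       # Bellman backup at every state
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        residual = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if residual < eps:                 # epsilon-convergence test
            return V

V = value_iteration(P)
```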
Complexity of value iteration
One iteration takes O(|A||S|²) time.
Number of iterations required: poly(|S|, |A|, 1/(1−γ)).
Overall: the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.
Policy Computation
The optimal policy is stationary and time-independent for infinite/indefinite horizon problems: π*(s) = argmax_a Q*(s,a).
Policy evaluation: a system of linear equations in |S| variables.
Markov Decision Process (MDP): example
[Figure: five-state MDP with states s1...s5, rewards r(s1)=0, r(s2)=1, r(s3)=20, r(s4)=0, r(s5)=−10, and stochastic transition probabilities such as 0.7, 0.9, 0.3, 0.99, 0.4, 0.2, 0.8 on the edges.]
Value Function and Policy Value residual and policy residual
Changing the Search Space
- Value Iteration: search in value space; compute the resulting policy
- Policy Iteration [Howard 60]: search in policy space; compute the resulting value
Policy iteration [Howard 60]
- assign an arbitrary policy π0 to each state.
- repeat
  - compute V_{n+1}: the evaluation of π_n
  - for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a)
- until π_{n+1} = π_n
Advantage: searching in a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is faster; all other properties follow!
Modified Policy Iteration: exact policy evaluation is costly, O(n³), so approximate it by value iteration with a fixed policy.
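A minimal policy-iteration sketch, with evaluation done by fixed-policy value iteration as in the Modified Policy Iteration variant described above. The 2-state MDP (its actions and rewards) is a hypothetical example, not from the slides:

```python
# Policy iteration on a toy 2-state MDP (hypothetical example).

GAMMA = 0.9

# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 0, 2.0)]},
}

def evaluate(policy, P, gamma=GAMMA, sweeps=500):
    """Approximate policy evaluation by value iteration with a fixed policy."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: sum(p * (r + gamma * V[s2])
                    for p, s2, r in P[s][policy[s]]) for s in P}
    return V

def improve(V, P, gamma=GAMMA):
    """Greedy policy improvement: pi(s) = argmax_a Q(s,a)."""
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in P[s][a]))
            for s in P}

policy = {s: "stay" for s in P}
while True:
    V = evaluate(policy, P)
    new_policy = improve(V, P)
    if new_policy == policy:              # converged: pi_{n+1} == pi_n
        break
    policy = new_policy
```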
LP Formulation
minimise Σ_{s∈S} V*(s)
under constraints: for every s, a:
V*(s) ≥ R(s) + γ Σ_{s'∈S} Pr(s'|s,a) V*(s')
A big LP, so other tricks are used to solve it!
Hybrid MDPs
Hybrid Markov decision process: Markov state = (n, x), where n is the discrete component (set of fluents) and x is the continuous component.
Bellman's equation:
V^{t+1}_n(x) = max_{a ∈ A_n(x)} Σ_{n'∈N} Pr(n'|n,x,a) ∫_{x'∈X} Pr(x'|n,x,a,n') [ R_{n'}(x') + V^t_{n'}(x') ] dx'
Convolutions
- discrete ⊗ discrete
- constant ⊗ discrete [Feng et al. 04]
- constant ⊗ constant [Li & Littman 05]
Result of convolutions (probability density function × value function):

pdf \ value fn | discrete | constant  | linear
discrete       | discrete | constant  | linear
constant       | constant | linear    | quadratic
linear         | linear   | quadratic | cubic
Value Iteration for Motion Planning (assumes knowledge of robot s location)
Frontier-based Exploration Every unknown location is a target point.
Manipulator Control Arm with two joints Configuration space
Manipulator Control Path State space Configuration space
Manipulator Control Path State space Configuration space
Collision Avoidance via Planning Potential field methods have local minima Perform efficient path planning in the local perceptual space Path costs depend on length and closeness to obstacles [Konolige, Gradient method]
Paths and Costs
A path is a list of points P = {p1, p2, ..., pk}; pk is the only point in the goal set.
The cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next:
F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1})
Adjacency cost: typically Euclidean distance.
Intrinsic cost: typically occupancy, distance to obstacle.
Navigation Function
Assignment of a potential field value to every element in configuration space [Latombe, 91]. The goal set is always downhill; there are no local minima.
The navigation function of a point is the cost of the minimal-cost path that starts at that point:
N_k = min_{P_k} F(P_k)
Computation of the Navigation Function
Initialization:
- points in the goal set: 0 cost
- all other points: infinite cost
- active list: the goal set
Repeat:
- take a point from the active list and update its neighbors
- if a cost changes, add that point to the active list
Until the active list is empty.
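The wavefront procedure above can be sketched on a 4-connected grid. The map and the unit intrinsic cost per free cell are hypothetical choices for illustration:

```python
# Wavefront computation of a navigation function on a 4-connected grid
# (illustrative sketch; unit cost per move, hypothetical map).
from collections import deque

FREE = "."
grid = ["....",
        ".##.",
        "...."]

def navigation_function(grid, goals):
    rows, cols = len(grid), len(grid[0])
    # Free cells start at infinite cost; obstacles are excluded entirely.
    N = {(r, c): float("inf") for r in range(rows) for c in range(cols)
         if grid[r][c] == FREE}
    active = deque()
    for g in goals:                      # goal set gets cost 0
        N[g] = 0.0
        active.append(g)
    while active:                        # repeat until the active list is empty
        r, c = active.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (nr, nc) in N and N[r, c] + 1 < N[nr, nc]:
                N[nr, nc] = N[r, c] + 1  # cost changed: re-activate neighbor
                active.append((nr, nc))
    return N

N = navigation_function(grid, goals=[(0, 0)])
```

Following the negative gradient of N from any free cell then leads to the goal with no local minima.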
Challenges Where do we get the state space from? Where do we get the model from? What happens when the world is slightly different? Where does reward come from? Continuous state variables Continuous action space
How to solve larger problems?
- If the problem is deterministic: use Dijkstra's algorithm.
- If there are no back-edges: use backward Bellman updates.
- Prioritize Bellman updates to maximize information flow.
- If the initial state is known: use dynamic programming + heuristic search (LAO*, RTDP and variants).
- Divide an MDP into sub-MDPs and solve the hierarchy.
- Aggregate states with similar values.
- Relational MDPs.
Approximations: n-step lookahead
n = 1 (greedy): π1(s) = argmax_a R(s,a)
n-step lookahead: π_n(s) = argmax_a Q_n(s,a)
Approximation: Incremental approaches
deterministic relaxation → deterministic planner → plan → stochastic simulation → identify weaknesses → solve/merge
Approximations: Planning and Replanning
deterministic relaxation → deterministic planner → plan → execute the action → send the state reached back to the planner
CSE-571 AI-based Mobile Robotics Planning and Control: (1) Reinforcement Learning (2) Partially Observable Markov Decision Processes
Reinforcement Learning
We still have an MDP and are still looking for a policy. New twist: we don't know Pr and/or R, i.e. we don't know which states are good or what the actions do. We must actually try out actions and states to learn.
Model-based methods
Visit different states, perform different actions; estimate Pr and R. Once the model is built, plan using value iteration or other methods.
Cons: requires _huge_ amounts of data.
Model-free methods
TD learning: directly learn the Q*(s,a) values.
sample = R(s,a,s') + γ max_{a'} Q_n(s',a')
Nudge the old estimate towards the new sample:
Q_{n+1}(s,a) ← (1 − α) Q_n(s,a) + α · sample
Properties
Converges to the optimal Q* if you explore enough and make the learning rate (α) small enough, but do not decrease it too quickly.
Exploration vs. Exploitation
ε-greedy: each time step, flip a coin. With probability ε, act randomly; with probability 1 − ε, take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
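The TD update and the ε-greedy rule fit together in a short tabular Q-learning loop. The 2-state chain environment below is a hypothetical toy example, not from the slides:

```python
# Tabular Q-learning with an epsilon-greedy behavior policy
# on a toy 2-state chain (hypothetical environment).
import random

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic toy dynamics: 'right' from state 1 earns reward 10."""
    if s == 1 and a == "right":
        return 0, 10.0                    # back to the start with reward 10
    if s == 0 and a == "right":
        return 1, 0.0
    return 0, 0.0

random.seed(0)
Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
s = 0
for _ in range(20000):
    if random.random() < EPSILON:         # explore with probability epsilon
        a = random.choice(ACTIONS)
    else:                                 # otherwise take the greedy action
        a = max(ACTIONS, key=lambda act: Q[s, act])
    s2, r = step(s, a)
    sample = r + GAMMA * max(Q[s2, a2] for a2 in ACTIONS)
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * sample   # nudge toward sample
    s = s2
```

Because Q-learning is off-policy, the learned Q values approach the optimal ones even though the behavior policy keeps exploring.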
Q-learning
Problems:
- Too many states to visit during learning
- Q(s,a) is a BIG table
- We want to generalize from a small set of training examples
Solutions:
- Value function approximators
- Policy approximators
- Hierarchical Reinforcement Learning
Task Hierarchy: MAXQ Decomposition [Dietterich 00]
Children of a task are unordered.
[Figure: task graph with Root over Fetch and Deliver; Fetch uses Take (Extend-arm, Grab) and Navigate(loc); Deliver uses Navigate(loc) and Give (Extend-arm, Release); Navigate(loc) uses Move s, Move w, Move e, Move n.]
MAXQ Decomposition
Augment the state s by adding the subtask i: [s,i].
Define C([s,i],j) as the reward received in i after j finishes.
Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))
The first term is the reward received while navigating; the second is the reward received after navigation.
Express V in terms of C; learn C instead of learning Q.
MAXQ Decomposition (contd) State Abstraction Finding irrelevant actions Finding funnel actions
POMDPs: Recall example
Partially Observable Markov Decision Processes
POMDPs
In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space. 29.11.2007 CSE-571- AI-based Mobile Robotics 54
Problems
Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution. This is problematic, since probability distributions are continuous. Additionally, we have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
An Illustrative Example
Two states x1, x2; actions u1 and u2 are terminal, u3 is a sensing action.
Measurement model: p(z1|x1) = 0.7, p(z2|x1) = 0.3; p(z1|x2) = 0.3, p(z2|x2) = 0.7.
State transitions under u3: x1 → x2 with probability 0.8 (stays in x1 with 0.2); x2 → x1 with probability 0.8 (stays in x2 with 0.2).
Payoffs: r(x1,u1) = −100, r(x2,u1) = +100; r(x1,u2) = +100, r(x2,u2) = −50.
The Parameters of the Example
The actions u1 and u2 are terminal actions. The action u3 is a sensing action that potentially leads to a state transition. The horizon is finite and γ = 1.
Payoff in POMDPs
In MDPs, the payoff (or return) depends on the state of the system. In POMDPs, however, the true state is not exactly known. Therefore, we compute the expected payoff by integrating over all states:
r(b,u) = E_x[r(x,u)] = Σ_x b(x) r(x,u)
Payoffs in Our Example (1)
If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100. If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100. In between, the payoff is the linear combination of the extreme values weighted by the probabilities:
r(b,u1) = −100 p1 + 100 (1 − p1)
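This linearity in p1 = b(x1) is easy to check numerically. The u2 payoffs used below (r(x1,u2) = +100, r(x2,u2) = −50) are taken to be those of this standard example; treat them as an assumption if your version of the example differs:

```python
# Expected payoff r(b,u) as a linear function of the belief p1 = b(x1).
# Payoff table assumed from the example: u1 -> (-100, +100), u2 -> (+100, -50).
R = {("x1", "u1"): -100, ("x2", "u1"): 100,
     ("x1", "u2"): 100, ("x2", "u2"): -50}

def r_b(p1, u):
    """r(b,u) = p1 * r(x1,u) + (1 - p1) * r(x2,u)."""
    return p1 * R["x1", u] + (1 - p1) * R["x2", u]

# r(b,u1) = 100 - 200*p1 and r(b,u2) = 150*p1 - 50 cross at p1 = 3/7.
crossing = 3 / 7
```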
Payoffs in Our Example (2)
The Resulting Policy for T=1
Given a finite POMDP with T=1, we would use V1(b) to determine the optimal policy. In our example, the optimal policy for T=1 is:
π1(b) = u1 if p1 < 3/7, u2 otherwise.
This is the upper thick graph in the diagram.
Piecewise Linearity, Convexity
The resulting value function V1(b) is the maximum of the three payoff functions at each point. It is piecewise linear and convex.
Pruning
If we carefully consider V1(b), we see that only the first two components contribute. The third component can therefore safely be pruned away from V1(b).
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action.
[Figure: V1(b)]
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1, we update the belief using Bayes' rule:
p'1 = p(z1|x1) p1 / p(z1) = 0.7 p1 / p(z1)
p'2 = p(z1|x2) p2 / p(z1) = 0.3 (1 − p1) / p(z1)
p(z1) = 0.7 p1 + 0.3 (1 − p1) = 0.4 p1 + 0.3
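The Bayes update above is short enough to verify directly in code, using exactly the measurement probabilities of the example:

```python
# Bayes update of the belief after perceiving z1 in the example:
# p(z1|x1) = 0.7, p(z1|x2) = 0.3.
def update_belief(p1, p_z_given_x1=0.7, p_z_given_x2=0.3):
    """Return (p'1, p(z1)) for belief p1 = b(x1)."""
    p_z = p_z_given_x1 * p1 + p_z_given_x2 * (1 - p1)   # p(z1) = 0.4*p1 + 0.3
    return p_z_given_x1 * p1 / p_z, p_z

# Starting from a uniform belief p1 = 0.5, perceiving z1 shifts it toward x1.
p1_new, p_z1 = update_belief(0.5)
```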
Value Function
[Figure: V1(b) and the belief-updated value function V1(b|z1).]
Increasing the Time Horizon
Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1, we update the belief using Bayes' rule. Thus V1(b|z1) is given by V1 evaluated at the updated belief b' = (p'1, p'2).
Expected Value after Measuring
Since we do not know in advance what the next measurement will be, we have to compute the expected belief:
V̄1(b) = E_z[V1(b|z)] = Σ_{i=1}^{2} p(z_i) V1(b|z_i) = Σ_{i=1}^{2} p(z_i) V1( p(z_i|x1) p1 / p(z_i) )
Resulting Value Function
The four possible combinations yield the following function, which then can be simplified and pruned.
Value Function
[Figure: p(z1) V1(b|z1), p(z2) V1(b|z2), and their sum V̄1(b).]
State Transitions (Prediction)
When the agent selects u3, its state potentially changes. When computing the value function, we have to take these potential state changes into account.
Resulting Value Function after Executing u3
Taking the state transitions into account, we finally obtain the value function after executing u3.
Value Function after Executing u3
[Figure: V̄1(b) and V̄1(b|u3).]
Value Function for T=2
Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain (after pruning) the value function V2(b).
Graphical Representation of V2(b)
[Figure: regions where u1 is optimal and where u2 is optimal; in the unclear region in between, the outcome of measuring is important.]
Deep Horizons and Pruning
We have now completed a full backup in belief space. This process can be applied recursively, yielding value functions for T=10, T=20, and beyond.
Deep Horizons and Pruning
Why Pruning is Essential
Each update introduces additional linear components to V. Each measurement squares the number of linear components. Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions; at T=30 we have 10^561,012,337 linear functions. The pruned value function at T=20, in comparison, contains only 12 linear components. This combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
POMDP Summary
POMDPs compute the optimal action in partially observable, stochastic domains. For finite-horizon problems, the resulting value functions are piecewise linear and convex. In each iteration the number of linear constraints grows exponentially. So far, POMDPs have only been applied successfully to very small state spaces with small numbers of possible observations and actions.
POMDP Approximations
- Point-based value iteration
- QMDPs
- AMDPs
Point-based Value Iteration
Maintains a set of example beliefs. Only considers constraints that maximize the value function for at least one of the examples.
Point-based Value Iteration
[Figure: value functions for T=30, exact value function vs. PBVI.]
Example Application
QMDPs
QMDPs only consider state uncertainty in the first step; after that, the world is assumed to become fully observable.
The QMDP values are obtained from the underlying fully observable MDP:
Q(x_i, u) = r(x_i, u) + Σ_{j=1}^{N} V(x_j) p(x_j | x_i, u)
and the action is selected as:
argmax_u Σ_{i=1}^{N} p_i Q(x_i, u)
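Once Q(x,u) has been computed for the underlying MDP, QMDP action selection is a one-liner. The belief and Q-table below reuse the two-state example's terminal payoffs as a stand-in for fully solved MDP values (an illustrative assumption):

```python
# QMDP action selection: weight the MDP Q-values by the current belief.
def qmdp_action(belief, Q, actions):
    """Return argmax_u sum_i p_i * Q(x_i, u)."""
    return max(actions,
               key=lambda u: sum(p * Q[x, u] for x, p in belief.items()))

# Hypothetical Q-table (here: the example's terminal payoffs).
Q = {("x1", "u1"): -100, ("x2", "u1"): 100,
     ("x1", "u2"): 100, ("x2", "u2"): -50}

# A belief concentrated on x2 makes u1 the QMDP choice.
action = qmdp_action({"x1": 0.2, "x2": 0.8}, Q, ["u1", "u2"])
```

Note that this heuristic never chooses a purely information-gathering action such as u3, since it assumes the state becomes fully observable after one step.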
Augmented MDPs
Augmentation adds an uncertainty component to the state space, e.g.
b̄ = ( argmax_x b(x), H_b(x) ),  where H_b(x) = −∫ b(x) log b(x) dx
Planning is performed by an MDP in the augmented state space. Transition, observation and payoff models have to be learned.