Heuristic Search Value Iteration (Trey Smith). Presenter: Guillermo Vázquez, November 2007


1 Heuristic Search Value Iteration Trey Smith Presenter: Guillermo Vázquez November 2007

2 What is HSVI? Heuristic Search Value Iteration is an algorithm that approximates POMDP solutions. HSVI stores an upper and a lower bound on the optimal value function V*. It selects belief points at which to update the upper and lower bounds, bringing the bounds closer to V*; these belief points are chosen by heuristics that guide exploration of the POMDP's search graph.

3 HSVI's basic idea [figure: bounds over a two-state belief space] V_U(b) is the upper bound, V*(b) is the exact optimal value function, V_L(b) is the lower bound.

4 HSVI's basic idea [figure: before and after locally updating at a belief b; the local update tightens V_U(b) and V_L(b) toward V*(b)]

5 Why is HSVI a point-based algorithm? The main problem with exact value iteration algorithms is that they generate an exponential number of vectors at each iteration. Suppose the value function at horizon t is represented by a set V' of vectors; in the worst case the next value function V has |V| = |A| |V'|^|O| vectors, where A is the set of actions and O is the set of observations. Every iteration (update) can therefore cause exponential growth in the number of vectors representing V.
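To make the blow-up concrete, here is a small illustrative calculation (my own example, not from the slides) of the worst-case vector counts for a toy problem with 2 actions and 2 observations:

# Worst-case growth of the vector set under exact value iteration:
#   |V_{t+1}| = |A| * |V_t| ^ |O|
num_actions, num_observations = 2, 2   # toy sizes, chosen only for illustration
count = 1                              # start from a single vector
for t in range(5):
    count = num_actions * count ** num_observations
    print(f"horizon {t + 1}: {count} vectors in the worst case")
# prints 2, 8, 128, 32768, 2147483648: doubly exponential growth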

6 Why is HSVI a point-based algorithm? (cont.) Exact value iteration algorithms plan for all beliefs in the belief simplex. But some beliefs are much less likely to be reached than others, and so it seems unnecessary to plan equally for all beliefs. Point-based value iteration algorithms focus on the most probable beliefs.

7 HSVI - Notation The lower bound is denoted V_L and the upper bound V_U. Define the interval function V̂(b) = [V_L(b), V_U(b)], and define the width of the interval function at b to be width(V̂(b)) = V_U(b) − V_L(b). The width at b measures the uncertainty at b.

8 HSVI Algorithm Outline

Initialize bounds V_L, V_U
while width(V̂(b_0)) > ε:
    explore(b_0, ε, 0)
return policy π

function explore(b, ε, t):
    if width(V̂(b)) ≤ ε γ^(-t): return
    select an action a* and an observation o* according to the search heuristics
    explore(τ(b, a*, o*), ε, t+1)
    perform a point-based update of V̂ at b
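As a rough illustration of this control flow, here is a minimal Python skeleton (my own sketch, not the authors' code); the bounds object with its width, select_action, select_observation, tau, update and extract_policy methods is assumed, standing in for the bound representations described on the following slides:

def hsvi(b0, epsilon, gamma, bounds):
    # Run trials from the initial belief until the bound gap at b0 is within epsilon.
    while bounds.width(b0) > epsilon:
        explore(b0, epsilon, gamma, bounds, depth=0)
    return bounds.extract_policy()   # policy implied by the lower-bound vectors

def explore(b, epsilon, gamma, bounds, depth):
    # Stop once the remaining gap, discounted back to the root, is small enough.
    if bounds.width(b) <= epsilon * gamma ** (-depth):
        return
    a_star = bounds.select_action(b)   # IE-MAX heuristic (greedy on the upper bound)
    o_star = bounds.select_observation(b, a_star, epsilon, gamma, depth)   # weighted excess uncertainty
    explore(bounds.tau(b, a_star, o_star), epsilon, gamma, bounds, depth + 1)
    bounds.update(b)   # point-based update of V_L and V_U at b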

9 HSVI - Bounds The lower bound V_L is represented by the usual set Γ of alpha vectors. Updating the lower bound V_L means adding a vector to Γ. The upper bound V_U is represented by a finite set Υ of belief/value points (b_i, υ_i). Updating the upper bound V_U means adding a new point to Υ.
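As a sketch of how these two representations can be evaluated at an arbitrary belief (my own illustration, assuming NumPy and assuming corner values are stored separately from the interior points in Υ): the lower bound is the maximum of the alpha vectors, and the upper bound is interpolated from the point set, here with the cheap "sawtooth" interpolation (the exact projection instead solves a small LP):

import numpy as np

def lower_bound_value(b, Gamma):
    # V_L(b): best alpha vector in Gamma evaluated at belief b
    return max(float(np.dot(alpha, b)) for alpha in Gamma)

def upper_bound_value(b, corner_values, Upsilon):
    # V_U(b) via sawtooth interpolation; corner_values[s] is the stored value at corner belief e_s
    base = float(np.dot(b, corner_values))        # interpolation using corner points only
    v = base
    for b_i, v_i in Upsilon:                      # interior belief/value points (b_i, v_i)
        base_i = float(np.dot(b_i, corner_values))
        c = min(b[s] / b_i[s] for s in range(len(b)) if b_i[s] > 0)
        v = min(v, base + c * (v_i - base_i))
    return v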

10 HSVI Lower Bound V_L initialization The lower bound V_L is initialized using the blind policy method suggested in [Hauskrecht, 1997]. Compute the value function of every one-action policy, where a one-action policy always selects the same particular action a. This gives a lower bound with |A| vectors: V_blind := max{α^{a_1}, α^{a_2}, ..., α^{a_|A|}}. The idea is that the worst you can do is no worse than always choosing the best fixed action (the one with maximum expected value), so this is a valid lower bound.
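A minimal sketch of this initialization (my own illustration; the layout T[a] as an |S|x|S| transition matrix, R[a] as an |S| reward vector, and the fixed iteration count are assumptions):

import numpy as np

def blind_policy_alphas(T, R, gamma, iters=200):
    # One alpha vector per action: the value of blindly repeating that action forever.
    num_actions = len(T)
    num_states = T[0].shape[0]
    alphas = [np.zeros(num_states) for _ in range(num_actions)]
    for _ in range(iters):
        for a in range(num_actions):
            # alpha_a(s) = R(s,a) + gamma * sum_s' T(s,a,s') * alpha_a(s')
            alphas[a] = R[a] + gamma * T[a] @ alphas[a]
    return alphas   # V_L(b) = max_a alpha_a . b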

11 Why use the Blind Policy Method? All POMDPs have blind policies. The value function of a blind policy is easy to compute and is linear, so the blind policy method generates a PWLC representation. The class contains only |A| policies, so it is easy to evaluate them all: O(|A|·|S|^3).

12 V L of the Tiger Problem Using a discount factor γ=.95

13 HSVI Upper Bound V_U initialization The upper bound V_U is initialized using the Fast Informed Bound (FIB) approximation [Hauskrecht, 2000]. First solve the underlying MDP, denoted V_MDP, and use that vector to initialize each α^a, a ∈ A. Hauskrecht computes the upper bound V_FIB exactly, based on the observation that it equals the optimal value function of a certain MDP with |A|·|O|·|S| states; this MDP can be constructed and solved in time polynomial in |A|, |O| and |S|. HSVI instead uses a simple iterative approach to approximate V_FIB. This approximation keeps one vector α^a for each action a and repeatedly applies the update α^a_{t+1}(s) = R(s,a) + γ Σ_o max_{a'} Σ_{s'} Pr(s', o | s, a) α^{a'}_t(s')

14 HSVI Upper Bound V_U initialization Such a method gives an upper bound with |A| vectors, V_FIB = {α^{a_1}, α^{a_2}, ..., α^{a_|A|}}. When the FIB iteration is stopped, each corner point corresponding to a state s is initialized to the maximum value max_a α^a(s). The basic idea of this approach is to be optimistic about the solution (i.e. assume we could do better with more information). Simply solving the underlying MDP is too optimistic and produces a weak upper bound; FIB gives a tighter upper bound by being less optimistic and taking some of the observation uncertainty into account.
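A sketch of the iterative FIB approximation described above (my own illustration; the layout T[a] holding Pr(s'|s,a), Z[a] holding Pr(o|s',a), R[a] as an |S| reward vector, and v_mdp as the underlying-MDP value function computed elsewhere are all assumptions):

import numpy as np

def fib_upper_bound(T, Z, R, gamma, v_mdp, iters=100):
    num_actions, num_states = len(T), T[0].shape[0]
    num_obs = Z[0].shape[1]
    # Initialize each alpha_a with V_MDP, as the slides suggest.
    alphas = np.tile(v_mdp, (num_actions, 1))
    for _ in range(iters):
        new = np.zeros_like(alphas)
        for a in range(num_actions):
            for s in range(num_states):
                total = 0.0
                for o in range(num_obs):
                    # Pr(s', o | s, a) = Pr(s'|s,a) * Pr(o|s',a)
                    pso = T[a][s, :] * Z[a][:, o]
                    total += max(pso @ alphas[ap] for ap in range(num_actions))
                new[a, s] = R[a][s] + gamma * total
        alphas = new
    return alphas   # corner point for state s is initialized to max_a alphas[a, s]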

15 V U of the Tiger Problem Only the endpoints of the upper bound are added to Υ

16 HSVI Algorithm Outline (repeated)

Initialize bounds V_L, V_U
while width(V̂(b_0)) > ε:
    explore(b_0, ε, 0)
return policy π

function explore(b, ε, t):
    if width(V̂(b)) ≤ ε γ^(-t): return
    select an action a* and an observation o* according to the search heuristics
    explore(τ(b, a*, o*), ε, t+1)
    perform a point-based update of V̂ at b

17 HSVI While loop While the width (i.e. the distance between V_U and V_L) at the given initial belief b_0 is greater than a specified regret (precision) ε, repeatedly explore the search graph. A trial starts at b_0 and explores forward; at each forward step, the bounds at the current node are updated and a successor node is chosen via the heuristics for picking an action a* and an observation o*.

18 HSVI Search graph for Tiger Problem

19 HSVI Algorithm Outline (repeated)

Initialize bounds V_L, V_U
while width(V̂(b_0)) > ε:
    explore(b_0, ε, 0)
return policy π

function explore(b, ε, t):
    if width(V̂(b)) ≤ ε γ^(-t): return
    select an action a* and an observation o* according to the search heuristics
    explore(τ(b, a*, o*), ε, t+1)
    perform a point-based update of V̂ at b

20 HSVI What does explore() do? The explore function selects an action a* and an observation o* to decide which child of the current node b to visit next; the child node is τ(b, a*, o*), i.e. the belief that results from taking action a* and observing o* at belief b. We formally define the regret of a policy π at belief b to be regret(π, b) = V*(b) − V^π(b), that is, the difference between the optimal value at b and the value of policy π at b. Because we want to return a policy π with small regret, HSVI prioritizes the updates that will most reduce the regret at b_0, i.e. reduce the uncertainty at b_0, denoted width(V̂(b_0)).

21 HSVI How to select action a*? Define the interval function Q̂(b, a) = [Q_{V_L}(b, a), Q_{V_U}(b, a)]. We greedily select the action a* = argmax_a Q_{V_U}(b, a). The idea behind choosing a* greedily with respect to the upper bound is that actions that currently seem to perform well are more likely to be part of an optimal policy, so selecting such actions leads HSVI to update beliefs whose values are relevant to good policies. This is sometimes called the IE-MAX heuristic [Kaelbling, 1995].

22 HSVI How to select observation o*? HSVI uses the weighted excess uncertainty heuristic. Excess uncertainty at a belief b at depth t in the search tree is defined to be excess(b, t) = width(V̂(b)) − ε γ^(-t). Excess uncertainty has the property that if all the children of a node b have negative excess uncertainty, then after an update b will also have negative excess uncertainty; negative excess uncertainty at the root implies the desired convergence to within ε. The heuristic is designed to focus attention on the child node with the greatest contribution to the excess uncertainty at the parent: o* = argmax_o [ Pr(o | b, a*) · excess(τ(b, a*, o), t+1) ]
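Both heuristics are simple to state in code; here is an illustrative sketch (again my own, assuming a bounds object that exposes upper-bound Q-values, interval widths, the belief update τ and observation probabilities):

def select_action(b, bounds, actions):
    # IE-MAX heuristic: act greedily with respect to the upper-bound Q-values
    return max(actions, key=lambda a: bounds.q_upper(b, a))

def select_observation(b, a_star, bounds, observations, epsilon, gamma, depth):
    # Weighted excess uncertainty: pick the child contributing most to the parent's excess
    def weighted_excess(o):
        child = bounds.tau(b, a_star, o)
        excess = bounds.width(child) - epsilon * gamma ** (-(depth + 1))
        return bounds.obs_prob(o, b, a_star) * excess   # Pr(o | b, a*) * excess(child, t+1)
    return max(observations, key=weighted_excess)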

23 HSVI Example run on the Tiger Problem

24 HSVI Convergence to V* It can be proved that if the upper bound V_U and the lower bound V_L are uniformly improvable (as the bounds presented here are), they converge to the true value function V*: V_L^0(b) ≤ V_L^1(b) ≤ ... ≤ V_L(b) ≤ V*(b) ≤ V_U(b) ≤ ... ≤ V_U^1(b) ≤ V_U^0(b)

25 HSVI Example run on the Tiger Problem

26 HSVI Resulting policy graph The five alpha vectors from the previous graph result in this policy graph. Note that this policy is for the starting belief b_0 = [0.5, 0.5].

27 HSVI Resulting policy graph We see that the policy graph generated by the alpha vectors given by HSVI for the tiger problem, for the starting belief b_0 = [0.5, 0.5] with a discount factor of 0.95, is a subset of the policy graph computed by exact methods.

28 HSVI Some Results

29 References

[Hauskrecht, 1997] Hauskrecht, M. (1997). Incremental methods for computing bounds in partially observable Markov decision processes. In Proc. of AAAI.
[Hauskrecht, 2000] Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33-94.
[Pineau et al., 2003] Pineau, J., Gordon, G., and Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proc. of IJCAI.
[Smith, 2007] Smith, T. (2007). Probabilistic Planning for Robot Exploration. PhD thesis, Carnegie Mellon University.
[Smith and Simmons, 2004] Smith, T. and Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proc. of UAI.
[Smith and Simmons, 2005] Smith, T. and Simmons, R. (2005). Point-based POMDP algorithms: Improved analysis and implementation. In Proc. of UAI.
