Discretized Approximations for POMDP with Average Cost


Huizhen Yu
Lab for Information and Decisions, EECS Dept., MIT, Cambridge, MA 02139

Dimitri P. Bertsekas
Lab for Information and Decisions, EECS Dept., MIT, Cambridge, MA 02139

Abstract

In this paper, we propose a new lower approximation scheme for POMDP with discounted and average cost criteria. The approximating functions are determined by their values at a finite number of belief points, and can be computed efficiently using value iteration algorithms for finite-state MDP. While for discounted problems several lower approximation schemes have been proposed earlier, ours seems the first of its kind for average cost problems. We focus primarily on the average cost case, and we show that the corresponding approximation can be computed efficiently using multi-chain algorithms for finite-state MDP. We give a preliminary analysis showing that regardless of the existence of the optimal average cost $J^*$ in the POMDP, the approximation obtained is a lower bound of the liminf optimal average cost function, and can also be used to calculate an upper bound on the limsup optimal average cost function, as well as bounds on the cost of executing the stationary policy associated with the approximation. We show the convergence of the cost approximation, when the optimal average cost is constant and the optimal differential cost is continuous.

1 INTRODUCTION

We consider discrete-time infinite horizon partially observable Markov Decision Processes (POMDP) with the state space S, the observation space Z and the control space U all being finite. Let X be the set of probability distributions on S, called belief space, and $g_u(s)$ be the per stage cost function. With the average cost criterion, we minimize over the policies $\pi$ the average expected cost $\frac{1}{N} E^{\pi}\{\sum_{t=0}^{N-1} g_{u_t}(s_t) \mid s_0 \sim x\}$, as N goes to infinity, when the initial state $s_0$ follows the distribution x.

POMDPs with average cost criterion are substantially more difficult to analyze than with discounted cost.
Although there are optimality equations whose solution provides the optimal average cost function and a stationary optimal policy, in general there is no guarantee that a solution exists, and there are no finite computation algorithms to obtain it. Therefore, discretized approximations are computationally appealing as approximate solutions for average cost POMDP, since the problem of finite-state MDPs with average cost is well understood and can be solved with several commonly used algorithms.

We note that a discretization scheme for discounted POMDP that gives a lower approximation was first proposed by (Lovejoy, 1991). It was later improved by (Zhou and Hansen, 2001). There have been no proposals of discretization schemes for average cost POMDP, to our knowledge.

A conceptually different alternative to solve approximately average cost POMDP is the finite memory approach (Aberdeen and Baxter, 2002). In this approach, one seeks a policy that is average cost optimal within a class of finite state controllers. The advantage of the finite memory approach is that a suboptimal policy can be learned in a model-free fashion, i.e., with a simulator rather than an explicit transition probability model of the system. By contrast the discretization approaches of Lovejoy, and Zhou and Hansen, as well as ours, require an exact mechanism for generating beliefs/conditional state distributions, as the system is operating.

We have recently become aware of the related work by (Ormoneit and Glynn, 2002) on MDP with continuous state space and average cost. Our POMDP scheme can be viewed as a special case of their general approximation scheme. However, the lower approximation property is special to POMDP, and the corresponding asymptotic convergence results are also different in the two works.

The starting point for our discretization methodology is the discounted problem, for which we introduce a new lower approximation scheme, based on a fictitious optimistic controller that receives extra information about the hidden states. The cost of this controller, a lower bound to the optimal cost, can be calculated using finite-state MDP methods, and can be used as an approximate cost-to-go function for a one-step lookahead scheme.

We extend our approach to the average cost criterion, where the discretized problem can be solved by multi-chain algorithms for finite-state MDP. We show that the corresponding approximate cost is a lower bound to the optimal liminf average cost function, and can be used to obtain an upper bound to the optimal limsup average cost function, as well as bounds on the cost of the stationary policy associated with the approximation. We show asymptotic convergence of the cost approximation of the discretization scheme, assuming that the optimal average cost is constant and the optimal differential cost is continuous.

The paper is organized as follows. In Section 2, we consider discretized approximations in the discounted case, and introduce a new approximation scheme. We prove asymptotic convergence for two main discretization schemes. In Section 3, we extend discretized approximations to the average cost case, and give an analysis of error bounds and asymptotic convergence. Finally in Section 4, we present experimental results.

Due to space limitations, some of the proofs have been omitted. They can be found in an expanded version of this paper (Yu and Bertsekas, 2004), which also addresses some additional topics, including a general framework for deriving upper and lower approximation schemes for POMDP.

2 DISCOUNTED CASE

We introduce a new approximation scheme and summarize known discretized lower approximation schemes for the discounted case. The belief MDPs associated with them will be the basis for the lower approximation schemes in the average cost case.
The results obtained here will also be useful there.

In the discounted case, we minimize the discounted cost $E^{\pi}\{\sum_{t=0}^{\infty} \alpha^t g_{u_t}(s_t) \mid s_0 \sim x\}$ for a fixed $\alpha \in [0, 1)$. The optimal cost function $J^*_\alpha(x)$ satisfies the Bellman equation

$$J^*_\alpha(x) = (TJ^*_\alpha)(x), \qquad (TJ)(x) = \min_{u \in U} \Big[ x' g_u + \alpha E_{z}\big\{ J\big(\phi_u(x, z)\big) \big\} \Big],$$

where $'$ denotes transpose, $g_u$ denotes the per stage cost vector, and $\phi_u(x, z)$ denotes the conditional distribution of $s_1$ after applying control u and observing z.

A few notations for expectations will be used throughout the text. Where emphasis on the distribution is necessary, we use the symbol $E_{z|x,u}\{\cdots\}$, which should be read as $\sum_z p(z \mid x, u) \cdots$, and is equivalent to the conditional expectation $E_z\{\cdots \mid x, u\}$.

2.1 A NEW INEQUALITY

The optimal cost $J^*_\alpha(\cdot)$ is concave, i.e., for any convex combination $\bar x = \sum_i \gamma_i(\bar x) x_i$, with $\gamma_i(\bar x) \ge 0$ and $\sum_i \gamma_i(\bar x) = 1$,

$$J^*_\alpha(\bar x) \ge \sum_i \gamma_i(\bar x) J^*_\alpha(x_i).$$

Using this property with $\bar x = \phi_u(x, z)$ in the Bellman equation, we have the following inequality that was proposed by (Zhou and Hansen, 2001) for a discretized cost approximation:

$$J^*_\alpha(x) \ge \min_{u} \Big[ x' g_u + \alpha E_{z|x,u}\Big\{ \sum_i \gamma_i\big(\phi_u(x, z)\big) J^*_\alpha(x_i) \Big\} \Big]. \tag{1}$$

We introduce a new inequality, which follows from concavity of $E_{z|x,u}\{J^*_\alpha(\phi_u(x, z))\}$ in x.

Proposition 1. For all $x \in X$, $x_i \in X$ and $\gamma_i(x) \ge 0$ such that $x = \sum_i \gamma_i(x) x_i$, $\sum_i \gamma_i(x) = 1$, the optimal cost $J^*_\alpha(x)$ satisfies

$$J^*_\alpha(x) \ge \min_{u} \Big[ x' g_u + \alpha \sum_i \gamma_i(x) E_{z|x_i,u}\big\{ J^*_\alpha\big(\phi_u(x_i, z)\big) \big\} \Big]. \tag{2}$$

We present here, however, an alternative proof, which uses the interpretation of a modified process in which there is additional information about the randomness of the initial distribution. This argument has the same spirit as region-observable-POMDP (Zhang and Liu, 1997), and can be generalized (Yu and Bertsekas, 2004). Since Prop. 1 implies concavity of $J^*_\alpha(\cdot)$ (see footnote 1), which is not used in the proof, it can also be used to establish concavity of $J^*_\alpha(\cdot)$ without an induction argument.
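The belief update $\phi_u(x, z)$, the observation probability $p(z \mid x, u)$, and the Bellman backup $T$ defined above can be sketched numerically as follows. This is only a minimal sketch: the two-state model, the array names `P`, `O`, `g`, and the helper names `belief_update` and `bellman_backup` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-control, 2-observation POMDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[u, s, s'] = p(s' | s, u)
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],   # O[u, s', z] = p(z | s', u)
              [[0.5, 0.5], [0.5, 0.5]]])
g = np.array([[1.0, 2.0],                 # g[u, s] = per-stage cost g_u(s)
              [3.0, 0.0]])


def belief_update(x, u, z):
    """Return (phi_u(x, z), p(z | x, u)) for belief x, control u, observation z."""
    unnormalized = (x @ P[u]) * O[u][:, z]
    p_z = unnormalized.sum()
    if p_z == 0.0:
        return None, 0.0          # observation z cannot occur from (x, u)
    return unnormalized / p_z, p_z


def bellman_backup(J, x, alpha):
    """(TJ)(x) = min_u [ x'g_u + alpha * sum_z p(z|x,u) * J(phi_u(x,z)) ]."""
    values = []
    for u in range(P.shape[0]):
        total = x @ g[u]
        for z in range(O.shape[2]):
            phi, p_z = belief_update(x, u, z)
            if p_z > 0.0:
                total += alpha * p_z * J(phi)
        values.append(total)
    return min(values)


x = np.array([0.5, 0.5])
phi, p_z = belief_update(x, 0, 0)
one_step = bellman_backup(lambda belief: 0.0, np.array([1.0, 0.0]), 0.9)
```

Here `J` is passed as an arbitrary callable on beliefs, so the same backup can be applied to an interpolated approximation as well as to a trial function.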
Proof: Consider a new process $\tilde P$, otherwise identical to the original POMDP, except that the initial distribution of $s_0$ is generated by a mixture of m distributions $x_i$ marginally identical to x. By this we mean that there is a random variable q taking values from 1 to m with

$$p(q = k \mid x) = \gamma_k(x), \qquad p(s_0 \mid q = k) = x_k(s_0).$$

Assume q is not accessible to the controller. The optimal cost for this new process equals $J^*_\alpha(x)$, and is achieved by the policy $\pi^*$ that is optimal in the original POMDP. Denote its action at x by a. We have

$$J^*_\alpha(x) = x' g_a + E\Big\{ E^{\pi^*}\Big\{ \sum_{t=1}^{\infty} \alpha^t g(s_t, u_t) \,\Big|\, x, a, z, q \Big\} \,\Big|\, x, a \Big\}.$$

Let $\tilde\phi_a(x, (z, q))$ be the distribution $p(s_1 \mid x, a, (z, q))$. As q and hence $\tilde\phi_a(x, (z, q))$ are inaccessible to $\pi^*$, by the optimality of $J^*_\alpha(\cdot)$, we have that in the last equation

$$E^{\pi^*}\Big\{ \sum_{t=1}^{\infty} \alpha^t g(s_t, u_t) \,\Big|\, x, a, z, q \Big\} \ge \alpha J^*_\alpha\big( \tilde\phi_a(x, (z, q)) \big).$$

Since $\tilde\phi_a(x, (z, q)) = \phi_a(x_i, z)$ given $q = i$, it follows that

$$J^*_\alpha(x) \ge x' g_a + \alpha E_{(z,q)|x,a}\big\{ J^*_\alpha\big( \tilde\phi_a(x, (z, q)) \big) \big\} = x' g_a + \alpha \sum_i \gamma_i(x) E_{z|x_i,a}\big\{ J^*_\alpha\big( \phi_a(x_i, z) \big) \big\} \ge \min_u \Big[ x' g_u + \alpha \sum_i \gamma_i(x) E_{z|x_i,u}\big\{ J^*_\alpha\big( \phi_u(x_i, z) \big) \big\} \Big].$$

Footnote 1 (on Prop. 1 implying concavity of $J^*_\alpha$): This is so because $x' g_u = \sum_i \gamma_i(x) x_i' g_u$ and $\min \sum \ge \sum \min$.

2.2 DISCRETIZED APPROXIMATIONS

We first summarize known lower approximation schemes, and then prove asymptotic convergence for two main schemes corresponding to the inequalities (1) and (2).

2.2.1 Approximation Schemes

Let $G = \{x_i\}$ be a finite set of beliefs such that their convex hull is X. A simple choice is to discretize X into a regular grid, so we refer to $x_i$ as grid points. By choosing different $x_i$ and $\gamma_i(\cdot)$ in the inequalities (1) and (2), we obtain lower cost approximations that are functionally determined by their values at a finite number of beliefs.

Definition 1 ($\epsilon$-Discretization Scheme). Call $(G, \gamma)$ an $\epsilon$-discretization scheme where $G = \{x_i\}$ is a set of n beliefs, $\gamma = (\gamma_1(\cdot), \ldots, \gamma_n(\cdot))$ is a convex representation scheme such that $x = \sum_i \gamma_i(x) x_i$ for all $x \in X$, and $\epsilon$ is a scalar characterizing the fineness of the discretization, and defined by

$$\epsilon = \max_{x \in X} \ \max_{x_i \in G,\ \gamma_i(x) > 0} \| x - x_i \|.$$

Given $(G, \gamma)$, let $\tilde T_{D_i}$, $i = 1, 2$, be the associated mappings corresponding to the right-hand sides of inequalities (1) and (2), respectively:

$$(\tilde T_{D_1} J)(x) = \min_u \Big[ x' g_u + \alpha \sum_i E_{z|x,u}\big\{ \gamma_i\big( \phi_u(x, z) \big) \big\} J(x_i) \Big], \tag{3}$$

$$(\tilde T_{D_2} J)(x) = \min_u \Big[ x' g_u + \alpha \sum_i \gamma_i(x) E_{z|x_i,u}\big\{ J\big( \phi_u(x_i, z) \big) \big\} \Big]. \tag{4}$$

Associated with these mappings are their unique belief MDPs on the continuous belief space X, which we will refer to as the modified belief MDPs.
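To make Definition 1 and the mapping in Eq. (3) concrete, the following sketch solves the modified belief MDP of $\tilde T_{D_1}$ for a two-state problem, where a belief is described by the single number $p = x(0)$ and the convex representation $\gamma$ is linear interpolation between neighbouring grid points. The model arrays and all function names are hypothetical, the regular grid and the interpolation rule are assumptions chosen for illustration, and plain value iteration stands in for whichever finite-state solver one prefers.

```python
import numpy as np

# Hypothetical 2-state, 2-control, 2-observation POMDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[u, s, s'] = p(s' | s, u)
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],   # O[u, s', z] = p(z | s', u)
              [[0.5, 0.5], [0.5, 0.5]]])
g = np.array([[1.0, 2.0], [3.0, 0.0]])    # g[u, s]
ALPHA = 0.9


def interp_weights(x, grid):
    """gamma(x): linear interpolation weights on a sorted grid containing 0 and 1."""
    p = min(max(x[0], 0.0), 1.0)
    w = np.zeros(len(grid))
    j = int(np.searchsorted(grid, p))     # first index with grid[j] >= p
    if grid[j] == p:
        w[j] = 1.0
    else:
        t = (p - grid[j - 1]) / (grid[j] - grid[j - 1])
        w[j - 1], w[j] = 1.0 - t, t
    return w


def modified_mdp(grid):
    """Finite-state MDP on G: costs c[i, u] and transition weights W[i, u, j]."""
    n, n_u, n_z = len(grid), P.shape[0], O.shape[2]
    c, W = np.zeros((n, n_u)), np.zeros((n, n_u, n))
    for i, p in enumerate(grid):
        x = np.array([p, 1.0 - p])
        for u in range(n_u):
            c[i, u] = x @ g[u]
            for z in range(n_z):
                unnorm = (x @ P[u]) * O[u][:, z]
                p_z = unnorm.sum()
                if p_z > 0.0:
                    W[i, u] += p_z * interp_weights(unnorm / p_z, grid)
    return c, W


def solve_T_D1(grid, alpha=ALPHA, iters=2000):
    """Value iteration for the fixed point of Eq. (3) restricted to the grid."""
    c, W = modified_mdp(grid)
    J = np.zeros(len(grid))
    for _ in range(iters):
        J = (c + alpha * (W @ J)).min(axis=1)
    return J, c, W


grid = np.linspace(0.0, 1.0, 11)
J_grid, c, W = solve_T_D1(grid)

# Upper bound for comparison: cost of always applying one fixed control.
blind = np.array([np.linalg.solve(np.eye(2) - ALPHA * P[u], g[u]) for u in range(2)])
upper = np.array([min(p * b[0] + (1 - p) * b[1] for b in blind) for p in grid])
```

Since the fixed point of $\tilde T_{D_1}$ is a lower bound on $J^*_\alpha$, and any fixed-control policy gives an upper bound, the two arrays `J_grid` and `upper` bracket the optimal discounted cost at each grid point.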
The optimal cost functions $\tilde J_i$ in these modified belief MDPs satisfy, respectively,

$$(\tilde T_{D_i} \tilde J_i)(x) = \tilde J_i(x) \le J^*_\alpha(x), \qquad \forall x \in X,\ i = 1, 2.$$

Both $\tilde J_i$ are functionally determined by their values at a finite number of beliefs, which will be called supporting points, and whose set is denoted by C. In particular, the function $\tilde J_1$ can be computed by solving a corresponding finite-state MDP on $C = G = \{x_i\}$, and the function $\tilde J_2$ can be computed by solving a corresponding finite-state MDP on $C = \{\phi_u(x_i, z) \mid x_i \in G, u \in U, z \in Z\}$ (see footnote 2). The computation can thus be done efficiently by variants of value iteration methods, or linear programming.

Footnote 2: More precisely, $C = \{\phi_u(x_i, z) \mid x_i \in G, u \in U, z \in Z, \text{ such that } p(z \mid x_i, u) > 0\}$.

Usually X is partitioned into convex regions and beliefs in a region are represented as the convex combinations of its vertices. The function $\tilde J_1$ is then piecewise linear on each region, and the function $\tilde J_2$ is piecewise linear and concave on each region. To see the latter, let $q(x_i, u) = E_{z|x_i,u}\{ \tilde J_2( \phi_u(x_i, z) ) \}$, and we have

$$\tilde J_2(x) = \min_u \Big[ x' g_u + \alpha \sum_i \gamma_i(x) q(x_i, u) \Big].$$

The simplest case for both mappings is when G consists of vertices of the belief simplex, i.e., $G = \{e_s \mid s \in S\}$, with $e_s(s) = 1$ and $e_s(s') = 0$ for all $s' \ne s$, $s, s' \in S$. Denote the corresponding mappings by $\tilde T_{D_i^0}$, $i = 1, 2$, respectively, i.e.,

$$(\tilde T_{D_1^0} J)(x) = \min_u \Big[ x' g_u + \alpha \sum_s p(s \mid x, u) J(e_s) \Big], \tag{5}$$

$$(\tilde T_{D_2^0} J)(x) = \min_u \Big[ x' g_u + \alpha \sum_s x(s) E_{z|s,u}\big\{ J\big( \phi_u(e_s, z) \big) \big\} \Big]. \tag{6}$$

The mapping $\tilde T_{D_1^0}$ is the QMDP approximation, suggested by (Littman, Cassandra, and Kaelbling, 1995), who have shown good results for certain applications. In the belief MDP associated with $\tilde T_{D_1^0}$, the states will be observable after the initial step. In the belief MDP associated with $\tilde T_{D_2^0}$, the previous state will be revealed at each stage. One can show that $\tilde T_{D_2^0}$ gives a better approximation than $\tilde T_{D_1^0}$ in both discounted and average cost cases.

For the comparison of $\tilde T_{D_i}$ in general, by concavity of $J^*_\alpha$, one can relax the inequality $J^*_\alpha \ge \tilde T_{D_2} J^*_\alpha$ to obtain an inequality of the same form as the inequality $J^*_\alpha \ge \tilde T_{D_1} J^*_\alpha$. See (Yu and Bertsekas, 2004) for these details.

By concatenating mappings we obtain other discretized lower approximations. For example,

$$T \circ \tilde T_{D_i},\ i = 1, 2; \qquad \tilde T_I \circ \tilde T_{D_2}, \tag{7}$$

where $\tilde T_I$ denotes a region-observable-POMDP type of mapping (Zhang and Liu, 1997). In the concatenated mapping $(\tilde T_I \circ \tilde T_{D_2})$ we only need grid points to be on lower dimensional spaces.

Let $\tilde T$ be any of the above mappings. Its associated modified belief MDP is not necessarily a POMDP model. It is straightforward to show the following (see footnote 3), by comparing the N-stage optimal cost of the modified MDP to that of the original POMDP. This result also holds for $\alpha = 1$.

Proposition 2. Let $J_0$ be a concave function on X. For any $\alpha \in [0, 1]$,

$$(\tilde T^N J_0)(x) \le (T^N J_0)(x), \qquad \forall x \in X,\ \forall N.$$

Footnote 3: Use induction and concavity, or alternatively an argument similar to the proof of Prop. 1.

2.2.2 Asymptotic Convergence

We will now provide a limiting theorem for $\tilde T_{D_1}$ and $\tilde T_{D_2}$ using the uniform continuity property of $J^*_\alpha(\cdot)$. We first give some conventional notations related to policies, to be used throughout the paper. Let $\mu$ be a stationary policy, and $J_\mu$ be its cost. We define the mapping $T_\mu$ by

$$(T_\mu J)(x) = x' g_{\mu(x)} + \alpha E_{z|x,\mu(x)}\big\{ J\big( \phi_{\mu(x)}(x, z) \big) \big\},$$

and similarly for any control u, we define $T_u$ to be the mapping that has the same single control u in place of $\mu(x)$ in $T_\mu$. Let $\tilde T$ be either $\tilde T_{D_1}$ or $\tilde T_{D_2}$, and similarly let $\tilde T_\mu$ and $\tilde T_u$ correspond to a policy $\mu$ and control u, respectively.

The function $J^*_\alpha(x)$ is continuous on X. For any continuous function $v(\cdot)$, $E_{z|x,u}\{ v( \phi_u(x, z) ) \}$ is also continuous on X. As X is compact, by the uniform continuity of corresponding functions, we have the following lemma.

Lemma 1. Let $v(\cdot)$ be a continuous function on X. For any $\delta > 0$, there exists $\bar\epsilon > 0$ such that for any $\epsilon$-discretization scheme $(G, \gamma)$ with $\epsilon \le \bar\epsilon$,

$$| (T_u v)(x) - (\tilde T_u v)(x) | \le \delta, \qquad \forall x \in X,\ \forall u \in U,$$
where $\tilde T_u$ is either $\tilde T_{D_1}$ or $\tilde T_{D_2}$ associated with $(G, \gamma)$.

By Lemma 1, and the standard error bounds $\| J^* - J \| \le \frac{\| TJ - J \|}{1 - \alpha}$ and $\| J_\mu - J \| \le \frac{\| T_\mu J - J \|}{1 - \alpha}$ (see e.g., (Bertsekas, 2001)), we have the following limiting theorem, which states that the lower approximation and the cost of its look-ahead policy, as well as the cost of the optimal policy with respect to the modified belief MDP, all converge to the optimal cost of the original POMDP.

Theorem 1. Let $(G_k, \gamma_k)$ be a sequence of $\epsilon_k$-discretization schemes with $\epsilon_k \to 0$ as $k \to \infty$. Let $\tilde J_k$, $\mu_k$ and $\tilde\mu_k$ be such that

$$\tilde J_k = \tilde T_k \tilde J_k = \tilde T_{k, \tilde\mu_k} \tilde J_k, \qquad T_{\mu_k} \tilde J_k = T \tilde J_k,$$

where $\tilde T_k$ is either $\tilde T_{D_1}$ or $\tilde T_{D_2}$ associated with $(G_k, \gamma_k)$. Then for any fixed $\alpha \in [0, 1)$,

$$\tilde J_k \to J^*_\alpha, \qquad J_{\mu_k} \to J^*_\alpha, \qquad J_{\tilde\mu_k} \to J^*_\alpha, \qquad \text{as } k \to \infty.$$

3 DISCRETIZED APPROXIMATIONS FOR AVERAGE COST CRITERION

In average cost POMDP, the objective is to minimize the average cost $\frac{1}{N} E^{\pi}\{ \sum_{t=0}^{N-1} g(s_t, u_t) \mid s_0 \sim x_0 \}$, as N goes to infinity. For POMDP with average cost, in order that a stationary optimal policy exists, it is sufficient that the following functional equations, in the belief MDP notation,

$$J(x) = \min_u E_{\tilde x|x,u}\{ J(\tilde x) \},$$
$$J(x) + h(x) = \min_{u \in U(x)} \Big[ x' g_u + E_{\tilde x|x,u}\{ h(\tilde x) \} \Big], \qquad U(x) = \arg\min_u E_{\tilde x|x,u}\{ J(\tilde x) \}, \tag{8}$$

admit a bounded solution $(J^*(\cdot), h^*(\cdot))$. The stationary policy that attains the minimum is then optimal with its average cost being $J^*(x)$. However, there are no finite computation algorithms to obtain it. (For a general analysis of POMDP with average cost, see (Fernández-Gaucherand, Arapostathis, and Marcus, 1991) or the survey by (Arapostathis et al., 1993).)

We now extend the application of the discretized approximations to the average cost case. First, note that solving the corresponding average cost problem in the discretized approach is much easier. Let $\tilde T$ be any of the mappings from Eqs. (3)-(7). For its associated modified belief MDP, writing $\bar g_u(x)$ for the cost per stage, we have the following average cost optimality equations:

$$J(x) = \min_u \tilde E_{\tilde x|x,u}\{ J(\tilde x) \},$$
$$J(x) + h(x) = \min_{u \in U(x)} \Big[ \bar g_u(x) + \tilde E_{\tilde x|x,u}\{ h(\tilde x) \} \Big], \qquad U(x) = \arg\min_u \tilde E_{\tilde x|x,u}\{ J(\tilde x) \}, \tag{9}$$

where we use $\tilde E$ to indicate that the expectation is taken with respect to the distributions $\tilde p(\tilde x \mid x, u)$ of the modified MDP, which satisfy

$$\tilde p(\tilde x \mid x, u) = 0, \qquad \forall (x, u),\ \forall \tilde x \notin C,$$

with C being the finite set of supporting beliefs.

There are bounded solutions $(\tilde J(\cdot), \tilde h(\cdot))$ to the optimality equations (9) for the following reason: every finite-state MDP problem admits a solution to its average cost optimality equations. Furthermore if $x \notin C$, x is transient and unreachable from C, and the next belief $\tilde x$ belongs to C under any control u in the modified MDP. It follows that the optimality equations (9) restricted on $\{x\} \cup C$ are the optimality equations for the finite-state MDP with $|C| + 1$ states, so the solution $(\tilde J(\tilde x), \tilde h(\tilde x))$ exists for $\tilde x \in \{x\} \cup C$ with their values on C independent of x. This is essentially the algorithm to solve $\tilde J(\cdot)$ and $\tilde h(\cdot)$ in two stages, and obtain an optimal stationary policy for the modified MDP.

Concerns arise, however, about using any optimal policy for the modified MDP as suboptimal control in the original POMDP. Although all average cost optimal policies behave equally optimally in the asymptotic sense, they do so in the modified MDP, in which all the states $x \notin C$ are transient. As an illustration, suppose that for the completely observable MDP the optimal average cost is constant over all states; then at any belief $x \notin C$ any control will have the same asymptotic average cost in the modified MDP corresponding to the QMDP approximation scheme. The situation worsens if even the completely observable MDP itself has a large number of states that are transient under its optimal policies. We therefore speculate that for the modified MDP, we should aim to compute policies with additional optimality guarantees, relating to their finite-stage behaviors. Fortunately for finite-state MDPs, there are efficient algorithms for computing such policies.
In the following we present the algorithm, after a brief review of the related results for finite-state MDP, and give preliminary analysis of error bounds and asymptotic convergence. We show sufficient conditions for the convergence of cost approximation, assuming that the optimal average cost of the POMDP is constant.

3.1 ALGORITHM

We first briefly review related results for finite-state MDPs. Since average cost measures the asymptotic behavior of a policy, given two policies having the same average cost, one can incur significantly larger cost in finite steps than the other. The concept of n-discount optimality is useful for differentiating between such policies. It is also closely related to Blackwell optimality. A policy $\pi^*$ is n-discount optimal if its cost in the discounted cases satisfies

$$\limsup_{\alpha \to 1} (1 - \alpha)^{-n} \big( J^{\pi^*}_\alpha(s) - J^{\pi}_\alpha(s) \big) \le 0, \qquad \forall s,\ \forall \pi.$$

By definition an $(n+1)$-discount optimal policy is also k-discount optimal for $k = -1, 0, \ldots, n$. A policy is called Blackwell optimal if it is optimal for all the discounted problems with discount factor $\alpha \in [\bar\alpha, 1)$ for some $\bar\alpha < 1$. For finite-state MDPs, a policy is Blackwell optimal if and only if it is $\infty$-discount optimal. By contrast, any $(-1)$-discount optimal policy is average cost optimal.

For any finite-state MDP, there exist stationary average cost optimal policies and furthermore, stationary n-discount optimal and Blackwell optimal policies. In particular, there exist functions $J(\cdot)$, $h(\cdot)$ and $w_k(\cdot)$, $k = 0, \ldots, n+1$, with $w_0 = h$ such that they satisfy the following nested equations:

$$J(s) = \min_{u \in U(s)} E_{\tilde s|s,u}\{ J(\tilde s) \},$$
$$J(s) + h(s) = \min_{u \in U_{-1}(s)} \Big[ g_u(s) + E_{\tilde s|s,u}\{ h(\tilde s) \} \Big], \tag{10}$$
$$w_{k-1}(s) + w_k(s) = \min_{u \in U_{k-1}(s)} E_{\tilde s|s,u}\{ w_k(\tilde s) \},$$

where

$$U_{-1}(s) = \arg\min_{u \in U(s)} E_{\tilde s|s,u}\{ J(\tilde s) \},$$
$$U_0(s) = \arg\min_{u \in U_{-1}(s)} \Big[ g_u(s) + E_{\tilde s|s,u}\{ h(\tilde s) \} \Big],$$
$$U_k(s) = \arg\min_{u \in U_{k-1}(s)} E_{\tilde s|s,u}\{ w_k(\tilde s) \}.$$

Any stationary policy that attains the minimum in the right-hand sides of the equations in (10) is an n-discount optimal policy.
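The remark that two policies with the same average cost can differ markedly over finite horizons can be illustrated by computing the gain (average cost) and bias (differential cost) of a fixed stationary policy. The sketch below assumes a unichain transition matrix and uses the standard identity bias $= (I - P + P^*)^{-1}(g - \text{gain}\cdot\mathbf{1})$ with $P^* = \mathbf{1}\pi'$; the function name `gain_and_bias` and the two-state example are hypothetical.

```python
import numpy as np


def gain_and_bias(P_mu, g_mu):
    """Gain and bias of a stationary policy with a unichain transition matrix.

    P_mu[s, s'] is the transition matrix under the policy, g_mu[s] its stage cost.
    """
    n = len(g_mu)
    # Stationary distribution: pi'(I - P) = 0 and sum(pi) = 1, via least squares.
    A = np.vstack([(np.eye(n) - P_mu).T, np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    gain = pi @ g_mu
    P_star = np.outer(np.ones(n), pi)          # limiting matrix 1 pi'
    bias = np.linalg.solve(np.eye(n) - P_mu + P_star, g_mu - gain)
    return gain, bias


# State 1 is absorbing with zero cost; from state 0 either control moves to state 1,
# paying 1 under policy "a" and 5 under policy "b".
P_common = np.array([[0.0, 1.0], [0.0, 1.0]])
gain_a, bias_a = gain_and_bias(P_common, np.array([1.0, 0.0]))
gain_b, bias_b = gain_and_bias(P_common, np.array([5.0, 0.0]))
```

Both policies have the same gain, so the average cost criterion alone cannot separate them, while the bias at state 0 differs. This is the kind of distinction that the higher-order terms in the nested equations are meant to capture.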
For finite-state MDPs, a stationary n-discount optimal policy not only exists, but can also be efficiently computed by multi-chain algorithms. Furthermore, in order to obtain a Blackwell optimal policy, which is $\infty$-discount optimal, it is sufficient to compute an $(N-2)$-discount optimal policy, where N is the number of states in the finite-state MDP. We refer readers to (Puterman, 1994) Chapter 10, especially Section 10.3, for details of the algorithm as well as theoretical analysis.
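As a small numerical illustration of the link between discounted and average cost in a finite-state MDP, the sketch below runs discounted policy iteration for discount factors approaching 1 and records the scaled cost $(1-\alpha)J_\alpha$ and the optimal policy. It does not implement the multi-chain algorithm of Puterman; the two-state MDP, the array names and the function `discounted_policy_iteration` are hypothetical, and the example is unichain so that the optimal average cost is a single constant (here 1/6, obtained from the stationary distribution of control 0).

```python
import numpy as np

# Hypothetical 2-state, 2-control MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.5, 0.5]],   # P[u, s, s']
              [[0.1, 0.9], [0.1, 0.9]]])
g = np.array([[0.0, 1.0],                 # g[u, s]
              [2.0, 2.0]])


def discounted_policy_iteration(P, g, alpha, max_iters=100):
    """Exact policy iteration for the alpha-discounted problem."""
    n_s = P.shape[1]
    states = np.arange(n_s)
    policy = np.zeros(n_s, dtype=int)
    for _ in range(max_iters):
        P_mu = P[policy, states]          # row s is P[policy[s], s, :]
        g_mu = g[policy, states]
        J = np.linalg.solve(np.eye(n_s) - alpha * P_mu, g_mu)
        Q = g + alpha * (P @ J)           # Q[u, s]
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return J, policy


scaled_costs = {}
policies = {}
for alpha in (0.9, 0.99, 0.999):
    J_alpha, mu_alpha = discounted_policy_iteration(P, g, alpha)
    scaled_costs[alpha] = (1.0 - alpha) * J_alpha
    policies[alpha] = mu_alpha
```

In this example the scaled costs approach the average cost of the policy that always applies control 0, and the optimal discounted policy is the same for the largest discount factors, which is the behavior that Blackwell optimality formalizes.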

This leads to the following algorithm for computing an n-discount optimal policy for the modified MDP defined on the continuous belief space. We first solve the average cost problem on C, then determine optimal controls on transient states $x \notin C$. Note there are no conditions (such as unichain) at all on this modified belief MDP.

The algorithm solving the modified MDP:

1. Compute an n-discount optimal solution for the finite-state MDP problem associated with C. Let $\tilde J^*(x_i)$, $\tilde h(x_i)$, and $\tilde w_k(x_i)$, $k = 1, \ldots, n+1$, with $x_i \in C$, be the corresponding functions obtained that satisfy Eq. (10) on C.

2. For any belief x, let the control set $U_{n+1}$ be computed at the last step of the sequence of optimizations:

$$U_{-1} = \arg\min_u \tilde E_{x_i|x,u}\{ \tilde J^*(x_i) \},$$
$$U_0 = \arg\min_{u \in U_{-1}} \Big[ \bar g_u(x) + \tilde E_{x_i|x,u}\{ \tilde h(x_i) \} \Big],$$
$$U_k = \arg\min_{u \in U_{k-1}} \tilde E_{x_i|x,u}\{ \tilde w_k(x_i) \}, \qquad 1 \le k \le n+1.$$

Let u be any control in $U_{n+1}$, and let $\tilde\mu^*(x) = u$. Also if $x \notin C$, define

$$\tilde J^*(x) = \tilde E_{x_i|x,u}\{ \tilde J^*(x_i) \}, \qquad \tilde h(x) = \bar g_u(x) + \tilde E_{x_i|x,u}\{ \tilde h(x_i) \} - \tilde J^*(x).$$

With the above algorithm we obtain an $(n-1)$-discount optimal policy for the modified MDP. When $n = |C|$, we obtain an $\infty$-discount optimal policy for the modified MDP (see footnote 4), since the algorithm essentially computes a Blackwell optimal policy for every finite-state MDP restricted on $\{x\} \cup C$, for all x. Thus, for the modified MDP, for any other policy $\pi$, and any $x \in X$,

$$\limsup_{\alpha \to 1} (1 - \alpha)^{-n} \big( \tilde J^{\tilde\mu^*}_\alpha(x) - \tilde J^{\pi}_\alpha(x) \big) \le 0, \qquad \forall n \ge -1.$$

It is also straightforward to see that

$$\tilde J^*(x) = \lim_{\alpha \to 1} (1 - \alpha) \tilde J^*_\alpha(x), \qquad \forall x \in X, \tag{11}$$

where $\tilde J^*_\alpha(x)$ are the optimal discounted costs for the modified MDP, and the convergence is uniform over X, since $\tilde J^*_\alpha(x)$ and $\tilde J^*(x)$ are piecewise linear interpolations of the function values on a finite set of beliefs.

Footnote 4: Note that $\infty$-discount optimality and Blackwell optimality are equivalent for finite-state MDPs; however, they are not equivalent in the case of a continuous state space. In the modified MDP, although for each x there exists an $\alpha(x) \in (0, 1)$ such that $\tilde\mu^*(x)$ is optimal for all $\alpha$-discounted problems with $\alpha(x) \le \alpha < 1$, we may have $\sup_x \alpha(x) = 1$ due to the continuity of the belief space.
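Step 2 of the algorithm is a sequence of nested minimizations at a single belief x, which can be sketched as a lexicographic filter over the controls. In the sketch below, `W[u, i]` stands for the modified-MDP transition probability from x to the supporting point $x_i$ under control u, `g_bar[u]` for the stage cost at x, and `J`, `h`, `ws` for the functions already computed on C; all of these names, the tolerance `tol`, and the numbers in the example are hypothetical.

```python
import numpy as np


def control_at_belief(W, g_bar, J, h, ws=(), tol=1e-9):
    """Nested argmin of step 2; returns (control, J(x), h(x)) for the belief x.

    W[u, i] : probability of moving from x to supporting point x_i under control u
    g_bar[u]: per-stage cost at x under control u
    J, h    : average and differential cost values on the supporting points
    ws      : sequence of higher-order terms w_1, ..., w_{n+1} on the supporting points
    """
    criteria = [W @ J, g_bar + W @ h] + [W @ w for w in ws]
    candidates = np.arange(W.shape[0])
    for values in criteria:
        v = values[candidates]
        # Keep only the controls that attain the minimum of the current criterion.
        candidates = candidates[v <= v.min() + tol]
    u = int(candidates[0])
    J_x = float(W[u] @ J)
    h_x = float(g_bar[u] + W[u] @ h - J_x)
    return u, J_x, h_x


# Two supporting points, three controls (illustrative numbers only).
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
g_bar = np.array([2.0, 0.5, 1.0])
J = np.array([0.5, 0.5])
h = np.array([0.0, 1.0])
w1 = np.array([3.0, 1.0])
u_star, J_x, h_x = control_at_belief(W, g_bar, J, h, ws=(w1,))
```

In the example all three controls tie on the average cost criterion, the second criterion removes control 0, and the third one breaks the remaining tie. The tolerance is a practical choice for floating-point ties and is not part of the algorithm as stated.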
3.2 ANALYSIS OF ERROR BOUNDS

We now show how to bound the optimal average cost of the original POMDP, and how to bound the cost of executing, in the original POMDP, the suboptimal policy that is optimal for the modified MDP.

Let $V^{\pi}_N(x) = E^{\pi}\{ \sum_{t=0}^{N-1} \bar g_{u_t}(x_t) \mid x_0 = x \}$ be the N-stage cost of a non-randomized policy $\pi$, which can be non-stationary, in the original POMDP. Let

$$J^*_-(x) = \inf_\pi \liminf_{N \to \infty} \frac{1}{N} V^{\pi}_N(x), \qquad J^*_+(x) = \inf_\pi \limsup_{N \to \infty} \frac{1}{N} V^{\pi}_N(x).$$

It is straightforward to show (see footnote 5) that $\tilde J^*(x) \le J^*_+(x)$, $\forall x \in X$. We now show that $\tilde J^*(x) \le J^*_-(x)$, $\forall x \in X$.

Footnote 5: Since in the discounted case the corresponding lower approximation satisfies $\tilde J^*_\alpha(x) \le J^*_\alpha(x)$, by Eq. (11) and a Tauberian theorem, we have for the approximate average cost
$$\tilde J^*(x) = \lim_{\alpha \to 1} (1 - \alpha) \tilde J^*_\alpha(x) \le \liminf_{\alpha \to 1} (1 - \alpha) J^*_\alpha(x) \le \inf_\pi \limsup_{N \to \infty} \frac{1}{N} V^{\pi}_N(x) = J^*_+(x).$$

Proposition 3. The optimal average cost function $\tilde J^*(x)$ of the modified MDP satisfies

$$\tilde J^*(x) \le J^*_-(x), \qquad \forall x \in X.$$

Proof: Let $V^*_N(x)$ and $\tilde V^*_N(x)$ be the optimal N-stage cost function of the original POMDP, and of the modified belief MDP, respectively. By Prop. 2 in Section 2.2.1, we have $\tilde V^*_N(x) \le V^*_N(x)$, $\forall N$. Thus

$$\tilde J^*(x) = \liminf_{N \to \infty} \frac{1}{N} \tilde V^*_N(x) \le \liminf_{N \to \infty} \frac{1}{N} V^*_N(x) \le J^*_-(x).$$

Next we give a simple upper bound on $J^*_+(\cdot)$.

Theorem 2. The optimal liminf and limsup average cost functions satisfy

$$\tilde J^*(x) \le J^*_-(x) \le J^*_+(x) \le \max_{\tilde x \in C} \tilde J^*(\tilde x) + \delta,$$

where

$$\delta = \max_{x \in X} \Big[ (T \tilde h)(x) - \tilde J^*(x) - \tilde h(x) \Big],$$

and $\tilde J^*(x)$, $\tilde h(x)$ and C are defined as in the modified MDP.

This statement is a consequence of the following lemma, whose proof, omitted here, follows by bounding the expected cost per stage in the summation of the N-stage cost.

Lemma 2. Let $J(x)$ and $h(x)$ be any bounded functions on X, and $\mu$ be any stationary policy. Define constants $\delta^+$ and $\delta^-$ by

$$\delta^+ = \max_{x \in X} \Big[ \bar g_{\mu(x)}(x) + E_{\tilde x|x,\mu(x)}\{ h(\tilde x) \} - J(x) - h(x) \Big],$$
$$\delta^- = \min_{x \in X} \Big[ \bar g_{\mu(x)}(x) + E_{\tilde x|x,\mu(x)}\{ h(\tilde x) \} - J(x) - h(x) \Big].$$

Then $V^{\mu}_N(x)$, the N-stage cost of executing policy $\mu$, satisfies

$$\beta^-(x) + \delta^- \le \liminf_{N \to \infty} \frac{1}{N} V^{\mu}_N(x) \le \limsup_{N \to \infty} \frac{1}{N} V^{\mu}_N(x) \le \beta^+(x) + \delta^+, \qquad \forall x \in X,$$

where $\beta^{\pm}(x)$ are defined by

$$\beta^+(x) = \max_{\tilde x \in D^{\mu}_x} J(\tilde x), \qquad \beta^-(x) = \min_{\tilde x \in D^{\mu}_x} J(\tilde x),$$

and $D^{\mu}_x$ denotes the set of beliefs reachable under policy $\mu$ from x.

Let $\tilde\mu^*$ be the stationary policy that is optimal for the modified MDP. We can use Lemma 2 to bound the liminf and limsup average cost of $\tilde\mu^*$ in the original POMDP. For example, if the optimal average cost $J^*_{MDP}$ of the completely observable MDP problem equals the constant $\lambda$ over all states, then we also have $\tilde J^*(x) = \lambda$, $\forall x \in X$, for this modified MDP. The cost of executing the policy $\tilde\mu^*$ in the original POMDP can therefore be bounded by

$$\lambda + \delta^- \le \liminf_{N \to \infty} \frac{1}{N} V^{\tilde\mu^*}_N(x) \le \limsup_{N \to \infty} \frac{1}{N} V^{\tilde\mu^*}_N(x) \le \lambda + \delta^+.$$

The quantities $\delta^+$ and $\delta^-$ can be hard to calculate exactly in general, since $\tilde J^*(\cdot)$ and $\tilde h(\cdot)$ obtained from the modified MDP are piecewise linear functions. The bounds may also be loose. On the other hand, these functions may indicate the structure of the original problem, and help us to refine the discretization scheme in the approximation.

3.3 ANALYSIS OF ASYMPTOTIC CONVERGENCE

Let $(G, \gamma)$ be an $\epsilon$-discretization scheme, and $\tilde J^*_\epsilon$ and $\tilde J^*_{\alpha,\epsilon}$ be the optimal average cost and discounted cost, respectively, in the modified MDP associated with $(G, \gamma)$ and either $\tilde T_{D_1}$ or $\tilde T_{D_2}$. Recall that in the discounted case (Theorem 1) for a fixed discount factor $\alpha$, we have asymptotic convergence to optimality:

$$\lim_{\epsilon \to 0} \tilde J^*_{\alpha,\epsilon}(x) = J^*_\alpha(x).$$

We now address the question whether $\tilde J^*_\epsilon(x) \to J^*(x)$, as $\epsilon \to 0$, when $J^*(x) = J^*_-(x) = J^*_+(x)$ exists. This question of asymptotic convergence under the average cost criterion is hard to tackle for a couple of reasons. First of all, it is not clear when $J^*(x)$ exists.
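(Fernández-Gaucherand, Arapostathis, and Marcus, 1991) have shown that under certain conditions (such as the condition that $|J^*_\alpha(x) - J^*_\alpha(\bar x)|$ is bounded for all $\alpha \in [0, 1)$, and its relaxed variants), the optimal average cost $J^*(x)$ exists and equals a constant $\lambda$ over X, and furthermore

$$\lambda = \lim_{\alpha \to 1} (1 - \alpha) J^*_\alpha(x), \qquad \forall x \in X. \tag{12}$$

However, even when Eq. (12) holds, in general we have

$$\lim_{\epsilon \to 0} \tilde J^*_\epsilon(x) = \lim_{\epsilon \to 0} \lim_{\alpha \to 1} (1 - \alpha) \tilde J^*_{\alpha,\epsilon}(x) \ne \lim_{\alpha \to 1} \lim_{\epsilon \to 0} (1 - \alpha) \tilde J^*_{\alpha,\epsilon}(x) = \lambda.$$

To ensure that $\tilde J^*_\epsilon \to \lambda$, we therefore need stronger conditions than those that guarantee the existence of $\lambda$. We now show that a sufficient condition is the continuity of the optimal differential cost $h^*(\cdot)$.

Theorem 3. Suppose the average cost optimality equations (8) admit a bounded solution $(J^*(x), h^*(x))$ with $J^*(x)$ equal to a constant $\lambda$. Then, if the differential cost $h^*(x)$ is continuous on X, we have

$$\lim_{\epsilon \to 0} \tilde J^*_\epsilon(x) = \lambda, \qquad \forall x \in X,$$

and the convergence is uniform, where $\tilde J^*_\epsilon$ is the optimal average cost function for the modified MDP corresponding to either $\tilde T_{D_1}$ or $\tilde T_{D_2}$ with an associated $\epsilon$-discretization scheme $(G, \gamma)$.

Proof: Let $\tilde\mu^*$ be the optimal policy for the modified MDP associated with an $\epsilon$-discretization scheme. Let $\tilde T$ be the mapping corresponding to the modified MDP, defined by $(\tilde T v)(x) = \min_u [ \bar g_u(x) + \tilde E_{\tilde x|x,u}\{ v(\tilde x) \} ]$. Since $h^*(x)$ is continuous on X, by Lemma 1 in Section 2.2.2, we have that for any $\delta > 0$, there exists $\bar\epsilon > 0$ such that for all $\epsilon$-discretization schemes with $\epsilon < \bar\epsilon$,

$$| (T_{\tilde\mu^*} h^*)(x) - (\tilde T_{\tilde\mu^*} h^*)(x) | \le \delta. \tag{13}$$

We now apply the result of Lemma 2 in the modified MDP with $J = \lambda$, $h = h^*$, and $\mu = \tilde\mu^*$. That is, by the same argument as in Lemma 2, we have

$$\tilde J^*_\epsilon(x) = \liminf_{N \to \infty} \frac{1}{N} \tilde V^{\tilde\mu^*}_N(x) \ge \lambda + \eta, \qquad \forall x \in X,$$

where

$$\eta = \min_{x \in X} \Big[ (\tilde T_{\tilde\mu^*} h^*)(x) - \lambda - h^*(x) \Big].$$

Since $\lambda + h^*(x) = (T h^*)(x) \le (T_{\tilde\mu^*} h^*)(x)$, and $| (T_{\tilde\mu^*} h^*)(x) - (\tilde T_{\tilde\mu^*} h^*)(x) | \le \delta$ by Eq. (13), we have

$$(\tilde T_{\tilde\mu^*} h^*)(x) - \lambda - h^*(x) \ge -\delta.$$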

Hence $\eta \ge -\delta$, and $\tilde J^*_\epsilon(x) \ge \lambda - \delta$ for all $\epsilon < \bar\epsilon$ and $x \in X$, which proves the uniform convergence of $\tilde J^*_\epsilon$ to $\lambda$.

Note that the inequality $\tilde J^*_\epsilon \le J^*$ is crucial in the preceding proof. Note also that the proof does not generalize to the case when $J^*(x)$ is not constant.

A fairly strong sufficient condition that guarantees the existence of a constant $J^*$ and a continuous $h^*$ is that $J^*_\alpha(x)$ is equicontinuous on X for all $\alpha \in [0, 1)$. (For a proof see (Ross, 1968) or Theorem 6.3 (iv) in (Arapostathis et al., 1993).)

4 PRELIMINARY EXPERIMENTS

We demonstrate our approach on a set of toy problems: Paint, Bridge-repair, and Shuttle. The sizes of the problems are summarized in Table 1. Their descriptions and parameters are as specified in A. Cassandra's POMDP File Repository, and we define costs to be negative rewards when a problem has a reward model.

Table 1: Sizes of Problems (columns: Problem, |S|, |U|, |Z|; rows: Paint, Bridge, Shuttle)

We used some simple grid patterns. One pattern, referred to as k-E, consists of k grid points on each edge, in addition to the vertices of the belief simplex. Another pattern, referred to as n-R, consists of n randomly chosen grid points, in addition to the vertices of the simplex. The combined pattern is referred to as k-E+n-R. Thus the grid pattern for the QMDP approximation is 0-E, for instance, and 2-E+10-R is a combined pattern. The grid pattern then induces a partition of the belief space and a convex representation (interpolation) scheme, which we kept implicit and computed by linear programming on-line.

The algorithm for solving the modified finite-state MDP was implemented by solving a system of linear equations for each policy iteration. This may not be the most efficient way. No higher than 5-discount optimal policies were computed when the number of supporting points became large.

Figure 1 shows the average cost approximation of $\tilde T_{D_1}$ and $\tilde T_{D_2}$ with a few grid patterns for the problem Paint. In all cases we obtained a constant average cost for the modified MDP.
The horizontal axis is labeled by the grid pattern, and the vertical axis is the approximate cost. The red curve is obtained by $\tilde T_{D_1}$, and the blue curve by $\tilde T_{D_2}$.

Figure 1: Average Cost Approximation for Problem Paint Using Various Grid Patterns (horizontal axis: 0-E, 1-E, 2-E+10-R, 3-E, 3-E+10-R, 3-E+100-R, 4-E). Blue: $\tilde T_{D_2}$, Red: $\tilde T_{D_1}$.

As will be shown below, the approximation obtained by $\tilde T_{D_2}$ with 3-E is already near optimal. The policies generated by $\tilde T_{D_2}$ are not always better, however. We also notice, as indicated by the drop in the curves when using grid pattern 4-E, that the improvement of cost approximation does not solely depend on the number of grid points, but also on where they are positioned.

In Table 2 we summarize the cost approximations obtained (column LB) and the simulated cost of the policies (column S. Policy) for the three problems. The approximation schemes obtaining LB values in Table 2, as well as the policies simulated, are listed in Table 3. The column N. UB shows the numerically computed upper bound of the optimal: we calculate $\delta$ in Theorem 2 by sampling the values of $(T \tilde h)(x) - \tilde h(x) - \tilde J(x)$ at hundreds of beliefs generated randomly and taking the maximum over them. Thus the N. UB values are under-estimates of the exact upper bound. For both Paint and Shuttle the number of trajectories simulated is 60, and for Bridge 1000. Each trajectory has 500 steps starting from the same belief. The first number in S. Policy in Table 2 is the mean over the average cost of simulated trajectories, and the standard error listed as the second number is estimated from bootstrap samples: we created 100 pseudo-random samples by sampling from the empirical distribution of the original sample and computed the standard deviation of the mean estimator over these 100 pseudo-random samples.

Table 2: Average Cost Approximations and Simulated Average Cost of Policies

Problem    LB    N. UB    S. Policy
Paint                     ±0.002
Bridge                    ±.258
Shuttle                   ±0.007

Table 3: Approximation Schemes in LB and Simulated Policies in Table 2

Problem    LB                              S. Policy
Paint      $\tilde T_{D_2}$ w/ 3-E         $\tilde T_{D_1}$ w/ 1-E
Bridge     $\tilde T_{D_2}$ w/ 0-E         $\tilde T_{D_2}$ w/ 0-E
Shuttle    $\tilde T_{D_{1,2}}$ w/ 2-E     $\tilde T_{D_1}$ w/ 2-E

As shown in Table 2, we find that some policy from the discretized approximation with very coarse grids can already be comparable to the optimal. This is verified by simulating the policy (S. Policy) and comparing its average cost against the lower bound of the optimal (LB), which in turn shows that the lower approximation is near optimal.

We find that in some cases the upper bounds may be too loose to be informative. For example, in the problem Paint we know that there is a simple policy achieving zero average cost, therefore a near-zero upper bound does not tell much about the optimal. In the experiments we also observe that an approximation scheme with more grid points does not necessarily provide a better upper bound of the optimal.

5 CONCLUSION

In this paper we have proposed a discretized lower approximation approach for POMDP with average cost. We have shown that the approximations can be computed efficiently using multi-chain algorithms for finite-state MDP, and they can be used for bounding the optimal liminf and limsup average cost functions, as well as generating suboptimal policies. Thus, like the finite state controller approach, our approach also bypasses the difficult analytic questions such as the existence of bounded solutions to the average cost optimality equations. We have also introduced a new lower approximation scheme for both discounted and average cost cases, and shown asymptotic convergence of two main approximation schemes in the average cost case under certain conditions.

Acknowledgements

This work is supported by NSF Grant ECS. We thank Leslie Kaelbling for helpful discussions.

References

Aberdeen, D. and J. Baxter (2002). Internal-state policy-gradient algorithms for infinite-horizon POMDPs. Technical report, RSISE, Australian National University.

Arapostathis, A., V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh, and S. I. Marcus (1993). Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control and Optimization 31(2).

Bertsekas, D. P. (2001). Dynamic Programming and Optimal Control, Vols. I, II. Athena Scientific, second edition.

Fernández-Gaucherand, E., A. Arapostathis, and S. I. Marcus (1991). On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes. Ann. Operations Research 29.

Littman, M. L., A. R. Cassandra, and L. P. Kaelbling (1995). Learning policies for partially observable environments: Scaling up. In Int. Conf. Machine Learning.

Lovejoy, W. S. (1991). Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1).

Ormoneit, D. and P. Glynn (2002). Kernel-based reinforcement learning in average-cost problems. IEEE Trans. Automatic Control 47(10).

Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc.

Ross, S. M. (1968). Arbitrary state Markovian decision processes. Ann. Mathematical Statistics 39(6).

Yu, H. and D. P. Bertsekas (2004). Inequalities and their applications in value approximation for discounted and average cost POMDP. LIDS tech. report, MIT. To appear.

Zhang, N. L. and W. Liu (1997). A model approximation scheme for planning in partially observable stochastic domains. J. Artificial Intelligence Research 7.

Zhou, R. and E. A. Hansen (2001). An improved grid-based approximation algorithm for POMDPs. In Int. J. Conf. Artificial Intelligence.
